Speedup of relation deletes during recovery

Started by Fujii Masaoalmost 8 years ago38 messages

masao.fujii@gmail.com

almost 8 years ago

1 attachment(s)

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Regards,

--
Fujii Masao

Attachments:

speedup_relation_deletes_during_recovery_v1.patchapplication/octet-stream; name=speedup_relation_deletes_during_recovery_v1.patchDownload

*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 5641,5646 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
--- 5641,5648 ----
  	/* Make sure files supposed to be dropped are dropped */
  	if (parsed->nrels > 0)
  	{
+ 		SMgrRelation *srels = NULL;
+ 
  		/*
  		 * First update minimum recovery point to cover this WAL record. Once
  		 * a relation is deleted, there's no going back. The buffer manager
***************
*** 5658,5663 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
--- 5660,5666 ----
  		 */
  		XLogFlush(lsn);
  
+ 		srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
  		for (i = 0; i < parsed->nrels; i++)
  		{
  			SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
***************
*** 5665,5673 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
  
  			for (fork = 0; fork <= MAX_FORKNUM; fork++)
  				XLogDropRelation(parsed->xnodes[i], fork);
! 			smgrdounlink(srel, true);
! 			smgrclose(srel);
  		}
  	}
  
  	/*
--- 5668,5681 ----
  
  			for (fork = 0; fork <= MAX_FORKNUM; fork++)
  				XLogDropRelation(parsed->xnodes[i], fork);
! 			srels[i] = srel;
  		}
+ 
+ 		smgrdounlinkall(srels, parsed->nrels, true);
+ 
+ 		for (i = 0; i < parsed->nrels; i++)
+ 			smgrclose(srels[i]);
+ 		pfree(srels);
  	}
  
  	/*
***************
*** 5708,5713 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
--- 5716,5722 ----
  {
  	int			i;
  	TransactionId max_xid;
+ 	SMgrRelation *srels = NULL;
  
  	Assert(TransactionIdIsValid(xid));
  
***************
*** 5771,5776 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
--- 5780,5786 ----
  	}
  
  	/* Make sure files supposed to be dropped are dropped */
+ 	srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
  	for (i = 0; i < parsed->nrels; i++)
  	{
  		SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
***************
*** 5778,5786 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
  
  		for (fork = 0; fork <= MAX_FORKNUM; fork++)
  			XLogDropRelation(parsed->xnodes[i], fork);
! 		smgrdounlink(srel, true);
! 		smgrclose(srel);
  	}
  }
  
  void
--- 5788,5801 ----
  
  		for (fork = 0; fork <= MAX_FORKNUM; fork++)
  			XLogDropRelation(parsed->xnodes[i], fork);
! 		srels[i] = srel;
  	}
+ 
+ 	smgrdounlinkall(srels, parsed->nrels, true);
+ 
+ 	for (i = 0; i < parsed->nrels; i++)
+ 		smgrclose(srels[i]);
+ 	pfree(srels);
  }
  
  void

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 8 years ago

In reply to: Fujii Masao (#1)

Re: Speedup of relation deletes during recovery

Hello.

At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

It is obviously a left-over of 279628a0a7. This patch applies the
same change with the patch and looks fine for me. Note that
FinishPreparedTransaction has the same loop over smgrdounlink.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Kyotaro HORIGUCHI (#2)

Re: Speedup of relation deletes during recovery

On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote:

At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

It is obviously a left-over of 279628a0a7. This patch applies the
same change with the patch and looks fine for me. Note that
FinishPreparedTransaction has the same loop over smgrdounlink.

Yeah, I was just going to say the same after looking at Fujii-san's
patch. This would also cause smgrdounlink() to become unused in the
core code. So this could just be... Removed?

/me Preparing shelter against incoming wraith.

That's not v11 material at this point in any case.
--
Michael

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 8 years ago

In reply to: Fujii Masao (#1)

RE: Speedup of relation deletes during recovery

From: Fujii Masao [mailto:masao.fujii@gmail.com]

When multiple relations are deleted at the same transaction, the files of
those relations are deleted by one call to smgrdounlinkall(), which leads
to scan whole shared_buffers only one time. OTOH, during recovery,
smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much especially
when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Nice catch. As Horiguchi-san and Michael already commented, the patch looks good.

As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and possibly TRUNCATE as well.) How should we proceed with this?

/messages/by-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23

Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE scans the shared buffers once for each table, TRUNCATE does that for each fork, resulting in three scans per table. How about changing the following functions

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
BlockNumber firstDelBlock)

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
BlockNumber *firstDelBlock, int nforks)

to perform the scan only once per table? If there doesn't seem to be a problem, I think I'll submit the patch next month. I'd welcome if anyone could do that.

Regards
Takayuki Tsunakawa

Jerry Sievers

gsievers19@comcast.net

almost 8 years ago

In reply to: Fujii Masao (#1)

Re: Speedup of relation deletes during recovery

Fujii Masao <masao.fujii@gmail.com> writes:

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

Wonder if this is the case for streaming standbys replaying truncates
also?

Just couple days ago I was running a pg_upgrade test scenario but did
not reach the point of upgrade yet.

We made snapshots of our large reporting system putting them into a
master and 1-slave streaming replication configuration.

There are hundreds of unlogged tables that we wish to trunc before the
upgrade to save time during the rsyncing of standbys, since a standard
rsync replicates all data which is timeconsumeing and useless sending to
the standbys.

Indeed the tables are already trunc'd on the test master since it too
was a recovered system upon initial start but the truncates took place
anyway since it's part of our framework. It ran in just a few seconds
on master.

The standby however was $slow replaying these truncates which we noticed
because the upgrade wil not proceed until master and 1 or more standbys
are confirmed all at same checkpoint.

I straced the standby's startup process to find it unlinking lots of
tables, getting -1 on the unlink syscall since the non-init fork files
were already missing.

I can't describe just how slow if was but took minutes and the lines
being output by strace were *not* blowing up my display as happens
generally when any busy process is straced.

And, the test systems were config'd with $large shared buffers of 64G.

Please advise

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Regards,

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

Jerry Sievers

gsievers19@comcast.net

almost 8 years ago

In reply to: Fujii Masao (#1)

Re: Speedup of relation deletes during recovery

Fujii Masao <masao.fujii@gmail.com> writes:

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

Forgot to mention version in my TLDR prev reply :-)

version
------------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu (Ubuntu 9.5.12-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Regards,

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 8 years ago

In reply to: Jerry Sievers (#5)

RE: Speedup of relation deletes during recovery

From: Jerry Sievers [mailto:gsievers19@comcast.net]

Wonder if this is the case for streaming standbys replaying truncates
also?

Yes, As I wrote in my previous mail, TRUNCATE is worse than DROP TABLE.

Regards
Takayuki Tsunakawa

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Michael Paquier (#3)

3 attachment(s)

Re: Speedup of relation deletes during recovery

On Fri, Mar 30, 2018 at 11:46 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote:

At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

It is obviously a left-over of 279628a0a7. This patch applies the
same change with the patch and looks fine for me. Note that
FinishPreparedTransaction has the same loop over smgrdounlink.

Thanks for the review! I also changed FinishPreparedTransaction() so that
it uses smgrdounlinkall(). Patch attached.

Yeah, I was just going to say the same after looking at Fujii-san's
patch. This would also cause smgrdounlink() to become unused in the
core code. So this could just be... Removed?

Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
Per the discussion [1]/messages/by-id/1471.1339106082@sss.pgh.pa.us, this unused function was left intentionally.
But it's still dead code since 2012, so I'd like to remove it. Patch attached.

[1]: /messages/by-id/1471.1339106082@sss.pgh.pa.us
/messages/by-id/1471.1339106082@sss.pgh.pa.us

Regards,

--
Fujii Masao

Attachments:

speedup_relation_deletes_during_recovery_v2.patchapplication/octet-stream; name=speedup_relation_deletes_during_recovery_v2.patchDownload

*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1445,1450 **** FinishPreparedTransaction(const char *gid, bool isCommit)
--- 1445,1451 ----
  	int			ndelrels;
  	SharedInvalidationMessage *invalmsgs;
  	int			i;
+ 	SMgrRelation *srels = NULL;
  
  	/*
  	 * Validate the GID, and lock the GXACT to ensure that two backends do not
***************
*** 1534,1546 **** FinishPreparedTransaction(const char *gid, bool isCommit)
  		delrels = abortrels;
  		ndelrels = hdr->nabortrels;
  	}
  	for (i = 0; i < ndelrels; i++)
! 	{
! 		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
  
! 		smgrdounlink(srel, false);
! 		smgrclose(srel);
! 	}
  
  	/*
  	 * Handle cache invalidation messages.
--- 1535,1550 ----
  		delrels = abortrels;
  		ndelrels = hdr->nabortrels;
  	}
+ 
+ 	srels = palloc(sizeof(SMgrRelation) * ndelrels);
  	for (i = 0; i < ndelrels; i++)
! 		srels[i] = smgropen(delrels[i], InvalidBackendId);
  
! 	smgrdounlinkall(srels, ndelrels, false);
! 
! 	for (i = 0; i < ndelrels; i++)
! 		smgrclose(srels[i]);
! 	pfree(srels);
  
  	/*
  	 * Handle cache invalidation messages.
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 5640,5645 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
--- 5640,5647 ----
  	/* Make sure files supposed to be dropped are dropped */
  	if (parsed->nrels > 0)
  	{
+ 		SMgrRelation *srels = NULL;
+ 
  		/*
  		 * First update minimum recovery point to cover this WAL record. Once
  		 * a relation is deleted, there's no going back. The buffer manager
***************
*** 5657,5662 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
--- 5659,5665 ----
  		 */
  		XLogFlush(lsn);
  
+ 		srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
  		for (i = 0; i < parsed->nrels; i++)
  		{
  			SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
***************
*** 5664,5672 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
  
  			for (fork = 0; fork <= MAX_FORKNUM; fork++)
  				XLogDropRelation(parsed->xnodes[i], fork);
! 			smgrdounlink(srel, true);
! 			smgrclose(srel);
  		}
  	}
  
  	/*
--- 5667,5680 ----
  
  			for (fork = 0; fork <= MAX_FORKNUM; fork++)
  				XLogDropRelation(parsed->xnodes[i], fork);
! 			srels[i] = srel;
  		}
+ 
+ 		smgrdounlinkall(srels, parsed->nrels, true);
+ 
+ 		for (i = 0; i < parsed->nrels; i++)
+ 			smgrclose(srels[i]);
+ 		pfree(srels);
  	}
  
  	/*
***************
*** 5707,5712 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
--- 5715,5721 ----
  {
  	int			i;
  	TransactionId max_xid;
+ 	SMgrRelation *srels = NULL;
  
  	Assert(TransactionIdIsValid(xid));
  
***************
*** 5770,5775 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
--- 5779,5785 ----
  	}
  
  	/* Make sure files supposed to be dropped are dropped */
+ 	srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
  	for (i = 0; i < parsed->nrels; i++)
  	{
  		SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
***************
*** 5777,5785 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
  
  		for (fork = 0; fork <= MAX_FORKNUM; fork++)
  			XLogDropRelation(parsed->xnodes[i], fork);
! 		smgrdounlink(srel, true);
! 		smgrclose(srel);
  	}
  }
  
  void
--- 5787,5800 ----
  
  		for (fork = 0; fork <= MAX_FORKNUM; fork++)
  			XLogDropRelation(parsed->xnodes[i], fork);
! 		srels[i] = srel;
  	}
+ 
+ 	smgrdounlinkall(srels, parsed->nrels, true);
+ 
+ 	for (i = 0; i < parsed->nrels; i++)
+ 		smgrclose(srels[i]);
+ 	pfree(srels);
  }
  
  void

remove_smgrdounlink_v1.patchapplication/octet-stream; name=remove_smgrdounlink_v1.patchDownload

*** a/src/backend/storage/smgr/smgr.c
--- b/src/backend/storage/smgr/smgr.c
***************
*** 399,460 **** smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
  }
  
  /*
-  *	smgrdounlink() -- Immediately unlink all forks of a relation.
-  *
-  *		All forks of the relation are removed from the store.  This should
-  *		not be used during transactional operations, since it can't be undone.
-  *
-  *		If isRedo is true, it is okay for the underlying file(s) to be gone
-  *		already.
-  *
-  *		This is equivalent to calling smgrdounlinkfork for each fork, but
-  *		it's significantly quicker so should be preferred when possible.
-  */
- void
- smgrdounlink(SMgrRelation reln, bool isRedo)
- {
- 	RelFileNodeBackend rnode = reln->smgr_rnode;
- 	int			which = reln->smgr_which;
- 	ForkNumber	forknum;
- 
- 	/* Close the forks at smgr level */
- 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- 		smgrsw[which].smgr_close(reln, forknum);
- 
- 	/*
- 	 * Get rid of any remaining buffers for the relation.  bufmgr will just
- 	 * drop them without bothering to write the contents.
- 	 */
- 	DropRelFileNodesAllBuffers(&rnode, 1);
- 
- 	/*
- 	 * It'd be nice to tell the stats collector to forget it immediately, too.
- 	 * But we can't because we don't know the OID (and in cases involving
- 	 * relfilenode swaps, it's not always clear which table OID to forget,
- 	 * anyway).
- 	 */
- 
- 	/*
- 	 * Send a shared-inval message to force other backends to close any
- 	 * dangling smgr references they may have for this rel.  We should do this
- 	 * before starting the actual unlinking, in case we fail partway through
- 	 * that step.  Note that the sinval message will eventually come back to
- 	 * this backend, too, and thereby provide a backstop that we closed our
- 	 * own smgr rel.
- 	 */
- 	CacheInvalidateSmgr(rnode);
- 
- 	/*
- 	 * Delete the physical file(s).
- 	 *
- 	 * Note: smgr_unlink must treat deletion failure as a WARNING, not an
- 	 * ERROR, because we've already decided to commit or abort the current
- 	 * xact.
- 	 */
- 	smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
- }
- 
- /*
   *	smgrdounlinkall() -- Immediately unlink all forks of all given relations
   *
   *		All forks of all given relations are removed from the store.  This
--- 399,404 ----
***************
*** 463,471 **** smgrdounlink(SMgrRelation reln, bool isRedo)
   *
   *		If isRedo is true, it is okay for the underlying file(s) to be gone
   *		already.
-  *
-  *		This is equivalent to calling smgrdounlink for each relation, but it's
-  *		significantly quicker so should be preferred when possible.
   */
  void
  smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
--- 407,412 ----
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 89,95 **** extern void smgrclose(SMgrRelation reln);
  extern void smgrcloseall(void);
  extern void smgrclosenode(RelFileNodeBackend rnode);
  extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
- extern void smgrdounlink(SMgrRelation reln, bool isRedo);
  extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
  extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo);
  extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--- 89,94 ----

remove_smgrdounlinkfork_v1.patchapplication/octet-stream; name=remove_smgrdounlinkfork_v1.patchDownload

*** a/src/backend/storage/smgr/smgr.c
--- b/src/backend/storage/smgr/smgr.c
***************
*** 477,534 **** smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
  }
  
  /*
-  *	smgrdounlinkfork() -- Immediately unlink one fork of a relation.
-  *
-  *		The specified fork of the relation is removed from the store.  This
-  *		should not be used during transactional operations, since it can't be
-  *		undone.
-  *
-  *		If isRedo is true, it is okay for the underlying file to be gone
-  *		already.
-  */
- void
- smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo)
- {
- 	RelFileNodeBackend rnode = reln->smgr_rnode;
- 	int			which = reln->smgr_which;
- 
- 	/* Close the fork at smgr level */
- 	smgrsw[which].smgr_close(reln, forknum);
- 
- 	/*
- 	 * Get rid of any remaining buffers for the fork.  bufmgr will just drop
- 	 * them without bothering to write the contents.
- 	 */
- 	DropRelFileNodeBuffers(rnode, forknum, 0);
- 
- 	/*
- 	 * It'd be nice to tell the stats collector to forget it immediately, too.
- 	 * But we can't because we don't know the OID (and in cases involving
- 	 * relfilenode swaps, it's not always clear which table OID to forget,
- 	 * anyway).
- 	 */
- 
- 	/*
- 	 * Send a shared-inval message to force other backends to close any
- 	 * dangling smgr references they may have for this rel.  We should do this
- 	 * before starting the actual unlinking, in case we fail partway through
- 	 * that step.  Note that the sinval message will eventually come back to
- 	 * this backend, too, and thereby provide a backstop that we closed our
- 	 * own smgr rel.
- 	 */
- 	CacheInvalidateSmgr(rnode);
- 
- 	/*
- 	 * Delete the physical file(s).
- 	 *
- 	 * Note: smgr_unlink must treat deletion failure as a WARNING, not an
- 	 * ERROR, because we've already decided to commit or abort the current
- 	 * xact.
- 	 */
- 	smgrsw[which].smgr_unlink(rnode, forknum, isRedo);
- }
- 
- /*
   *	smgrextend() -- Add a new block to a file.
   *
   *		The semantics are nearly the same as smgrwrite(): write at the
--- 477,482 ----
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 90,96 **** extern void smgrcloseall(void);
  extern void smgrclosenode(RelFileNodeBackend rnode);
  extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
  extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
- extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo);
  extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum, char *buffer, bool skipFsync);
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
--- 90,95 ----

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Tsunakawa, Takayuki (#4)

Re: Speedup of relation deletes during recovery

On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: Fujii Masao [mailto:masao.fujii@gmail.com]

When multiple relations are deleted at the same transaction, the files of
those relations are deleted by one call to smgrdounlinkall(), which leads
to scan whole shared_buffers only one time. OTOH, during recovery,
smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much especially
when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Nice catch. As Horiguchi-san and Michael already commented, the patch looks good.

As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and possibly TRUNCATE as well.) How should we proceed with this?

/messages/by-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23

Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE scans the shared buffers once for each table, TRUNCATE does that for each fork, resulting in three scans per table. How about changing the following functions

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
BlockNumber firstDelBlock)

to

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
BlockNumber *firstDelBlock, int nforks)

to perform the scan only once per table? If there doesn't seem to be a problem, I think I'll submit the patch next month. I'd welcome if anyone could do that.

Yeah, it's worth working on this problem. To decrease the number of scans of
shared_buffers, you would need to change the order of truncations of files and
WAL logging. In RelationTruncate(), currently WAL is logged after FSM and VM
are truncated. IOW, with the patch, FSM and VM would need to be truncated after
WAL logging. You would need to check whether this reordering is valid.

Regards,

--
Fujii Masao

#10

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Fujii Masao (#8)

Re: Speedup of relation deletes during recovery

On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote:

Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
Per the discussion [1], this unused function was left intentionally.
But it's still dead code since 2012, so I'd like to remove it. Patch attached.

Indeed, it's close to six years and the status is the same. So let's
drop it. I have been surrounding the area to see if any modules
actually use those, particularly on github, but I could not find
callers.

The patch looks logically fine to me. In your first message, you
mentioned that the replay time increased a lot. Do you have numbers to
share with some large settings of shared_buffers?

It would be better to wait for v12 branch to open before pushing
anything, as the focus is on stabililizing things on v11.
--
Michael

#11

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Michael Paquier (#10)

Re: Speedup of relation deletes during recovery

On Wed, Apr 18, 2018 at 10:44 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote:

Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
Per the discussion [1], this unused function was left intentionally.
But it's still dead code since 2012, so I'd like to remove it. Patch attached.

Indeed, it's close to six years and the status is the same. So let's
drop it. I have been surrounding the area to see if any modules
actually use those, particularly on github, but I could not find
callers.

The patch looks logically fine to me. In your first message, you
mentioned that the replay time increased a lot. Do you have numbers to
share with some large settings of shared_buffers?

No. But after my colleague truncated more than one hundred tables on
the server with shared_buffers = 300GB, the recovery could not finish
even after 10 minutes since the startup of the recovery. So I had to
shutdown the server immediately, set shared_buffers to very small
temporarily and start the server to cause the recovery to finish soon.

It would be better to wait for v12 branch to open before pushing
anything, as the focus is on stabililizing things on v11.

Yes, so I added this to next CommitFest.

Regards,

--
Fujii Masao

#12

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Fujii Masao (#11)

Re: Speedup of relation deletes during recovery

On Thu, Apr 19, 2018 at 01:52:26AM +0900, Fujii Masao wrote:

No. But after my colleague truncated more than one hundred tables on
the server with shared_buffers = 300GB, the recovery could not finish
even after 10 minutes since the startup of the recovery. So I had to
shutdown the server immediately, set shared_buffers to very small
temporarily and start the server to cause the recovery to finish soon.

Hm, OK. Here are simple functions to create and drop many relations in
a single transaction:
create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
for i in 1..numtabs loop
query_string := 'create table tab_' || i::text || ' (a int);';
execute query_string;
end loop;
end;
$$ language plpgsql;

create or replace function drop_tables(numtabs int)
returns void as $$
declare query_string text;
begin
for i in 1..numtabs loop
query_string := 'drop table tab_' || i::text;
execute query_string;
end loop;
end;
$$ language plpgsql;

With a server running 8GB of shared buffers (you likely need to increase
max_locks_per_transaction) and 10k relations created and dropped, it
took 50 seconds to finish redo of roughly 4 segments:
2018-04-19 11:17:15 JST [7472]: [14-1] db=,user=,app=,client= LOG: redo
starts at 0/15DF2E8
2018-04-19 11:18:05 JST [7472]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E46160: wanted 24, got 0
2018-04-19 11:18:05 JST [7472]: [16-1] db=,user=,app=,client= LOG: redo
done at 0/4A7C6E

Then with your patch it took... Barely 1 second.
2018-04-19 11:20:33 JST [11690]: [14-1] db=,user=,app=,client= LOG:
redo starts at 0/15DF2E8
2018-04-19 11:20:34 JST [11690]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E299B0: wanted 24, got 0
2018-04-19 11:20:34 JST [11690]: [16-1] db=,user=,app=,client= LOG:
redo done at 0/4E29978

Looking at profiles with perf I can also see that 95% of the time is
spent in DropRelFileNodesAllBuffers which is where the startup process
spends most of its CPU. So HEAD is booh. And your patch is excellent
in this context.
--
Michael

#13

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

over 7 years ago

In reply to: Fujii Masao (#9)

RE: Speedup of relation deletes during recovery

From: Fujii Masao [mailto:masao.fujii@gmail.com]

Yeah, it's worth working on this problem. To decrease the number of scans
of
shared_buffers, you would need to change the order of truncations of files
and
WAL logging. In RelationTruncate(), currently WAL is logged after FSM and
VM
are truncated. IOW, with the patch, FSM and VM would need to be truncated
after
WAL logging. You would need to check whether this reordering is valid.

Sure, thank you for advice.

Takayuki Tsunakawa

#14

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Fujii Masao (#8)

Re: Speedup of relation deletes during recovery

Hi,

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very significantly.

--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1445,6 +1445,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
int			ndelrels;
SharedInvalidationMessage *invalmsgs;
int			i;
+	SMgrRelation *srels = NULL;

/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
@@ -1534,13 +1535,16 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
delrels = abortrels;
ndelrels = hdr->nabortrels;
}
+
+	srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-	{
-		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+		srels[i] = smgropen(delrels[i], InvalidBackendId);

-		smgrdounlink(srel, false);
-		smgrclose(srel);
-	}
+	smgrdounlinkall(srels, ndelrels, false);
+
+	for (i = 0; i < ndelrels; i++)
+		smgrclose(srels[i]);
+	pfree(srels);

This code is now duplicated three times - shouldn't we just add a
function that encapsulates dropping relations in a commit/abort record?

Greetings,

Andres Freund

#15

Teodor Sigaev

teodor@sigaev.ru

over 7 years ago

In reply to: Andres Freund (#14)

Re: Speedup of relation deletes during recovery

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very significantly.

+1, we faced with that too
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

#16

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Andres Freund (#14)

Re: Speedup of relation deletes during recovery

Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:

+
+	srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-	{
-		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+		srels[i] = smgropen(delrels[i], InvalidBackendId);

-		smgrdounlink(srel, false);
-		smgrclose(srel);
-	}
+	smgrdounlinkall(srels, ndelrels, false);
+
+	for (i = 0; i < ndelrels; i++)
+		smgrclose(srels[i]);
+	pfree(srels);

Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to? That's
a bit hacky, but afaict should work?

Greetings,

Andres Freund

#17

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Andres Freund (#16)

Re: Speedup of relation deletes during recovery

On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
+
+   srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-   {
-           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+           srels[i] = smgropen(delrels[i], InvalidBackendId);
-           smgrdounlink(srel, false);
-           smgrclose(srel);
-   }
+   smgrdounlinkall(srels, ndelrels, false);
+
+   for (i = 0; i < ndelrels; i++)
+           smgrclose(srels[i]);
+   pfree(srels);
Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to? That's
a bit hacky, but afaict should work?

The easier (but tricky) fix for that is to call smgrclose() for
each SMgrRelation in the reverse order. That is, we should do

for (i = ndelrels - 1; i >= 0; i--)
smgrclose(srels[i]);

instead of

for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);

Since add_to_unowned_list() always adds the entry into the head of
the list, each SMgrRelation entry is added into the "unowned" list in
descending order of its identifier, i.e., SMgrRelation entry with larger
identifier should be placed ahead of one with smaller identifier.
So if we calls remove_from_unowed_list() in descending order of
SMgrRelation entry's identifier, the performance of
remove_from_unowned_list()'s search should O(1). IOW, we can
immediately find the SMgrRelation entry to remove, at the head of
the list.

Regards,

--
Fujii Masao

#18

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Teodor Sigaev (#15)

Re: Speedup of relation deletes during recovery

On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very
significantly.

+1, we faced with that too

+1 to back-patch. As Horiguchi-san pointed out, this is basically
the fix for oversight of commit 279628a0a7, not new feature.

Regards,

--
Fujii Masao

#19

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Fujii Masao (#17)

Re: Speedup of relation deletes during recovery

On 2018-06-19 03:06:54 +0900, Fujii Masao wrote:

On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
+
+   srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-   {
-           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+           srels[i] = smgropen(delrels[i], InvalidBackendId);
-           smgrdounlink(srel, false);
-           smgrclose(srel);
-   }
+   smgrdounlinkall(srels, ndelrels, false);
+
+   for (i = 0; i < ndelrels; i++)
+           smgrclose(srels[i]);
+   pfree(srels);
Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to? That's
a bit hacky, but afaict should work?
The easier (but tricky) fix for that is to call smgrclose() for
each SMgrRelation in the reverse order. That is, we should do

for (i = ndelrels - 1; i >= 0; i--)
smgrclose(srels[i]);

instead of

for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);

We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned. I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

Greetings,

Andres Freund

#20

Thomas Munro

thomas.munro@enterprisedb.com

over 7 years ago

In reply to: Fujii Masao (#18)

Re: Speedup of relation deletes during recovery

On Tue, Jun 19, 2018 at 6:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very
significantly.

+1, we faced with that too

+1 to back-patch. As Horiguchi-san pointed out, this is basically
the fix for oversight of commit 279628a0a7, not new feature.

--
Thomas Munro
http://www.enterprisedb.com

#21

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Andres Freund (#19)

Re: Speedup of relation deletes during recovery

On 2018-06-18 11:13:47 -0700, Andres Freund wrote:

On 2018-06-19 03:06:54 +0900, Fujii Masao wrote:
On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
+
+   srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-   {
-           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+           srels[i] = smgropen(delrels[i], InvalidBackendId);
-           smgrdounlink(srel, false);
-           smgrclose(srel);
-   }
+   smgrdounlinkall(srels, ndelrels, false);
+
+   for (i = 0; i < ndelrels; i++)
+           smgrclose(srels[i]);
+   pfree(srels);
Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to? That's
a bit hacky, but afaict should work?
The easier (but tricky) fix for that is to call smgrclose() for
each SMgrRelation in the reverse order. That is, we should do

for (i = ndelrels - 1; i >= 0; i--)
smgrclose(srels[i]);

instead of

for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);
We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned. I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

On second thought: I think we should your approach in the back branches,
and something like I'm suggesting in master once open.

Greetings,

Andres Freund

#22

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Andres Freund (#21)

Re: Speedup of relation deletes during recovery

On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:

On 2018-06-18 11:13:47 -0700, Andres Freund wrote:

We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned. I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

On second thought: I think we should your approach in the back branches,
and something like I'm suggesting in master once open.

+1.  Let's also make sure that the removal of smgrdounlink and
smgrdounlinkfork happens only on master.
--
Michael

#23

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Michael Paquier (#22)

Re: Speedup of relation deletes during recovery

On 2018-06-21 14:40:58 +0900, Michael Paquier wrote:

On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:

On 2018-06-18 11:13:47 -0700, Andres Freund wrote:

We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned. I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

On second thought: I think we should your approach in the back branches,
and something like I'm suggesting in master once open.

+1. Let's also make sure that the removal of smgrdounlink and
smgrdounlinkfork happens only on master.

FWIW, I'd just skip removing them. They seem useful functionality and
don't hurt. I don't see the point in mucking around with them.

Greetings,

Andres Freund

#24

Thomas Munro

thomas.munro@enterprisedb.com

over 7 years ago

In reply to: Andres Freund (#23)

1 attachment(s)

Re: Speedup of relation deletes during recovery

On Fri, Jun 22, 2018 at 5:41 AM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-21 14:40:58 +0900, Michael Paquier wrote:

On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:

On 2018-06-18 11:13:47 -0700, Andres Freund wrote:

We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned. I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

On second thought: I think we should your approach in the back branches,
and something like I'm suggesting in master once open.

I think we should take the hint in the comments and make it O(1)
anyway. See attached draft patch.

+ SMgrRelation *srels = NULL;

There seems to be no reason to initialise the variable to NULL in the
three places where you do that. It is always assigned a value before
use.

There is probably a way to share a bit more code among those three
places, but that sort of refactoring should be for later.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Use-a-doubly-linked-list-for-unowned-relations.patchapplication/octet-stream; name=0001-Use-a-doubly-linked-list-for-unowned-relations.patchDownload

From 72a25761300d02af1aea4f21cc05170531da0b08 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 27 Jun 2018 11:18:01 +1200
Subject: [PATCH] Use a doubly-linked list for unowned relations.

Previously we used a singly-linked list, which required
remove_from_unowned_list() to perform an O(n) search.  Make it O(1).

Author: Thomas Munro
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
---
 src/backend/storage/smgr/smgr.c | 43 ++++++++++++++++-----------------
 src/include/storage/smgr.h      |  1 +
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 08f06bade25..ae6da4928fc 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -236,15 +236,15 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
 
 /*
  * add_to_unowned_list -- link an SMgrRelation onto the unowned list
- *
- * Check remove_from_unowned_list()'s comments for performance
- * considerations.
  */
 static void
 add_to_unowned_list(SMgrRelation reln)
 {
-	/* place it at head of the list (to make smgrsetowner cheap) */
+	/* place it at head of the list */
+	reln->prev_unowned_reln = NULL;
 	reln->next_unowned_reln = first_unowned_reln;
+	if (first_unowned_reln)
+		first_unowned_reln->prev_unowned_reln = reln;
 	first_unowned_reln = reln;
 }
 
@@ -253,30 +253,29 @@ add_to_unowned_list(SMgrRelation reln)
  *
  * If the reln is not present in the list, nothing happens.  Typically this
  * would be caller error, but there seems no reason to throw an error.
- *
- * In the worst case this could be rather slow; but in all the cases that seem
- * likely to be performance-critical, the reln being sought will actually be
- * first in the list.  Furthermore, the number of unowned relns touched in any
- * one transaction shouldn't be all that high typically.  So it doesn't seem
- * worth expending the additional space and management logic needed for a
- * doubly-linked list.
  */
 static void
 remove_from_unowned_list(SMgrRelation reln)
 {
-	SMgrRelation *link;
-	SMgrRelation cur;
+	SMgrRelation prev = reln->prev_unowned_reln;
+	SMgrRelation next = reln->next_unowned_reln;
 
-	for (link = &first_unowned_reln, cur = *link;
-		 cur != NULL;
-		 link = &cur->next_unowned_reln, cur = *link)
+	if (first_unowned_reln == reln)
+	{
+		/* It is at the head of the list. */
+		Assert(prev == NULL);
+		first_unowned_reln = next;
+		if (next)
+			next->prev_unowned_reln = NULL;
+		reln->next_unowned_reln = NULL;
+	}
+	else if (prev != NULL)
 	{
-		if (cur == reln)
-		{
-			*link = cur->next_unowned_reln;
-			cur->next_unowned_reln = NULL;
-			break;
-		}
+		/* It is in the list, but not at the head. */
+		prev->next_unowned_reln = next;
+		if (next)
+			next->prev_unowned_reln = prev;
+		reln->next_unowned_reln = reln->prev_unowned_reln = NULL;
 	}
 }
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 558e4d8518b..1bd978a7351 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -72,6 +72,7 @@ typedef struct SMgrRelationData
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
 	/* if unowned, list link in list of all unowned SMgrRelations */
+	struct SMgrRelationData *prev_unowned_reln;
 	struct SMgrRelationData *next_unowned_reln;
 } SMgrRelationData;
 
-- 
2.17.0

#25

Thomas Munro

thomas.munro@enterprisedb.com

over 7 years ago

In reply to: Thomas Munro (#24)

1 attachment(s)

Re: Speedup of relation deletes during recovery

On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I think we should take the hint in the comments and make it O(1)
anyway. See attached draft patch.

Alternatively, here is a shorter and sweeter dlist version (I did the
open-coded one thinking of theoretical back-patchability).

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Use-a-dlist-for-unowned-relations.patchapplication/octet-stream; name=0001-Use-a-dlist-for-unowned-relations.patchDownload

From c17bcea12ea0ffc94b65dd837db0e664743ff4de Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 27 Jun 2018 11:18:01 +1200
Subject: [PATCH] Use a dlist for unowned relations.

Previously we used an open-coded singly-linked list, which required
remove_from_unowned_list() to perform an O(n) search.  Make it O(1).

Author: Thomas Munro
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
---
 src/backend/storage/smgr/smgr.c | 44 +++++++++------------------------
 src/include/storage/smgr.h      |  3 ++-
 2 files changed, 14 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 08f06bade25..1388557c1ba 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
 
 #include "commands/tablespace.h"
+#include "lib/ilist.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/smgr.h"
@@ -82,7 +83,7 @@ static const int NSmgr = lengthof(smgrsw);
  */
 static HTAB *SMgrRelationHash = NULL;
 
-static SMgrRelation first_unowned_reln = NULL;
+static dlist_head unowned_relns;
 
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
@@ -150,7 +151,7 @@ smgropen(RelFileNode rnode, BackendId backend)
 		ctl.entrysize = sizeof(SMgrRelationData);
 		SMgrRelationHash = hash_create("smgr relation table", 400,
 									   &ctl, HASH_ELEM | HASH_BLOBS);
-		first_unowned_reln = NULL;
+		dlist_init(&unowned_relns);
 	}
 
 	/* Look up or create an entry */
@@ -236,16 +237,11 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
 
 /*
  * add_to_unowned_list -- link an SMgrRelation onto the unowned list
- *
- * Check remove_from_unowned_list()'s comments for performance
- * considerations.
  */
 static void
 add_to_unowned_list(SMgrRelation reln)
 {
-	/* place it at head of the list (to make smgrsetowner cheap) */
-	reln->next_unowned_reln = first_unowned_reln;
-	first_unowned_reln = reln;
+	dlist_push_head(&unowned_relns, &reln->unowned_node);
 }
 
 /*
@@ -253,31 +249,11 @@ add_to_unowned_list(SMgrRelation reln)
  *
  * If the reln is not present in the list, nothing happens.  Typically this
  * would be caller error, but there seems no reason to throw an error.
- *
- * In the worst case this could be rather slow; but in all the cases that seem
- * likely to be performance-critical, the reln being sought will actually be
- * first in the list.  Furthermore, the number of unowned relns touched in any
- * one transaction shouldn't be all that high typically.  So it doesn't seem
- * worth expending the additional space and management logic needed for a
- * doubly-linked list.
  */
 static void
 remove_from_unowned_list(SMgrRelation reln)
 {
-	SMgrRelation *link;
-	SMgrRelation cur;
-
-	for (link = &first_unowned_reln, cur = *link;
-		 cur != NULL;
-		 link = &cur->next_unowned_reln, cur = *link)
-	{
-		if (cur == reln)
-		{
-			*link = cur->next_unowned_reln;
-			cur->next_unowned_reln = NULL;
-			break;
-		}
-	}
+	dlist_delete(&reln->unowned_node);
 }
 
 /*
@@ -797,13 +773,17 @@ smgrpostckpt(void)
 void
 AtEOXact_SMgr(void)
 {
+	SMgrRelation reln;
+	dlist_mutable_iter iter;
+
 	/*
 	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
 	 * one from the list.
 	 */
-	while (first_unowned_reln != NULL)
+	dlist_foreach_modify(iter, &unowned_relns)
 	{
-		Assert(first_unowned_reln->smgr_owner == NULL);
-		smgrclose(first_unowned_reln);
+		reln = dlist_container(SMgrRelationData, unowned_node, iter.cur);
+		Assert(reln->smgr_owner == NULL);
+		smgrclose(reln);
 	}
 }
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 558e4d8518b..275fc6ce419 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -15,6 +15,7 @@
 #define SMGR_H
 
 #include "fmgr.h"
+#include "lib/ilist.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
 
@@ -72,7 +73,7 @@ typedef struct SMgrRelationData
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
 	/* if unowned, list link in list of all unowned SMgrRelations */
-	struct SMgrRelationData *next_unowned_reln;
+	dlist_node	unowned_node;
 } SMgrRelationData;
 
 typedef SMgrRelationData *SMgrRelation;
-- 
2.17.0

#26

Thomas Munro

thomas.munro@enterprisedb.com

over 7 years ago

In reply to: Thomas Munro (#25)

Re: Speedup of relation deletes during recovery

On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I think we should take the hint in the comments and make it O(1)
anyway. See attached draft patch.

Alternatively, here is a shorter and sweeter dlist version (I did the
open-coded one thinking of theoretical back-patchability).

... though, on second thoughts, the dlist version steam-rolls over the
possibility that it might not be in the list (mentioned in the
comments, though it's not immediately clear how that would happen).

On further reflection, on the basis that it's the most conservative
change, +1 for Fujii-san's close-in-reverse-order idea. We should
reconsider that data structure for 12; there doesn't seems to be a
good reason to carry all those comments warning about performance when
the O(1) version is shorter than the comments.

--
Thomas Munro
http://www.enterprisedb.com

#27

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Thomas Munro (#26)

Re: Speedup of relation deletes during recovery

On 2018-06-27 13:44:03 +1200, Thomas Munro wrote:

On further reflection, on the basis that it's the most conservative
change, +1 for Fujii-san's close-in-reverse-order idea. We should
reconsider that data structure for 12; there doesn't seems to be a
good reason to carry all those comments warning about performance when
the O(1) version is shorter than the comments.

Agreed on this. I like the dlist version more than the earlier one, so
let's fix it up, for v12+. But regardless I'd argue that we consider
disabling that infrastructure while in recovery - it's just unnecessary.

Greetings,

Andres Freund

#28

Thomas Munro

thomas.munro@enterprisedb.com

over 7 years ago

In reply to: Andres Freund (#27)

Re: Speedup of relation deletes during recovery

On Wed, Jun 27, 2018 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-27 13:44:03 +1200, Thomas Munro wrote:

On further reflection, on the basis that it's the most conservative
change, +1 for Fujii-san's close-in-reverse-order idea. We should
reconsider that data structure for 12; there doesn't seems to be a
good reason to carry all those comments warning about performance when
the O(1) version is shorter than the comments.

Agreed on this. I like the dlist version more than the earlier one, so
let's fix it up, for v12+. But regardless I'd argue that we consider
disabling that infrastructure while in recovery - it's just unnecessary.

I tested this patch using the functions that Michael provided + an
extra variant truncate_tables(), using the -v2 patch + the adjustment
to close in reverse order. I set shared_buffers to 4GB and
max_locks_per_transaction to 30000. I tested three CREATE/DROP code
paths, and also two different TRUNCATE paths:

SELECT create_tables(10000);
SELECT drop_tables(10000);

BEGIN;
SELECT create_tables(10000);
ROLLBACK;

SELECT create_tables(10000);
BEGIN;
SELECT drop_tables(10000);
PREPARE TRANSACTION 'x';
COMMIT PREPARED 'x';

SELECT create_tables(10000);
SELECT truncate_tables(10000);

In all these cases my startup process would eat 100% CPU for ~20
seconds, and with the patch this dropped to ~1-2 second (I didn't
measure precisely, I just watched htop) with the patch. I also tried
back-patching to 9.3 and the results were the same.

Based on this and positive feedback from other reviewers, I will mark
this "Ready for Committer" (meaning v2 + close-in-reverse-order). I
did have one cosmetic remark a few messages up (redundant variable
initialisation).

By the way, as noted by others already, there is still a way to reach
a horrible case with or without the patch:

BEGIN;
SELECT create_tables(10000);
SELECT truncate_tables(10000);

That's a different code path that eats a lot of CPU on the *primary*, because:

/*
* Normally, we need a transaction-safe truncation
here. However, if
* the table was either created in the current
(sub)transaction or has
* a new relfilenode in the current (sub)transaction,
then we can just
* truncate it in-place, because a rollback would
cause the whole
* table or the current physical file to be thrown away anyway.
*/
if (rel->rd_createSubid == mySubid ||
rel->rd_newRelfilenodeSubid == mySubid)
{
/* Immediate, non-rollbackable truncation is OK */
heap_truncate_one_rel(rel);
}
else

Without range-scannable buffer mapping (Andres's radix tree thing),
that bet doesn't work out too well when you do it more than once.
Hmm... we could just... not do that?

(Has anyone ever looked into a lazier approach to dropping buffers?)

--
Thomas Munro
http://www.enterprisedb.com

#29

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Thomas Munro (#28)

Re: Speedup of relation deletes during recovery

Hi,

On 2018-06-27 15:56:58 +1200, Thomas Munro wrote:

That's a different code path that eats a lot of CPU on the *primary*, because:

While that's obviously not great, I think it's far less dangerous than
the standby having worse complexity than the primary. There's an obvious
backpressure if commands on the primary take a while, but there's
normally none for commands with high complexity on the standby...

/*
* Normally, we need a transaction-safe truncation
here. However, if
* the table was either created in the current
(sub)transaction or has
* a new relfilenode in the current (sub)transaction,
then we can just
* truncate it in-place, because a rollback would
cause the whole
* table or the current physical file to be thrown away anyway.
*/
if (rel->rd_createSubid == mySubid ||
rel->rd_newRelfilenodeSubid == mySubid)
{
/* Immediate, non-rollbackable truncation is OK */
heap_truncate_one_rel(rel);
}
else

Without range-scannable buffer mapping (Andres's radix tree thing),
that bet doesn't work out too well when you do it more than once.
Hmm... we could just... not do that?

That'd probably hurt noticably too...

(Has anyone ever looked into a lazier approach to dropping buffers?)

What precisely are you thinking of? We kinda now do something lazy-ish
at EOXact...

Greetings,

Andres Freund

#30

Thomas Munro

thomas.munro@enterprisedb.com

over 7 years ago

In reply to: Andres Freund (#29)

Re: Speedup of relation deletes during recovery

On Wed, Jun 27, 2018 at 4:17 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-27 15:56:58 +1200, Thomas Munro wrote:

Without range-scannable buffer mapping (Andres's radix tree thing),
that bet doesn't work out too well when you do it more than once.
Hmm... we could just... not do that?

That'd probably hurt noticably too...

Allow the optimisation only once per transaction?

(Has anyone ever looked into a lazier approach to dropping buffers?)

What precisely are you thinking of? We kinda now do something lazy-ish
at EOXact...

I mean at the buffer level. If we did nothing at all, the problem
would be dirty buffers that you'd eventually try to write out to
non-existant files. What if... you just remembered recently dropped
relations, and then whenever writing dirty buffers (either because of
stealing or checkpointing) you could check if they belong to recently
dropped relations and just mark them clean? To garbage collect the
recently dropped list, you could somehow make use of the knowledge
that checkpoints and full clock hand cycles must drop all such
buffers. I'm not proposing anything, just musing and wondering if
anyone has looked into this sort of thing...

--
Thomas Munro
http://www.enterprisedb.com

#31

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Thomas Munro (#26)

1 attachment(s)

Re: Speedup of relation deletes during recovery

On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I think we should take the hint in the comments and make it O(1)
anyway. See attached draft patch.

Alternatively, here is a shorter and sweeter dlist version (I did the
open-coded one thinking of theoretical back-patchability).

... though, on second thoughts, the dlist version steam-rolls over the
possibility that it might not be in the list (mentioned in the
comments, though it's not immediately clear how that would happen).

On further reflection, on the basis that it's the most conservative
change, +1 for Fujii-san's close-in-reverse-order idea. We should
reconsider that data structure for 12

Attached is v3 patch which I implemented close-in-reverse-order idea
on v2.

Regards,

--
Fujii Masao

Attachments:

speedup_relation_deletes_during_recovery_v3.patchapplication/octet-stream; name=speedup_relation_deletes_during_recovery_v3.patchDownload

*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1457,1462 **** FinishPreparedTransaction(const char *gid, bool isCommit)
--- 1457,1463 ----
  	int			ndelrels;
  	SharedInvalidationMessage *invalmsgs;
  	int			i;
+ 	SMgrRelation *srels;
  
  	/*
  	 * Validate the GID, and lock the GXACT to ensure that two backends do not
***************
*** 1549,1561 **** FinishPreparedTransaction(const char *gid, bool isCommit)
  		delrels = abortrels;
  		ndelrels = hdr->nabortrels;
  	}
  	for (i = 0; i < ndelrels; i++)
! 	{
! 		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
  
! 		smgrdounlink(srel, false);
! 		smgrclose(srel);
! 	}
  
  	/*
  	 * Handle cache invalidation messages.
--- 1550,1571 ----
  		delrels = abortrels;
  		ndelrels = hdr->nabortrels;
  	}
+ 
+ 	srels = palloc(sizeof(SMgrRelation) * ndelrels);
  	for (i = 0; i < ndelrels; i++)
! 		srels[i] = smgropen(delrels[i], InvalidBackendId);
  
! 	smgrdounlinkall(srels, ndelrels, false);
! 
! 	/*
! 	 * Call smgrclose() in reverse order as when smgropen() is called.
! 	 * This trick enables remove_from_unowned_list() in smgrclose()
! 	 * to search the SMgrRelation from the unowned list,
! 	 * in O(1) performance.
! 	 */
! 	for (i = ndelrels - 1; i >= 0; i--)
! 		smgrclose(srels[i]);
! 	pfree(srels);
  
  	/*
  	 * Handle cache invalidation messages.
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 5618,5623 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
--- 5618,5625 ----
  	/* Make sure files supposed to be dropped are dropped */
  	if (parsed->nrels > 0)
  	{
+ 		SMgrRelation *srels;
+ 
  		/*
  		 * First update minimum recovery point to cover this WAL record. Once
  		 * a relation is deleted, there's no going back. The buffer manager
***************
*** 5635,5640 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
--- 5637,5643 ----
  		 */
  		XLogFlush(lsn);
  
+ 		srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
  		for (i = 0; i < parsed->nrels; i++)
  		{
  			SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
***************
*** 5642,5650 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
  
  			for (fork = 0; fork <= MAX_FORKNUM; fork++)
  				XLogDropRelation(parsed->xnodes[i], fork);
! 			smgrdounlink(srel, true);
! 			smgrclose(srel);
  		}
  	}
  
  	/*
--- 5645,5664 ----
  
  			for (fork = 0; fork <= MAX_FORKNUM; fork++)
  				XLogDropRelation(parsed->xnodes[i], fork);
! 			srels[i] = srel;
  		}
+ 
+ 		smgrdounlinkall(srels, parsed->nrels, true);
+ 
+ 		/*
+ 		 * Call smgrclose() in reverse order as when smgropen() is called.
+ 		 * This trick enables remove_from_unowned_list() in smgrclose()
+ 		 * to search the SMgrRelation from the unowned list,
+ 		 * in O(1) performance.
+ 		 */
+ 		for (i = parsed->nrels - 1; i >= 0; i--)
+ 			smgrclose(srels[i]);
+ 		pfree(srels);
  	}
  
  	/*
***************
*** 5685,5690 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
--- 5699,5705 ----
  {
  	int			i;
  	TransactionId max_xid;
+ 	SMgrRelation *srels;
  
  	Assert(TransactionIdIsValid(xid));
  
***************
*** 5748,5753 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
--- 5763,5769 ----
  	}
  
  	/* Make sure files supposed to be dropped are dropped */
+ 	srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
  	for (i = 0; i < parsed->nrels; i++)
  	{
  		SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
***************
*** 5755,5763 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
  
  		for (fork = 0; fork <= MAX_FORKNUM; fork++)
  			XLogDropRelation(parsed->xnodes[i], fork);
! 		smgrdounlink(srel, true);
! 		smgrclose(srel);
  	}
  }
  
  void
--- 5771,5790 ----
  
  		for (fork = 0; fork <= MAX_FORKNUM; fork++)
  			XLogDropRelation(parsed->xnodes[i], fork);
! 		srels[i] = srel;
  	}
+ 
+ 	smgrdounlinkall(srels, parsed->nrels, true);
+ 
+ 	/*
+ 	 * Call smgrclose() in reverse order as when smgropen() is called.
+ 	 * This trick enables remove_from_unowned_list() in smgrclose()
+ 	 * to search the SMgrRelation from the unowned list,
+ 	 * in O(1) performance.
+ 	 */
+ 	for (i = parsed->nrels - 1; i >= 0; i--)
+ 		smgrclose(srels[i]);
+ 	pfree(srels);
  }
  
  void

#32

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: Fujii Masao (#31)

Re: Speedup of relation deletes during recovery

On 2018-06-28 03:21:51 +0900, Fujii Masao wrote:

On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I think we should take the hint in the comments and make it O(1)
anyway. See attached draft patch.

Alternatively, here is a shorter and sweeter dlist version (I did the
open-coded one thinking of theoretical back-patchability).

... though, on second thoughts, the dlist version steam-rolls over the
possibility that it might not be in the list (mentioned in the
comments, though it's not immediately clear how that would happen).

On further reflection, on the basis that it's the most conservative
change, +1 for Fujii-san's close-in-reverse-order idea. We should
reconsider that data structure for 12

+1

Attached is v3 patch which I implemented close-in-reverse-order idea
on v2.

Regards,

--
Fujii Masao

--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1457,6 +1457,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
int			ndelrels;
SharedInvalidationMessage *invalmsgs;
int			i;
+	SMgrRelation *srels;

/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
@@ -1549,13 +1550,22 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
delrels = abortrels;
ndelrels = hdr->nabortrels;
}
+
+	srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-	{
-		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+		srels[i] = smgropen(delrels[i], InvalidBackendId);

-		smgrdounlink(srel, false);
-		smgrclose(srel);
-	}
+	smgrdounlinkall(srels, ndelrels, false);
+
+	/*
+	 * Call smgrclose() in reverse order as when smgropen() is called.
+	 * This trick enables remove_from_unowned_list() in smgrclose()
+	 * to search the SMgrRelation from the unowned list,
+	 * in O(1) performance.
+	 */
+	for (i = ndelrels - 1; i >= 0; i--)
+		smgrclose(srels[i]);
+	pfree(srels);

/*
* Handle cache invalidation messages.
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5618,6 +5618,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
/* Make sure files supposed to be dropped are dropped */
if (parsed->nrels > 0)
{
+		SMgrRelation *srels;
+
/*
* First update minimum recovery point to cover this WAL record. Once
* a relation is deleted, there's no going back. The buffer manager
@@ -5635,6 +5637,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
*/
XLogFlush(lsn);

+		srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
for (i = 0; i < parsed->nrels; i++)
{
SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
@@ -5642,9 +5645,20 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,

for (fork = 0; fork <= MAX_FORKNUM; fork++)
XLogDropRelation(parsed->xnodes[i], fork);
-			smgrdounlink(srel, true);
-			smgrclose(srel);
+			srels[i] = srel;
}
+
+		smgrdounlinkall(srels, parsed->nrels, true);
+
+		/*
+		 * Call smgrclose() in reverse order as when smgropen() is called.
+		 * This trick enables remove_from_unowned_list() in smgrclose()
+		 * to search the SMgrRelation from the unowned list,
+		 * in O(1) performance.
+		 */
+		for (i = parsed->nrels - 1; i >= 0; i--)
+			smgrclose(srels[i]);
+		pfree(srels);
}

/*
@@ -5685,6 +5699,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
TransactionId max_xid;
+ SMgrRelation *srels;

Assert(TransactionIdIsValid(xid));

@@ -5748,6 +5763,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
}

/* Make sure files supposed to be dropped are dropped */
+	srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
for (i = 0; i < parsed->nrels; i++)
{
SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
@@ -5755,9 +5771,20 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)

for (fork = 0; fork <= MAX_FORKNUM; fork++)
XLogDropRelation(parsed->xnodes[i], fork);
-		smgrdounlink(srel, true);
-		smgrclose(srel);
+		srels[i] = srel;
}
+
+	smgrdounlinkall(srels, parsed->nrels, true);
+
+	/*
+	 * Call smgrclose() in reverse order as when smgropen() is called.
+	 * This trick enables remove_from_unowned_list() in smgrclose()
+	 * to search the SMgrRelation from the unowned list,
+	 * in O(1) performance.
+	 */
+	for (i = parsed->nrels - 1; i >= 0; i--)
+		smgrclose(srels[i]);
+	pfree(srels);
}

void

Can you please move the repeated stanzy to a separate function?
Repeating the same code and comment three times isn't good.

Greetings,

Andres Freund

#33

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Andres Freund (#32)

1 attachment(s)

Re: Speedup of relation deletes during recovery

On Thu, Jun 28, 2018 at 3:23 AM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-28 03:21:51 +0900, Fujii Masao wrote:

On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I think we should take the hint in the comments and make it O(1)
anyway. See attached draft patch.

Alternatively, here is a shorter and sweeter dlist version (I did the
open-coded one thinking of theoretical back-patchability).

... though, on second thoughts, the dlist version steam-rolls over the
possibility that it might not be in the list (mentioned in the
comments, though it's not immediately clear how that would happen).

On further reflection, on the basis that it's the most conservative
change, +1 for Fujii-san's close-in-reverse-order idea. We should
reconsider that data structure for 12

+1

Attached is v3 patch which I implemented close-in-reverse-order idea
on v2.

Regards,

--
Fujii Masao

--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1457,6 +1457,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
int                     ndelrels;
SharedInvalidationMessage *invalmsgs;
int                     i;
+     SMgrRelation *srels;

/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
@@ -1549,13 +1550,22 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
delrels = abortrels;
ndelrels = hdr->nabortrels;
}
+
+     srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-     {
-             SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+             srels[i] = smgropen(delrels[i], InvalidBackendId);

-             smgrdounlink(srel, false);
-             smgrclose(srel);
-     }
+     smgrdounlinkall(srels, ndelrels, false);
+
+     /*
+      * Call smgrclose() in reverse order as when smgropen() is called.
+      * This trick enables remove_from_unowned_list() in smgrclose()
+      * to search the SMgrRelation from the unowned list,
+      * in O(1) performance.
+      */
+     for (i = ndelrels - 1; i >= 0; i--)
+             smgrclose(srels[i]);
+     pfree(srels);

/*
* Handle cache invalidation messages.
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5618,6 +5618,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
/* Make sure files supposed to be dropped are dropped */
if (parsed->nrels > 0)
{
+             SMgrRelation *srels;
+
/*
* First update minimum recovery point to cover this WAL record. Once
* a relation is deleted, there's no going back. The buffer manager
@@ -5635,6 +5637,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
*/
XLogFlush(lsn);

+             srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
for (i = 0; i < parsed->nrels; i++)
{
SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
@@ -5642,9 +5645,20 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,

for (fork = 0; fork <= MAX_FORKNUM; fork++)
XLogDropRelation(parsed->xnodes[i], fork);
-                     smgrdounlink(srel, true);
-                     smgrclose(srel);
+                     srels[i] = srel;
}
+
+             smgrdounlinkall(srels, parsed->nrels, true);
+
+             /*
+              * Call smgrclose() in reverse order as when smgropen() is called.
+              * This trick enables remove_from_unowned_list() in smgrclose()
+              * to search the SMgrRelation from the unowned list,
+              * in O(1) performance.
+              */
+             for (i = parsed->nrels - 1; i >= 0; i--)
+                     smgrclose(srels[i]);
+             pfree(srels);
}

/*
@@ -5685,6 +5699,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
{
int i;
TransactionId max_xid;
+ SMgrRelation *srels;

Assert(TransactionIdIsValid(xid));

@@ -5748,6 +5763,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
}

/* Make sure files supposed to be dropped are dropped */
+     srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
for (i = 0; i < parsed->nrels; i++)
{
SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
@@ -5755,9 +5771,20 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)

for (fork = 0; fork <= MAX_FORKNUM; fork++)
XLogDropRelation(parsed->xnodes[i], fork);
-             smgrdounlink(srel, true);
-             smgrclose(srel);
+             srels[i] = srel;
}
+
+     smgrdounlinkall(srels, parsed->nrels, true);
+
+     /*
+      * Call smgrclose() in reverse order as when smgropen() is called.
+      * This trick enables remove_from_unowned_list() in smgrclose()
+      * to search the SMgrRelation from the unowned list,
+      * in O(1) performance.
+      */
+     for (i = parsed->nrels - 1; i >= 0; i--)
+             smgrclose(srels[i]);
+     pfree(srels);
}

void

Can you please move the repeated stanzy to a separate function?
Repeating the same code and comment three times isn't good.

OK, so what about the attached patch?

Regards,

--
Fujii Masao

Attachments:

speedup_relation_deletes_during_recovery_v4.patchapplication/octet-stream; name=speedup_relation_deletes_during_recovery_v4.patchDownload

*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1456,1462 **** FinishPreparedTransaction(const char *gid, bool isCommit)
  	RelFileNode *delrels;
  	int			ndelrels;
  	SharedInvalidationMessage *invalmsgs;
- 	int			i;
  
  	/*
  	 * Validate the GID, and lock the GXACT to ensure that two backends do not
--- 1456,1461 ----
***************
*** 1549,1561 **** FinishPreparedTransaction(const char *gid, bool isCommit)
  		delrels = abortrels;
  		ndelrels = hdr->nabortrels;
  	}
- 	for (i = 0; i < ndelrels; i++)
- 	{
- 		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
  
! 		smgrdounlink(srel, false);
! 		smgrclose(srel);
! 	}
  
  	/*
  	 * Handle cache invalidation messages.
--- 1548,1556 ----
  		delrels = abortrels;
  		ndelrels = hdr->nabortrels;
  	}
  
! 	/* Make sure files supposed to be dropped are dropped */
! 	DropRelationFiles(delrels, ndelrels, false);
  
  	/*
  	 * Handle cache invalidation messages.
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 5516,5522 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
  				 RepOriginId origin_id)
  {
  	TransactionId max_xid;
- 	int			i;
  	TimestampTz commit_time;
  
  	Assert(TransactionIdIsValid(xid));
--- 5516,5521 ----
***************
*** 5635,5650 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
  		 */
  		XLogFlush(lsn);
  
! 		for (i = 0; i < parsed->nrels; i++)
! 		{
! 			SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
! 			ForkNumber	fork;
! 
! 			for (fork = 0; fork <= MAX_FORKNUM; fork++)
! 				XLogDropRelation(parsed->xnodes[i], fork);
! 			smgrdounlink(srel, true);
! 			smgrclose(srel);
! 		}
  	}
  
  	/*
--- 5634,5641 ----
  		 */
  		XLogFlush(lsn);
  
! 		/* Make sure files supposed to be dropped are dropped */
! 		DropRelationFiles(parsed->xnodes, parsed->nrels, true);
  	}
  
  	/*
***************
*** 5683,5689 **** xact_redo_commit(xl_xact_parsed_commit *parsed,
  static void
  xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
  {
- 	int			i;
  	TransactionId max_xid;
  
  	Assert(TransactionIdIsValid(xid));
--- 5674,5679 ----
***************
*** 5748,5763 **** xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
  	}
  
  	/* Make sure files supposed to be dropped are dropped */
! 	for (i = 0; i < parsed->nrels; i++)
! 	{
! 		SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
! 		ForkNumber	fork;
! 
! 		for (fork = 0; fork <= MAX_FORKNUM; fork++)
! 			XLogDropRelation(parsed->xnodes[i], fork);
! 		smgrdounlink(srel, true);
! 		smgrclose(srel);
! 	}
  }
  
  void
--- 5738,5744 ----
  	}
  
  	/* Make sure files supposed to be dropped are dropped */
! 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
  }
  
  void
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 26,31 ****
--- 26,32 ----
  #include <sys/file.h>
  
  #include "miscadmin.h"
+ #include "access/xlogutils.h"
  #include "access/xlog.h"
  #include "pgstat.h"
  #include "portability/instr_time.h"
***************
*** 1703,1708 **** ForgetDatabaseFsyncRequests(Oid dbid)
--- 1704,1746 ----
  	}
  }
  
+ /*
+  * DropRelationFiles -- drop files of all given relations
+  */
+ void
+ DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo)
+ {
+ 	SMgrRelation *srels;
+ 	int			i;
+ 
+ 	srels = palloc(sizeof(SMgrRelation) * ndelrels);
+ 	for (i = 0; i < ndelrels; i++)
+ 	{
+ 		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+ 
+ 		if (isRedo)
+ 		{
+ 			ForkNumber	fork;
+ 
+ 			for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ 				XLogDropRelation(delrels[i], fork);
+ 		}
+ 		srels[i] = srel;
+ 	}
+ 
+ 	smgrdounlinkall(srels, ndelrels, isRedo);
+ 
+ 	/*
+ 	 * Call smgrclose() in reverse order as when smgropen() is called.
+ 	 * This trick enables remove_from_unowned_list() in smgrclose()
+ 	 * to search the SMgrRelation from the unowned list,
+ 	 * in O(1) performance.
+ 	 */
+ 	for (i = ndelrels - 1; i >= 0; i--)
+ 		smgrclose(srels[i]);
+ 	pfree(srels);
+ }
+ 
  
  /*
   *	_fdvec_resize() -- Resize the fork's open segments array
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 143,147 **** extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
--- 143,148 ----
  					 BlockNumber segno);
  extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
  extern void ForgetDatabaseFsyncRequests(Oid dbid);
+ extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
  
  #endif							/* SMGR_H */

#34

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Fujii Masao (#33)

Re: Speedup of relation deletes during recovery

On Tue, Jul 03, 2018 at 04:13:15AM +0900, Fujii Masao wrote:

OK, so what about the attached patch?

I have been looking at this patch, and this looks in good shape to me
(please indent!).

+    * Call smgrclose() in reverse order as when smgropen() is called.
+    * This trick enables remove_from_unowned_list() in smgrclose()
+    * to search the SMgrRelation from the unowned list,
+    * in O(1) performance.

A nit here: with O(1) performance.

Could it be possible to add an assertion so as isRedo is never enabled
out of recovery?
--
Michael

#35

Fujii Masao

masao.fujii@gmail.com

over 7 years ago

In reply to: Michael Paquier (#34)

Re: Speedup of relation deletes during recovery

On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 03, 2018 at 04:13:15AM +0900, Fujii Masao wrote:

OK, so what about the attached patch?

I have been looking at this patch, and this looks in good shape to me

Thanks for the review! So, committed.

(please indent!).

Hmm.. I failed to find indent issue in my patch... But anyway
future execution of pgindent will fix that even if it exists.

+    * Call smgrclose() in reverse order as when smgropen() is called.
+    * This trick enables remove_from_unowned_list() in smgrclose()
+    * to search the SMgrRelation from the unowned list,
+    * in O(1) performance.

A nit here: with O(1) performance.

I changed the patch accordingly.

Could it be possible to add an assertion so as isRedo is never enabled
out of recovery?

I'm not sure if adding such assertion into only the function that
the patch added is helpful. There are many functions having something
like isRedo as an argument.

Regards,

--
Fujii Masao

#36

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Fujii Masao (#35)

Re: Speedup of relation deletes during recovery

On Thu, Jul 05, 2018 at 03:10:33AM +0900, Fujii Masao wrote:

On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier <michael@paquier.xyz> wrote:
Thanks for the review! So, committed.

Thanks.

(please indent!).

Hmm.. I failed to find indent issue in my patch... But anyway
future execution of pgindent will fix that even if it exists.

md.c is telling a rather different story ;)

Well, spurious noise when touching the same files for a different patch
is annoying... I have made the experience already for a couple of
commits.
--
Michael

#37

Jamison, Kirk

k.jamison@jp.fujitsu.com

over 6 years ago

In reply to: Fujii Masao (#9)

RE: Speedup of relation deletes during recovery

Hello Fujii-san,

On April 18, 2018, Fujii Masao wrote:

On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:

Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE
scans the shared buffers once for each table, TRUNCATE does that for
each fork, resulting in three scans per table. How about changing the
following functions

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber
nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
BlockNumber firstDelBlock)

to

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber
*nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
BlockNumber *firstDelBlock,
int nforks)

to perform the scan only once per table? If there doesn't seem to be a problem,
I think I'll submit the patch next month. I'd welcome if anyone could do that.

Yeah, it's worth working on this problem. To decrease the number of scans of
shared_buffers, you would need to change the order of truncations of files and WAL
logging. In RelationTruncate(), currently WAL is logged after FSM and VM are truncated.
IOW, with the patch, FSM and VM would need to be truncated after WAL logging. You would
need to check whether this reordering is valid.

I know it's been almost a year since the previous message, but if you could still
recall your suggestion above, I'd like to ask question.
Could you explain your idea a bit further on how would the reordering of WAL writing and
file truncation possibly reduce the number of shared_buffers scan?
Thanks a lot in advance.

Regards,
Kirk Jamison

#38

Fujii Masao

masao.fujii@gmail.com

over 6 years ago

In reply to: Jamison, Kirk (#37)

Re: Speedup of relation deletes during recovery

On Tue, Apr 16, 2019 at 10:48 AM Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:

Hello Fujii-san,

On April 18, 2018, Fujii Masao wrote:

On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:

Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE
scans the shared buffers once for each table, TRUNCATE does that for
each fork, resulting in three scans per table. How about changing the
following functions

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber
nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
BlockNumber firstDelBlock)

to

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber
*nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
BlockNumber *firstDelBlock,
int nforks)

to perform the scan only once per table? If there doesn't seem to be a problem,
I think I'll submit the patch next month. I'd welcome if anyone could do that.

Yeah, it's worth working on this problem. To decrease the number of scans of
shared_buffers, you would need to change the order of truncations of files and WAL
logging. In RelationTruncate(), currently WAL is logged after FSM and VM are truncated.
IOW, with the patch, FSM and VM would need to be truncated after WAL logging. You would
need to check whether this reordering is valid.

I know it's been almost a year since the previous message, but if you could still
recall your suggestion above, I'd like to ask question.
Could you explain your idea a bit further on how would the reordering of WAL writing and
file truncation possibly reduce the number of shared_buffers scan?

Sorry, I forgot the detail of that my comment. But anyway, you want to
modify smgr_redo(info = XLOG_SMGR_TRUNCATE) so that the number of
scans of shared_buffers to be decreased to one. Right?

IMO it's worth thinking to change smgrtruncate(MAIN_FORK),
FreeSpaceMapTruncateRel() and visibilitymap_truncate() so that
they just mark the relation and blocks as to be deleted, and then
so that they scan shared_buffers at once to invalidate the blocks
at the end of smgr_redo().

Regards,

--
Fujii Masao