slow dropping of tables, DropRelFileNodeBuffers, tas

Started by Sergey Koposov · 23 messages
#1Sergey Koposov
koposov@ast.cam.ac.uk

Hello,

I was running some tests on PG9.2beta where I'm creating and dropping a
large number of tables (~20000).

And I noticed that table dropping was extremely slow -- roughly half a
second per table.

I tried to move the table dropping into bigger transactions (100 drops
per transaction), increasing max_locks_per_transaction to 128 at the same
time. And still, the commits took ~50-60 seconds.

I tried to oprofile it, and I saw that 99% is spent in
DropRelFileNodeBuffers, and when compiled with symbols (CFLAGS=-g)
I saw that most of the time is actually spent in the tas() function, see
below:

CPU: Intel Architectural Perfmon, speed 1862 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % linenr info symbol name
-------------------------------------------------------------------------------
831665 86.6756 s_lock.h:204 tas
831665 100.000 s_lock.h:204 tas [self]
-------------------------------------------------------------------------------
122573 12.7745 bufmgr.c:2048 DropRelFileNodeBuffers
122573 100.000 bufmgr.c:2048 DropRelFileNodeBuffers [self]
-------------------------------------------------------------------------------
849 0.0885 xlog.c:697 XLogInsert
849 100.000 xlog.c:697 XLogInsert [self]
-------------------------------------------------------------------------------
269 0.0280 catcache.c:444 CatalogCacheIdInvalidate
269 100.000 catcache.c:444 CatalogCacheIdInvalidate [self]
-------------------------------------------------------------------------------
232 0.0242 catcache.c:1787 PrepareToInvalidateCacheTuple
232 100.000 catcache.c:1787 PrepareToInvalidateCacheTuple [self]
-------------------------------------------------------------------------------
202 0.0211 dynahash.c:807 hash_search_with_hash_value
202 100.000 dynahash.c:807 hash_search_with_hash_value [self]
-------------------------------------------------------------------------------
199 0.0207 nbtsearch.c:344 _bt_compare
199 100.000 nbtsearch.c:344 _bt_compare [self]
-------------------------------------------------------------------------------
198 0.0206 list.c:506 list_member_oid
198 100.000 list.c:506 list_member_oid [self]

Is the current behaviour expected? I ask because I saw the comment around
DropRelFileNodeBuffers, saying:
"XXX currently it sequentially searches the buffer pool, should be
changed to more clever ways of searching. However, this
routine is used only in code paths that aren't very
performance-critical, and we shouldn't slow down the hot paths to make it
faster".

Maybe it is a stupid question, but I also wonder whether the root cause of
what I'm seeing could also be responsible for the problems I recently
reported about scaling and locking:
http://archives.postgresql.org/pgsql-hackers/2012-05/msg01118.php

Some additional info:
The database is accessed in a single thread, and shared_buffers is 10G.
The tables themselves are essentially empty. The CPU, if it matters, is
4x Xeon E7-4807 (Westmere architecture).
I did a VACUUM FULL of everything just in case, and it didn't help.
Another possibly important factor I noticed is that
pg_catalog.pg_attribute is quite large (238 MB), because of the large
number of tables times columns.

I also stopped PG with gdb a few times and it was always at this backtrace:

(gdb) bt
#0 tas (lock=0x7fa4e3007538 "\001") at ../../../../src/include/storage/s_lock.h:218
#1 0x00000000006e6956 in DropRelFileNodeBuffers (rnode=..., forkNum=VISIBILITYMAP_FORKNUM, firstDelBlock=0) at bufmgr.c:2062
#2 0x000000000070c014 in smgrdounlink (reln=0x1618210, forknum=VISIBILITYMAP_FORKNUM, isRedo=0 '\000') at smgr.c:354
#3 0x000000000051ecf6 in smgrDoPendingDeletes (isCommit=1 '\001') at storage.c:364
#4 0x00000000004a7b33 in CommitTransaction () at xact.c:1925
#5 0x00000000004a8479 in CommitTransactionCommand () at xact.c:2524
#6 0x0000000000710b3f in finish_xact_command () at postgres.c:2419
#7 0x000000000070ff4e in exec_execute_message (portal_name=0x1608990 "", max_rows=1) at postgres.c:1956
#8 0x0000000000712988 in PostgresMain (argc=2, argv=0x1548568, username=0x1548390
with only forknum changing to MAIN_FORKNUM or FSM_FORKNUM

Thanks in advance,
Sergey

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Sergey Koposov (#1)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 30.05.2012 03:40, Sergey Koposov wrote:

I was running some tests on PG9.2beta where I'm creating and dropping a
large number of tables (~20000).

And I noticed that table dropping was extremely slow -- roughly half a
second per table.

...

I also stopped PG with gdb a few times and it was always at this backtrace:

(gdb) bt
#0 tas (lock=0x7fa4e3007538 "\001") at
../../../../src/include/storage/s_lock.h:218
#1 0x00000000006e6956 in DropRelFileNodeBuffers (rnode=...,
forkNum=VISIBILITYMAP_FORKNUM, firstDelBlock=0) at bufmgr.c:2062
#2 0x000000000070c014 in smgrdounlink (reln=0x1618210,
forknum=VISIBILITYMAP_FORKNUM, isRedo=0 '\000') at smgr.c:354
#3 0x000000000051ecf6 in smgrDoPendingDeletes (isCommit=1 '\001') at
storage.c:364

Hmm, we do this in smgrDoPendingDeletes:

for (i = 0; i <= MAX_FORKNUM; i++)
{
	smgrdounlink(srel, i, false);
}

So we drop the buffers for each relation fork separately, which means
that we scan the buffer pool four times. Relation forks in 8.4
introduced that issue, and 9.1 made it worse by adding another fork for
unlogged tables. With some refactoring, we could scan the buffer pool
just once. That would help a lot.

Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
grabbing the spinlocks on every buffer? It could do an unlocked test
first, and only grab the spinlock on buffers that need to be dropped.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Wed, May 30, 2012 at 7:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

So we drop the buffers for each relation fork separately, which means that
we scan the buffer pool four times. Relation forks in 8.4 introduced that
issue, and 9.1 made it worse by adding another fork for unlogged tables.
With some refactoring, we could scan the buffer pool just once. That would
help a lot.

+1.

Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
grabbing the spinlocks on every buffer? It could do an unlocked test first,
and only grab the spinlock on buffers that need to be dropped.

I think it would be possible for the unlocked test to indicate that
the buffer should be dropped when it really ought not to be, because
someone else might be in the middle of changing the buffer tag, and
that's not atomic. So you'd have to recheck after taking the
spinlock. However, I don't think it's possible for the unlocked test
to report a false negative, because we've already taken
AccessExclusiveLock on the relation, which had better be enough to
guarantee that nobody's pulling in any more buffers from that relation
(if it doesn't guarantee that, the current code is already broken).
Acquiring a heavyweight lock also interposes a full memory barrier,
which should eliminate any risks due to memory-ordering effects.
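
For concreteness, a minimal sketch of that test-then-recheck pattern
inside the DropRelFileNodeBuffers() scan loop, using the existing
bufmgr.c names (a sketch only, not the committed code):

	for (i = 0; i < NBuffers; i++)
	{
		volatile BufferDesc *bufHdr = &BufferDescriptors[i];

		/*
		 * Unlocked pre-test. It can yield a false positive while someone
		 * is changing a buffer tag, so recheck under the spinlock; a false
		 * negative is impossible because the AccessExclusiveLock keeps new
		 * buffers of this relation from appearing.
		 */
		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
			continue;

		LockBufHdr(bufHdr);
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
			bufHdr->tag.forkNum == forkNum &&
			bufHdr->tag.blockNum >= firstDelBlock)
			InvalidateBuffer(bufHdr);	/* releases spinlock */
		else
			UnlockBufHdr(bufHdr);
	}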

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Wed, May 30, 2012 at 4:10 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 30.05.2012 03:40, Sergey Koposov wrote:

I was running some tests on PG9.2beta where I'm creating and dropping a
large number of tables (~20000).

And I noticed that table dropping was extremely slow -- roughly half a
second per table.

...

I also stopped PG with gdb a few times and it was always at this
backtrace:

(gdb) bt
#0 tas (lock=0x7fa4e3007538 "\001") at
../../../../src/include/storage/s_lock.h:218
#1 0x00000000006e6956 in DropRelFileNodeBuffers (rnode=...,
forkNum=VISIBILITYMAP_FORKNUM, firstDelBlock=0) at bufmgr.c:2062
#2 0x000000000070c014 in smgrdounlink (reln=0x1618210,
forknum=VISIBILITYMAP_FORKNUM, isRedo=0 '\000') at smgr.c:354
#3 0x000000000051ecf6 in smgrDoPendingDeletes (isCommit=1 '\001') at
storage.c:364

Hmm, we do this in smgrDoPendingDeletes:

for (i = 0; i <= MAX_FORKNUM; i++)
{
       smgrdounlink(srel, i, false);
}

So we drop the buffers for each relation fork separately, which means that
we scan the buffer pool four times. Relation forks in 8.4 introduced that
issue, and 9.1 made it worse by adding another fork for unlogged tables.
With some refactoring, we could scan the buffer pool just once. That would
help a lot.

If someone drops many tables in the same transaction, could it be made
to stuff them into a hash table and then drop all of them with one
cycle around the buffer pool? Or is that use case too narrow to be
worthwhile?
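
A hypothetical sketch of that one-pass idea, keyed on the buffer tag's
RelFileNode via a dynahash table (droptab and the helper that fills it
are invented names, everything else is existing bufmgr/dynahash API):

	/* droptab: HTAB keyed by RelFileNode, one entry per pending drop */
	HTAB	   *droptab = collect_pending_drop_rnodes();	/* hypothetical */
	int			i;

	for (i = 0; i < NBuffers; i++)
	{
		volatile BufferDesc *bufHdr = &BufferDescriptors[i];
		bool		found;

		/* unlocked pre-test; recheck under the buffer-header spinlock */
		hash_search(droptab, (void *) &bufHdr->tag.rnode, HASH_FIND, &found);
		if (!found)
			continue;

		LockBufHdr(bufHdr);
		hash_search(droptab, (void *) &bufHdr->tag.rnode, HASH_FIND, &found);
		if (found)
			InvalidateBuffer(bufHdr);	/* releases spinlock */
		else
			UnlockBufHdr(bufHdr);
	}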

Cheers,

Jeff

#5Ants Aasma
ants@cybertec.at
In reply to: Heikki Linnakangas (#2)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Wed, May 30, 2012 at 2:10 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
grabbing the spinlocks on every buffer? It could do an unlocked test first,
and only grab the spinlock on buffers that need to be dropped.

The scanning of buffers seemed awfully slow to me. Doing the math, it
ends up being somewhere around 120ns per buffer, about consistent with
a full cache miss. It looks like the spinlock tas implementation (lock
xchgb) is preventing prefetching. This suspicion is corroborated by
the following comment in the Linux kernel's percpu.h:

/*
* xchg is implemented using cmpxchg without a lock prefix. xchg is
* expensive due to the implied lock prefix. The processor cannot prefetch
* cachelines if xchg is used.
*/

I'm not sure, but doing an unlocked test first might also be useful in
triggering the prefetchers. The CPU should be doing a lot better than
the current ~4.3GB/s when scanning buffer descriptors.

Of course, not scanning at all, or doing fewer scans at the expense of
more work in the inner loop, would be even better.
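
As a back-of-envelope check of those numbers against Sergey's report
(assuming the default 8kB block size):

	10GB shared_buffers / 8kB   ~= 1,310,000 buffer descriptors per scan
	1,310,000 * 120ns           ~= 0.16 s per scan
	4 scans (one per fork)      ~= 0.6 s per relation dropped

which lines up with the observed ~half a second per table and ~50-60 s
per 100-table commit.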

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#2)
1 attachment(s)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 30 May 2012 12:10, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, we do this in smgrDoPendingDeletes:

for (i = 0; i <= MAX_FORKNUM; i++)
{
       smgrdounlink(srel, i, false);
}

So we drop the buffers for each relation fork separately, which means that
we scan the buffer pool four times. Relation forks in 8.4 introduced that
issue, and 9.1 made it worse by adding another fork for unlogged tables.
With some refactoring, we could scan the buffer pool just once. That would
help a lot.

That struck me as a safe and easy optimisation. This was a problem I'd
been trying to optimise for 9.2, so I've written a patch that appears
simple and clean enough to be applied directly.

Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
grabbing the spinlocks on every buffer? It could do an unlocked test first,
and only grab the spinlock on buffers that need to be dropped.

Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

dropallforks.v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6db46c0..c01b170 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1356,12 +1356,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	for (i = 0; i < ndelrels; i++)
 	{
 		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
-		ForkNumber	fork;
-
-		for (fork = 0; fork <= MAX_FORKNUM; fork++)
-		{
-			smgrdounlink(srel, fork, false);
-		}
+		smgrdounlink(srel, AllForks, false);
 		smgrclose(srel);
 	}
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5ae46de..a117660 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4633,13 +4633,8 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
 	for (i = 0; i < nrels; i++)
 	{
 		SMgrRelation srel = smgropen(xnodes[i], InvalidBackendId);
-		ForkNumber	fork;
-
-		for (fork = 0; fork <= MAX_FORKNUM; fork++)
-		{
-			XLogDropRelation(xnodes[i], fork);
-			smgrdounlink(srel, fork, true);
-		}
+		XLogDropRelation(xnodes[i], AllForks);
+		smgrdounlink(srel, AllForks, true);
 		smgrclose(srel);
 	}
 
@@ -4773,13 +4768,8 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
 	for (i = 0; i < xlrec->nrels; i++)
 	{
 		SMgrRelation srel = smgropen(xlrec->xnodes[i], InvalidBackendId);
-		ForkNumber	fork;
-
-		for (fork = 0; fork <= MAX_FORKNUM; fork++)
-		{
-			XLogDropRelation(xlrec->xnodes[i], fork);
-			smgrdounlink(srel, fork, true);
-		}
+		XLogDropRelation(xlrec->xnodes[i], AllForks);
+		smgrdounlink(srel, AllForks, true);
 		smgrclose(srel);
 	}
 }
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f286cdf..2578585 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -148,7 +148,7 @@ forget_invalid_pages(RelFileNode node, ForkNumber forkno, BlockNumber minblkno)
 	while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
 	{
 		if (RelFileNodeEquals(hentry->key.node, node) &&
-			hentry->key.forkno == forkno &&
+			(hentry->key.forkno == forkno || forkno == AllForks) &&
 			hentry->key.blkno >= minblkno)
 		{
 			if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a017101..b871604 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -356,13 +356,9 @@ smgrDoPendingDeletes(bool isCommit)
 			if (pending->atCommit == isCommit)
 			{
 				SMgrRelation srel;
-				int			i;
 
 				srel = smgropen(pending->relnode, pending->backend);
-				for (i = 0; i <= MAX_FORKNUM; i++)
-				{
-					smgrdounlink(srel, i, false);
-				}
+				smgrdounlink(srel, AllForks, false);
 				smgrclose(srel);
 			}
 			/* must explicitly free the list entry */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a1b588b..124983c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2061,7 +2061,8 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
 
 		LockBufHdr(bufHdr);
 		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
-			bufHdr->tag.forkNum == forkNum &&
+			(bufHdr->tag.forkNum == forkNum ||
+			 forkNum == AllForks) &&
 			bufHdr->tag.blockNum >= firstDelBlock)
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5f87543..c875606 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -343,9 +343,16 @@ smgrdounlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
 	RelFileNodeBackend rnode = reln->smgr_rnode;
 	int			which = reln->smgr_which;
+	int			i;
 
 	/* Close the fork */
-	(*(smgrsw[which].smgr_close)) (reln, forknum);
+	if (forknum == AllForks)
+	{
+		for (i = 0; i <= MAX_FORKNUM; i++)
+			(*(smgrsw[which].smgr_close)) (reln, i);
+	}
+	else
+		(*(smgrsw[which].smgr_close)) (reln, forknum);
 
 	/*
 	 * Get rid of any remaining buffers for the relation.  bufmgr will just
@@ -377,7 +384,13 @@ smgrdounlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 	 * ERROR, because we've already decided to commit or abort the current
 	 * xact.
 	 */
-	(*(smgrsw[which].smgr_unlink)) (rnode, forknum, isRedo);
+	if (forknum == AllForks)
+	{
+		for (i = 0; i <= MAX_FORKNUM; i++)
+			(*(smgrsw[which].smgr_unlink)) (rnode, i, isRedo);
+	}
+	else
+		(*(smgrsw[which].smgr_unlink)) (rnode, forknum, isRedo);
 }
 
 /*
diff --git a/src/include/storage/relfilenode.h b/src/include/storage/relfilenode.h
index 60c3829..451139c 100644
--- a/src/include/storage/relfilenode.h
+++ b/src/include/storage/relfilenode.h
@@ -20,10 +20,12 @@
  * The physical storage of a relation consists of one or more forks. The
  * main fork is always created, but in addition to that there can be
  * additional forks for storing various metadata. ForkNumber is used when
- * we need to refer to a specific fork in a relation.
+ * we need to refer to a specific fork in a relation, or AllForks
+ * can be used in places to refer to all forks of a relation.
  */
 typedef enum ForkNumber
 {
+	AllForks = -2,
 	InvalidForkNumber = -1,
 	MAIN_FORKNUM = 0,
 	FSM_FORKNUM,
#7Sergey Koposov
koposov@ast.cam.ac.uk
In reply to: Simon Riggs (#6)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Thu, 31 May 2012, Simon Riggs wrote:

That struck me as a safe and easy optimisation. This was a problem I'd
been trying to optimise for 9.2, so I've written a patch that appears
simple and clean enough to be applied directly.

Thanks! The patch indeed improved the timings.
The dropping of 100 tables in a single commit took ~50 seconds before the
patch; now it takes ~5 sec (it would be nice to reduce it further though,
because dropping 10000 tables still takes ~10 min).

Cheers,
S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

#8Jeff Janes
jeff.janes@gmail.com
In reply to: Sergey Koposov (#7)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Thu, May 31, 2012 at 11:09 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:

On Thu, 31 May 2012, Simon Riggs wrote:

That struck me as a safe and easy optimisation. This was a problem I'd
been trying to optimise for 9.2, so I've written a patch that appears
simple and clean enough to be applied directly.

Thanks! The patch indeed improved the timings. The dropping of 100 tables
in a single commit took ~50 seconds before the patch; now it takes ~5 sec
(it would be nice to reduce it further though, because dropping 10000
tables still takes ~10 min).

I'm surprised it helped that much. I thought the most it could
theoretically help would be a factor of 4.

I tried the initial unlocked test, and for me it cut the time by a
factor of 3. But I only have 1GB of shared_buffers at most; I would
expect it to help more at larger sizes, because there is a constant
overhead not related to scanning the shared buffers which gets diluted
away as shared_buffers grows.

I added to that a drop-all very similar to what Simon posted and got
another factor of 3.

But, if you can do this during a maintenance window, then just
restarting with a much smaller shared_buffers should give you a much
larger speed up than either or both of these. If I can extrapolate up
to 10G from my current curve, setting it to 8MB instead would give a
speed up of nearly 400 fold.

Cheers,

Jeff

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Sergey Koposov (#7)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 31 May 2012 19:09, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:

On Thu, 31 May 2012, Simon Riggs wrote:

That struck me as a safe and easy optimisation. This was a problem I'd
been trying to optimise for 9.2, so I've written a patch that appears
simple and clean enough to be applied directly.

Thanks! The patch indeed improved the timings. The dropping of 100 tables
in a single commit took ~50 seconds before the patch; now it takes ~5 sec

Thanks for the timing.

(it
would be nice to reduce it further though, because dropping 10000
tables still takes ~10 min).

Why do you have 10,000 tables and why is it important to drop them so quickly?

If it's that important, why not run the drop in parallel sessions?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#10Sergey Koposov
koposov@ast.cam.ac.uk
In reply to: Simon Riggs (#9)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Fri, 1 Jun 2012, Simon Riggs wrote:

Why do you have 10,000 tables and why is it important to drop them so quickly?

10000 tables are there because that's the number of partitions. And I'm
dropping them at the moment because I'm doing testing. So it won't be
really crucial for production, but I still thought it was worth reporting,
especially since dropping a table took half a second.

The problem is that when I set up the shared_buffers say to 48G, the
timings of the tables rise significantly again.

If it's that important, why not run the drop in parallel sessions?

Yes, before there was a strong reason to do that; now the timings are more
manageable, but maybe I'll implement that.

Cheers,
S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Sergey Koposov (#10)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 1 June 2012 12:34, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:

On Fri, 1 Jun 2012, Simon Riggs wrote:

Why do you have 10,000 tables and why is it important to drop them so
quickly?

10000 tables are there because that's the number of partitions. And I'm
dropping them at the moment because I'm doing testing. So it won't be
really crucial for production, but I still thought it was worth reporting,
especially since dropping a table took half a second.

Ah, partitions. That explains the long drop time.

Hopefully people don't need to do that too frequently.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#12Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#6)
1 attachment(s)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Thu, May 31, 2012 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 30 May 2012 12:10, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, we do this in smgrDoPendingDeletes:

for (i = 0; i <= MAX_FORKNUM; i++)
{
       smgrdounlink(srel, i, false);
}

So we drop the buffers for each relation fork separately, which means that
we scan the buffer pool four times. Relation forks in 8.4 introduced that
issue, and 9.1 made it worse by adding another fork for unlogged tables.
With some refactoring, we could scan the buffer pool just once. That would
help a lot.

That struck me as a safe and easy optimisation. This was a problem I'd
been trying to optimise for 9.2, so I've written a patch that appears
simple and clean enough to be applied directly.

By directly do you mean before the fork/commit fest begins?

Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
grabbing the spinlocks on every buffer? It could do an unlocked test first,
and only grab the spinlock on buffers that need to be dropped.

Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.

Doing an initial unlocked test speeds things up another 2.69 fold (on
top of 3.55 for your patch) for me, with 1GB of shared buffers. That
seems like it should be worthwhile.

How do we go about getting reasonable proof that it is safe?

Thanks,

Jeff

Attachments:

DropRelFileNodeBuffers_unlock_v1.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
new file mode 100644
index acec3bd..1c26555
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** DropRelFileNodeBuffers(RelFileNodeBacken
*** 2059,2064 ****
--- 2059,2068 ----
  	{
  		volatile BufferDesc *bufHdr = &BufferDescriptors[i];
  
+ 		/* due to AccessExclusive lock, no tags should be mutating towards
+ 		 * The rnode.  They might be mutating away, so recheck after locking*/
+ 		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) continue;
+ 
  		LockBufHdr(bufHdr);
  		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
  			(bufHdr->tag.forkNum == forkNum ||
#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#12)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 3 June 2012 19:07, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, May 31, 2012 at 5:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 30 May 2012 12:10, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, we do this in smgrDoPendingDeletes:

for (i = 0; i <= MAX_FORKNUM; i++)
{
       smgrdounlink(srel, i, false);
}

So we drop the buffers for each relation fork separately, which means that
we scan the buffer pool four times. Relation forks in 8.4 introduced that
issue, and 9.1 made it worse by adding another fork for unlogged tables.
With some refactoring, we could scan the buffer pool just once. That would
help a lot.

That struck me as a safe and easy optimisation. This was a problem I'd
been trying to optimise for 9.2, so I've written a patch that appears
simple and clean enough to be applied directly.

By directly do you mean before the fork/commit fest begins?

Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
grabbing the spinlocks on every buffer? It could do an unlocked test first,
and only grab the spinlock on buffers that need to be dropped.

Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.

Doing an initial unlocked test speeds things up another 2.69 fold (on
top of 3.55 for your patch) for me, with 1GB of shared buffers.  That
seems like it should be worthwhile.

How do we go about getting reasonable proof that it is safe?

That's enough for me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#14Sergey Koposov
koposov@ast.cam.ac.uk
In reply to: Simon Riggs (#13)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Tue, 5 Jun 2012, Simon Riggs wrote:

Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.

Doing an initial unlocked test speeds things up another 2.69 fold (on
top of 3.55 for your patch) for me, with 1GB of shared buffers.  That
seems like it should be worthwhile.

How do we go about getting reasonable proof that it is safe?

That's enough for me.

So is it planned to apply that patch for 9.2 ?

Thanks,
S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sergey Koposov (#14)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

Sergey Koposov <koposov@ast.cam.ac.uk> writes:

On Tue, 5 Jun 2012, Simon Riggs wrote:

Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.

Doing an initial unlocked test speeds things up another 2.69 fold (on
top of 3.55 for your patch) for me, with 1GB of shared buffers. That
seems like it should be worthwhile.

How do we go about getting reasonable proof that it is safe?

That's enough for me.

Say what? That's a performance result and proves not a damn thing about
safety.

regards, tom lane

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#15)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 7 June 2012 14:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Sergey Koposov <koposov@ast.cam.ac.uk> writes:

On Tue, 5 Jun 2012, Simon Riggs wrote:

Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.

Doing an initial unlocked test speeds things up another 2.69 fold (on
top of 3.55 for your patch) for me, with 1GB of shared buffers.  That
seems like it should be worthwhile.

How do we go about getting reasonable proof that it is safe?

That's enough for me.

Say what?  That's a performance result and proves not a damn thing about
safety.

Of course not.

Based on the rationale explained in the code comments in the patch, it
seems like a reasonable thing to me now.

The argument was that since we hold AccessExclusiveLock on the
relation, no other agent can be reading in new parts of the table into
new buffers, so the only change to a buffer would be away from the
dropping relation, in which case we wouldn't care. Which seems correct
to me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#16)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

Simon Riggs <simon@2ndQuadrant.com> writes:

On 7 June 2012 14:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Say what? That's a performance result and proves not a damn thing about
safety.

Of course not.

Based on the rationale explained in the code comments in the patch, it
seems like a reasonable thing to me now.

The argument was that since we hold AccessExclusiveLock on the
relation, no other agent can be reading in new parts of the table into
new buffers, so the only change to a buffer would be away from the
dropping relation, in which case we wouldn't care. Which seems correct
to me.

Oh, I must be confused about which patch we are talking about --- I
thought this was in reference to some of the WIP ideas that were being
thrown about with respect to using lock-free access primitives. Which
patch are you proposing for commit now, exactly?

regards, tom lane

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#17)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 7 June 2012 17:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On 7 June 2012 14:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Say what?  That's a performance result and proves not a damn thing about
safety.

Of course not.

Based on the rationale explained in the code comments in the patch, it
seems like a reasonable thing to me now.

The argument was that since we hold AccessExclusiveLock on the
relation, no other agent can be reading in new parts of the table into
new buffers, so the only change to a buffer would be away from the
dropping relation, in which case we wouldn't care. Which seems correct
to me.

Oh, I must be confused about which patch we are talking about --- I
thought this was in reference to some of the WIP ideas that were being
thrown about with respect to using lock-free access primitives.  Which
patch are you proposing for commit now, exactly?

Both of these, as attached up thread.

Simon's patch - dropallforks.v1.patch
Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch
(needs a little tidyup)

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#18)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

Simon Riggs <simon@2ndQuadrant.com> writes:

On 7 June 2012 17:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh, I must be confused about which patch we are talking about --- I
thought this was in reference to some of the WIP ideas that were being
thrown about with respect to using lock-free access primitives. Which
patch are you proposing for commit now, exactly?

Both of these, as attached up thread.

Simon's patch - dropallforks.v1.patch
Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch
(needs a little tidyup)

OK, will take a look.

regards, tom lane

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#12)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

Jeff Janes <jeff.janes@gmail.com> writes:

On 30 May 2012 12:10, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
grabbing the spinlocks on every buffer? It could do an unlocked test first,
and only grab the spinlock on buffers that need to be dropped.

Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.

Doing an initial unlocked test speeds things up another 2.69 fold (on
top of 3.55 for your patch) for me, with 1GB of shared buffers. That
seems like it should be worthwhile.

With shared_buffers set to 1GB, I see about a 2X reduction in the total
time to drop a simple table, i.e.
create table zit(f1 text primary key);
drop table zit;
(This table definition is chosen to ensure there's an index and a toast
table involved, so several scans of the buffer pool are needed.) The
DROP goes from about 40ms to about 20ms on a fairly recent Xeon desktop.
So I'm convinced this is a win.

I extended the patch to also cover DropDatabaseBuffers,
FlushRelationBuffers, and FlushDatabaseBuffers, which have got the exact
same type of full-pool scan loop, and committed it.
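
For instance, in FlushRelationBuffers the same unlocked pre-test goes in
ahead of the existing locked logic; in sketch form only (rel->rd_node
being the file node of the relation to flush):

	for (i = 0; i < NBuffers; i++)
	{
		volatile BufferDesc *bufHdr = &BufferDescriptors[i];

		/* unlocked pre-test: skip buffers that can't belong to this rel */
		if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
			continue;

		/* ... existing locked dirty-check and FlushBuffer() call ... */
	}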

regards, tom lane

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#19)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

I wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Both of these, as attached up thread.
Simon's patch - dropallforks.v1.patch
Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch
(needs a little tidyup)

OK, will take a look.

I didn't like dropallforks.v1.patch at all as presented, for several
reasons:

* Introducing an AllForks notation that only works in some contexts is
a foot-gun of large caliber. This concern is not academic: you broke
dropping of temp relations entirely, in the patch as presented, because
for temp rels DropRelFileNodeBuffers would hand off to
DropRelFileNodeLocalBuffers and the latter had not been taught about
AllForks.

* Since we have found out in this thread that the inner loop of
DropRelFileNodeBuffers is performance-critical for the cases we're
dealing with, it seems inappropriate to me to make its tests more
complex. We want simpler, and we can have simpler given that the
relation-drop case cares neither about fork nor block number.

* The patch modified behavior of XLogDropRelation, which has not been
shown to be performance-critical, and probably can't be because the
hashtable it searches should never be all that large. It certainly
doesn't matter to the speed of normal execution.

I thought it would be a lot safer and probably a little bit quicker
if we just split DropRelFileNodeBuffers into two routines, one for
the specific-fork case and one for the all-forks case; and then the
same for its main caller smgrdounlink. So I modified the patch along
those lines and committed it.

As committed, the smgrdounlinkfork case is actually dead code; it's
never called from anywhere. I left it in place just in case we want
it someday.
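
In outline, the all-forks scan as committed reduces to this (simplified
sketch; the temp-relation/local-buffer path is elided):

	void
	DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
	{
		int			i;

		for (i = 0; i < NBuffers; i++)
		{
			volatile BufferDesc *bufHdr = &BufferDescriptors[i];

			/* unlocked pre-test; recheck under the buffer-header spinlock */
			if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
				continue;

			LockBufHdr(bufHdr);
			if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
				InvalidateBuffer(bufHdr);	/* releases spinlock */
			else
				UnlockBufHdr(bufHdr);
		}
	}

Note that fork and block number drop out of the inner-loop test entirely,
which keeps the performance-critical comparison as simple as before.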

regards, tom lane

#22Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#21)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On 7 June 2012 22:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I thought it would be a lot safer and probably a little bit quicker
if we just split DropRelFileNodeBuffers into two routines, one for
the specific-fork case and one for the all-forks case; and then the
same for its main caller smgrdounlink.  So I modified the patch along
those lines and committed it.

As committed, the smgrdounlinkfork case is actually dead code; it's
never called from anywhere.  I left it in place just in case we want
it someday.

That's fine. The first version of the patch did it exactly that way.

I tried to second-guess objections and so recoded it the way I submitted it.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#23Sergey Koposov
koposov@ast.cam.ac.uk
In reply to: Tom Lane (#20)
Re: slow dropping of tables, DropRelFileNodeBuffers, tas

On Thu, 7 Jun 2012, Tom Lane wrote:

I extended the patch to also cover DropDatabaseBuffers,
FlushRelationBuffers, and FlushDatabaseBuffers, which have got the exact
same type of full-pool scan loop, and committed it.

Thanks everybody for the patches/commits.

Sergey

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/