Relation extension scalability
Hello,
Currently bigger shared_buffers settings don't combine well with
relations being extended frequently, especially if many/most pages have
a high usagecount and/or are dirty and the system is IO constrained.
As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized
The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers, 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits, 5) can take a couple of seconds.
I've prototyped solving this for heap relations by moving the smgrnblocks()
+ smgrextend() calls to RelationGetBufferForTuple(). With some care
(including a retry loop) it's possible to hold the extension lock only
around those two calls. That indeed fixes problems in some of my tests.
I'm not sure whether the above is the best solution however. For one, I
think it's not necessarily a good idea to open-code this in hio.c - I've
not observed it, but this can probably happen for btrees and such as
well. For another, this is still an exclusive lock while we're doing IO:
smgrextend() wants a page to write out, so we have to be careful not to
overwrite things.
There's two things that seem to make sense to me:
First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.
Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively, there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it, to avoid creating sparse files.
The other reason we need the extension lock is code like
lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.
I've attached my POC for heap extension, but it's really just a POC at
this point.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: 0001-WIP-Saner-heap-extension.patch (text/x-patch)
Andres Freund <andres@2ndquadrant.com> writes:
As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized
The problems come from 4) and 5) potentially each taking a fair
while.
Right, so basically we want to get those steps out of the exclusive lock
scope.
There's two things that seem to make sense to me:
First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.
I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.
Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend().
I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases. And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.
One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
There's two things that seem to make sense to me:
First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.
I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.
It's not *that* many locations that need to extend relations. In my
playing around it seemed to me they all would need to be modified
anyway if we want to remove/reduce the scope of extension locks to deal
with the fact that somebody else could have started to use the buffer.
Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend().
I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases.
Hm. Not a bad point.
And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.
What I was thinking is something like this:
while (true)
{
    targetBuffer = AcquireFromFSMEquivalent();
    if (targetBuffer == InvalidBuffer)
        targetBuffer = ExtendRelation();
    LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE);
    if (BufferHasEnoughSpace(targetBuffer))
        break;
    LockBuffer(targetBuffer, BUFFER_LOCK_UNLOCK);
}
where ExtendRelation() would basically work like
while (true)
{
    targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ) / BLCKSZ;
    buffer = ReadBuffer(rel, targetBlock);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buffer);
    if (PageIsNew(page))
    {
        PageInit(page);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        FlushBuffer(buffer);
        break;
    }
    else
    {
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        continue;
    }
}
Obviously it's a tad more complex than that pseudocode, but I think that
basically should work. Except that it, as you say, can lead to some
oddities with out-of-space handling. I think it should actually be ok,
it might just be confusing to the user.
I think we might be able to address those issues by not using
lseek(SEEK_END) but instead
    flags = fcntl(fd, F_GETFL);
    fcntl(fd, F_SETFL, flags | O_APPEND);
    write(fd, pre_init_block, BLCKSZ);
    fcntl(fd, F_SETFL, flags);
    newblock = (lseek(fd, 0, SEEK_CUR) - BLCKSZ) / BLCKSZ;
By using O_APPEND and a pre-initialized block we can be sure to write a
valid block at the end, and we shouldn't run afoul of any out-of-space
issues that we don't already have.
Unfortunately I'm not sure whether toggling O_APPEND via fcntl is
portable :(
One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...
Greetings,
Andres Freund
Andres Freund <andres@2ndquadrant.com> writes:
On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...
No, you can't, at least not if the point is to not be holding any
exclusive lock by the time you go to talk to the buffer manager. There
will be nothing stopping some other backend from writing into that page of
the file before you can get hold of it. If the buffer they used to do the
write has itself gotten recycled, there is nothing left at all to tell you
your page image is out of date.
regards, tom lane
On 2015-03-29 16:07:49 -0400, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...
No, you can't, at least not if the point is to not be holding any
exclusive lock by the time you go to talk to the buffer manager. There
will be nothing stopping some other backend from writing into that page of
the file before you can get hold of it. If the buffer they used to do the
write has itself gotten recycled, there is nothing left at all to tell you
your page image is out of date.
That's why I'd proposed restructuring things so that the actual
extension/write to the file only happens once we have the buffer
manager's exclusive lock on the individual buffer. While not trivial to
implement, it doesn't look prohibitively complex.
Greetings,
Andres Freund
On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized
The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits 5) can take a couple seconds.
Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.
Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it, to avoid creating sparse files.
The other reason we need the extension lock is that code like
lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.
I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to
checkpoint any more. If we discover the problem while the block is
still all-zeroes, the transaction that uncovers the problem errors
out, but the system as a whole is still OK.
Or at least, so I think. Maybe I'm misunderstanding.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-03-29 20:02:06 -0400, Robert Haas wrote:
On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized
The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits 5) can take a couple seconds.
Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.
That might be the case sometimes, but it's not what I've actually
observed so far. I think modern filesystems that do preallocation have
resolved this to some degree.
Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it, to avoid creating sparse files.
The other reason we need the extension lock is that code like
lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.
I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block.
Well, we only write and register an fsync request. AFAICS we don't
actually perform the fsync at that point. I don't think having to do the
fsync() necessarily precludes removing the extension lock.
Once we've initialized the block, a subsequent failure to write or
fsync it will be hard to recover from;
At the very least the buffer shouldn't become dirty before we've
successfully written it once, right. It seems quite doable to achieve
that without the lock though. We'll have to do the write without going
through the buffer manager, but that seems doable.
Greetings,
Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to
checkpoint any more. If we discover the problem while the block is
still all-zeroes, the transaction that uncovers the problem errors
out, but the system as a whole is still OK.
Yeah. As Andres says, the fsync is not an important part of that,
but we do expect that ENOSPC will happen during the initial write()
if it's going to happen.
To some extent that's an obsolete assumption, I'm afraid --- I believe
that some modern filesystems don't necessarily overwrite the previous
version of a block, which would mean that they are capable of failing
with ENOSPC even during a re-write of a previously-written block.
However, the possibility of filesystem misfeasance of that sort doesn't
excuse us from having a clear recovery strategy for failures during
relation extension.
regards, tom lane
On Mon, Mar 30, 2015 at 12:26 AM, Andres Freund <andres@2ndquadrant.com>
wrote:
Hello,
Currently bigger shared_buffers settings don't combine well with
relations being extended frequently. Especially if many/most pages have
a high usagecount and/or are dirty and the system is IO constrained.
As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized
The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits 5) can take a couple seconds.
In the past, I have observed in one of the write-oriented tests that
backends have to flush pages by themselves many times, so in the above
situation that can lead to an even more severe bottleneck.
I've prototyped solving this for heap relations moving the smgrnblocks()
+ smgrextend() calls to RelationGetBufferForTuple(). With some care
(including a retry loop) it's possible to only do those two under the
extension lock. That indeed fixes problems in some of my tests.
So does this mean that the problem is contention on the extension
lock?
I'm not sure whether the above is the best solution however.
Another thing to note here is that during extension we extend by just
one block; wouldn't it make sense to extend by some bigger number (we
could even take input from the user, specifying how much to autoextend
a relation when it doesn't have any empty space)? During mdextend(), we
might still increase by just one block, but we could register a request
for a background process to increase the size, similar to what is done
for fsync.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
In the past, I have observed in one of the Write-oriented tests that
backend's have to flush the pages by themselves many a times, so
in above situation that can lead to more severe bottleneck.
Yes.
I've prototyped solving this for heap relations moving the smgrnblocks()
+ smgrextend() calls to RelationGetBufferForTuple(). With some care
(including a retry loop) it's possible to only do those two under the
extension lock. That indeed fixes problems in some of my tests.
So does this mean that the problem is contention on the extension
lock?
Yes, at least commonly. Obviously the extension lock would be less of a
problem if we were better at having a clean victim buffer ready.
I'm not sure whether the above is the best solution however.
Another thing to note here is that during extension we are extending
just one block, won't it make sense to increment it by some bigger
number (we can even take input from user for the same where user
can specify how much to autoextend a relation when the relation doesn't
have any empty space). During mdextend(), we might increase just one
block, however we can register the request for background process to
increase the size similar to what is done for fsync.
I think that's pretty much a separate patch. Made easier by moving
things out from under the lock, maybe. Other than that I'd prefer not
to mix things. There's a whole bunch of unrelated complexity that I
don't want to attach to the topic at the same time (autovacuum
truncating again and so on).
Greetings,
Andres Freund
On 3/30/15 6:45 AM, Andres Freund wrote:
On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
Another thing to note here is that during extension we are extending
just one block, won't it make sense to increment it by some bigger
number (we can even take input from user for the same where user
can specify how much to autoextend a relation when the relation doesn't
have any empty space). During mdextend(), we might increase just one
block, however we can register the request for background process to
increase the size similar to what is done for fsync.
I think that's pretty much a separate patch. Made easier by moving
things out of under the lock maybe. Other than that I'd prefer not to
mix things. There's a whole bunch of unrelated complexity that I don't
want to attach to the topic at the same time (autovacuum truncating
again and so on).
Agreed that it makes more sense for this to be in a separate patch, but
I definitely like the idea.
A user configurable setting would be fine, but better would be to learn
from the current growth rate of the table and extend based on that.
For instance, if a table is very large but is only growing by a few
rows a day, there's probably no need for a large extent. Conversely, an
initially small table growing by 1GB per minute would definitely benefit
from large extents and it would be good to be able to track growth and
compute extent sizes accordingly.
Of course, a manual setting to start with would cover most use cases.
Large tables in a database are generally in the minority and known in
advance.
--
- David Steele
david@pgmasters.net
* David Steele (david@pgmasters.net) wrote:
On 3/30/15 6:45 AM, Andres Freund wrote:
On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:
Another thing to note here is that during extension we are extending
just one block, won't it make sense to increment it by some bigger
number (we can even take input from user for the same where user
can specify how much to autoextend a relation when the relation doesn't
have any empty space). During mdextend(), we might increase just one
block, however we can register the request for background process to
increase the size similar to what is done for fsync.
I think that's pretty much a separate patch. Made easier by moving
things out of under the lock maybe. Other than that I'd prefer not to
mix things. There's a whole bunch of unrelated complexity that I don't
want to attach to the topic at the same time (autovacuum truncating
again and so on).
Agreed that it makes more sense for this to be in a separate patch, but
I definitely like the idea.
A user configurable setting would be fine, but better would be to learn
from the current growth rate of the table and extend based on that.
For instance, if a table is very large but is only growing by a few
rows a day, there's probably no need for a large extent. Conversely, an
initially small table growing by 1GB per minute would definitely benefit
from large extents and it would be good to be able to track growth and
compute extent sizes accordingly.
Of course, a manual setting to start with would cover most use cases.
Large tables in a database are generally in the minority and known in
advance.
If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need
to adjust how big the extents are any more. The reason for making
bigger extents is because of the locking problem we have now when lots
of backends want to extend a relation, but, if I'm following correctly,
that'd go away with Andres' approach.
We don't have full patches for either of these and so I don't mind
saying that, basically, I'd prefer to see if we still have a big
bottleneck here with lots of backends trying to extend the same relation
before we work on adding this particular feature in as it might end up
being unnecessary. Now, if someone shows up tomorrow with a patch to do
this and Andres' approach ends up not progressing, then we should
certainly consider it (in due time and with consideration to the
activities going on for 9.5, of course).
Thanks!
Stephen
On Mon, Mar 30, 2015 at 8:57 PM, Stephen Frost <sfrost@snowman.net> wrote:
If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need
to adjust how big the extents are any more. The reason for making
bigger extents is because of the locking problem we have now when lots
of backends want to extend a relation, but, if I'm following correctly,
that'd go away with Andres' approach.
The benefit of extending in bigger chunks in the background is that
backends would need to perform such an operation at a relatively lower
frequency, which in itself could be a win.
We don't have full patches for either of these and so I don't mind
saying that, basically, I'd prefer to see if we still have a big
bottleneck here with lots of backends trying to extend the same relation
before we work on adding this particular feature in as it might end up
being unnecessary.
Agreed, I think it is better to first see the results of the current
patch on which Andres is working, and then if someone is interested
and can show any real benefit with the patch to extend relation
in bigger chunks, then that might be worth consideration.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 3/30/15 10:48 PM, Amit Kapila wrote:
If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need
to adjust how big the extents are any more. The reason for making
bigger extents is because of the locking problem we have now when lots
of backends want to extend a relation, but, if I'm following correctly,
that'd go away with Andres' approach.
The benefit of extending in bigger chunks in the background is that
backends would need to perform such an operation at a relatively lower
frequency, which in itself could be a win.
The other potential advantage (and I have to think this could be a BIG
advantage) is extending by a large amount makes it more likely you'll
get contiguous blocks on the storage. That's going to make a big
difference for SeqScan speed. It'd be interesting if someone with access
to some real systems could test that. In particular, seqscan of a
possibly fragmented table vs one of the same size but created at once.
For extra credit, compare to dd bs=8192 of a file of the same size as
the overall table.
What I've seen in the real world is very, very poor SeqScan performance
of tables that were relatively large. So bad that I had to SeqScan 8-16
tables in parallel to max out the IO system the same way I could with a
single dd bs=8k of a large file (in this case, something like 480MB/s).
A single SeqScan would only do something like 30MB/s.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 02-04-2015 AM 09:24, Jim Nasby wrote:
The other potential advantage (and I have to think this could be a BIG
advantage) is extending by a large amount makes it more likely you'll get
contiguous blocks on the storage. That's going to make a big difference for
SeqScan speed. It'd be interesting if someone with access to some real systems
could test that. In particular, seqscan of a possibly fragmented table vs one
of the same size but created at once. For extra credit, compare to dd bs=8192
of a file of the same size as the overall table.
Orthogonal to the topic of the thread, but this comment made me recall
a proposal from a couple of years ago[0] to add (posix_)fallocate to
mdextend(). Wonder if it helps the case?
Amit
[0]: /messages/by-id/CADupcHW1POmSuNoNMdVaWLTq-a3X_A3ZQMuSjHs4rCexiPgxAQ@mail.gmail.com
* Amit Langote (Langote_Amit_f8@lab.ntt.co.jp) wrote:
On 02-04-2015 AM 09:24, Jim Nasby wrote:
The other potential advantage (and I have to think this could be a BIG
advantage) is extending by a large amount makes it more likely you'll get
contiguous blocks on the storage. That's going to make a big difference for
SeqScan speed. It'd be interesting if someone with access to some real systems
could test that. In particular, seqscan of a possibly fragmented table vs one
of the same size but created at once. For extra credit, compare to dd bs=8192
of a file of the same size as the overall table.
Orthogonal to the topic of the thread, but this comment made me recall
a proposal from a couple of years ago[0] to add (posix_)fallocate to
mdextend(). Wonder if it helps
the case?
As I recall, it didn't, and further, modern filesystems are pretty good
about avoiding fragmentation anyway.
I'm not saying Jim's completely off-base with this idea, I'm just not
sure that it'll really buy us much.
Thanks,
Stephen
On Sun, Mar 29, 2015 at 11:56 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I'm not sure whether the above is the best solution however. For one I
think it's not necessarily a good idea to open-code this in hio.c - I've
not observed it, but this probably can happen for btrees and such as
well. For another, this is still an exclusive lock while we're doing IO:
smgrextend() wants a page to write out, so we have to be careful not to
overwrite things.
I think relaxing the global lock will mostly fix the contention.
However, several people have suggested that extending by many pages has
other benefits. This hints at a more fundamental change in our
storage model. Currently we map one file per relation. While that is
simple and robust, once partitioned tables and perhaps later columnar
storage are integrated into the core, this model needs further
thought. Think about a table with 1000 partitions and 100 columns: that
is 100K files, not to speak of other forks. Surely we can continue
pushing the file system's limits or playing around with vfds, but we have a
chance now to think ahead.
Most commercial databases employ a DMS storage model, where the
database manages object mapping and free space itself, so different
objects share storage within a few files. Surely that has historic
reasons, but it also has several advantages over our current model:
- removes fd pressure
- removes double buffering (by introducing ADIO)
- controlled layout and access patterns (sequential and read-ahead)
- better quota management
- potentially better performance
Considering the platforms we support and the stabilization period
needed, we would have to support both the current storage model and a
DMS model. I will stop here to see if this deserves further discussion.
Regards,
Qingqing
On Fri, Apr 17, 2015 at 11:19 AM, Qingqing Zhou
<zhouqq.postgres@gmail.com> wrote:
Most commercial database employs a DMS storage model, where it manages
object mapping and freespace itself. So different objects are sharing
storage within several files. Surely it has historic reasons, but it
has several advantages over current model:
- remove fd pressure
- remove double buffering (by introducing ADIO)
- controlled layout and access pattern (sequential and read-ahead)
- better quota management
- performance potentially better
Considering platforms supported and the stableness period needed, we
shall support both current storage model and DMS model. I will stop
here to see if this deserves further discussion.
Sorry, this might be considered double-posting, but I am wondering: have
we ever discussed this before? If we already have some conclusions on
this, could anyone share a link?
Thanks,
Qingqing
Hi,
I, every now and then, spent a bit of time making this more efficient
over the last few weeks.
I had a bit of a problem reproducing the problems I'd seen in
production on physical hardware (I found EC2 too variable to
benchmark this), but luckily 2ndQuadrant today allowed me access to
their four socket machine[1] of the AXLE project. Thanks Simon and
Tomas!
First, some mostly juicy numbers:
My benchmark was a parallel COPY into a single wal logged target
table:
CREATE TABLE data(data text);
The source data has been generated with
narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;
wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;
Between every test I ran a TRUNCATE data; CHECKPOINT;
For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs
would have been nice, but then I'd not have finished in time.
shared_buffers = 48GB, narrow table contents:
client tps after: tps before:
1 180.255577 210.125143
2 338.231058 391.875088
4 638.814300 405.243901
8 1126.852233 370.922271
16 1242.363623 498.487008
32 1229.648854 484.477042
48 1223.288397 468.127943
64 1198.007422 438.238119
96 1201.501278 370.556354
128 1198.554929 288.213032
196 1189.603398 193.841993
256 1144.082291 191.293781
512 643.323675 200.782105
shared_buffers = 1GB, narrow table contents:
client tps after: tps before:
1 191.137410 210.787214
2 351.293017 384.086634
4 649.800991 420.703149
8 1103.770749 355.947915
16 1287.192256 489.050768
32 1226.329585 464.936427
48 1187.266489 443.386440
64 1182.698974 402.251258
96 1208.315983 331.290851
128 1183.469635 269.250601
196 1202.847382 202.788617
256 1177.924515 190.876852
512 572.457773 192.413191
shared_buffers = 48GB, wide table contents:
client tps after: tps before:
1 59.685215 68.445331
2 102.034688 103.210277
4 179.434065 78.982315
8 222.613727 76.195353
16 232.162484 77.520265
32 231.979136 71.654421
48 231.981216 64.730114
64 230.955979 57.444215
96 228.016910 56.324725
128 227.693947 45.701038
196 227.410386 37.138537
256 224.626948 35.265530
512 105.356439 34.397636
shared_buffers = 1GB, wide table contents:
(ran out of patience)
Note that the peak performance with the patch is significantly better,
but there's currently a noticeable regression in single threaded
performance. That undoubtedly needs to be addressed.
So, to get to the actual meat: My goal was to essentially get rid of the
exclusive lock over relation extension altogether. I think I found a
way to do so that addresses the concerns made in this thread.
The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
might concurrently have been written by another backend and removed
from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.
I think this does handle the concurrency issues.
This patch very clearly is in the POC stage. But I do think the approach
is generally sound. I'd like to see some comments before deciding
whether to carry on.
Greetings,
Andres Freund
PS: Yes, I know that this level of precision in the benchmark numbers
isn't warranted, but I'm too lazy to truncate them.
[1]:
[10:28:11 PM] Tomas Vondra: 4x Intel Xeon E54620 Eight Core 2.2GHz
Processor’s generation Sandy Bridge EP
each core handles 2 threads, so 16 threads total
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5” Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5” Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 92718iCC controller and Cache Vault Kit (1GB cache)
2 x Nvidia Tesla K20 Active GPU Cards (GK110GL)
Attachments:
0001-WIP-Saner-heap-extension.patch (text/x-patch; charset=us-ascii, +417 -177)
Hi,
Eeek, the attached patch included a trivial last minute screwup
(dereferencing bistate unconditionally...). Fixed version attached.
Andres