New FSM allocation policy

Started by Heikki Linnakangas over 17 years ago, 10 messages
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

The way my rewritten FSM works at the moment is that pages are handed
out of the FSM in random order, to spread the load of multiple backends
to different pages. That's good for spreading the load, but it occurred
to me while observing a test case that inserts a lot of rows to an
almost empty table with plenty of free space, that that sucks for the
case of a single backend populating a table. Currently, FSM will hand
out pages in sequential order, so from the OS's point of view we're reading
and writing the pages sequentially. If the pages are handed out in
random order, we'll do random I/O instead.

Fortunately there's an easy fix for that. If we optimize
RecordAndGetPageWithFreeSpace so that it will always return the next
page if it has enough space, we'll be doing sequential I/O again. That's
trivial as long as the next heap page is on the same FSM page, and
probably not too hard even if it's not. If we limit this optimization to
within the same FSM page, we'll effectively be filling the table fully,
one 32MB stripe at a time.
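
To illustrate the idea, roughly (avail[] stands in for the FSM lookup;
the names and the constant are made up, this is not the actual patch):

#define SLOTS_PER_FSM_PAGE 4096     /* heap pages covered by one FSM page */

/*
 * Return old_page + 1 if it is tracked by the same FSM page and has
 * enough room, else -1 so the caller falls back to the normal search.
 */
static int
next_page_if_it_fits(int old_page, int need, const int *avail)
{
    int     next = old_page + 1;

    if (next / SLOTS_PER_FSM_PAGE != old_page / SLOTS_PER_FSM_PAGE)
        return -1;                  /* would cross an FSM page boundary */
    return (avail[next] >= need) ? next : -1;
}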

Thoughts?

I'm running more tests, and there are more issues that need discussion,
but I'll start separate threads for each. I'll also post an updated
patch separately.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2 Gregory Stark
stark@enterprisedb.com
In reply to: Heikki Linnakangas (#1)
Re: New FSM allocation policy

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Fortunately there's an easy fix for that. If we optimize
RecordAndGetPageWithFreeSpace so that it will always return the next page if it
has enough space, we'll be doing sequential I/O again. That's trivial as long
as the next heap page is on the same FSM page, and probably not too hard even
if it's not. If we limit this optimization to within the same FSM page, we'll
effectively be filling the table fully, one 32MB stripe at a time.

Starting from an arbitrary block, that would be a 16MB stripe on average.

One idea, we could scan the rest of the current page and use the first match.

Another, given the way your tree structure works, you can also descend
the tree with a "target" page. You can find the first page with enough
free space after the target page, if there are any. (Take the left
branch if it's > target and has enough free space; else take the right
branch if there's enough free space; else take the left branch.)
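
As a toy, self-contained model of that descent (the array layout and all
names here are made up for illustration; this is not the FSM code):

#include <stdio.h>

#define NPAGES 8                    /* toy heap size; a power of two here */

/*
 * tree[1] is the root; the leaf for page i lives at tree[NPAGES + i].
 * Each internal node holds the max free space of its subtree.
 */
static int tree[2 * NPAGES];

static void
fsm_build(const int *free_space)
{
    int     i;

    for (i = 0; i < NPAGES; i++)
        tree[NPAGES + i] = free_space[i];
    for (i = NPAGES - 1; i >= 1; i--)
        tree[i] = tree[2 * i] > tree[2 * i + 1] ? tree[2 * i] : tree[2 * i + 1];
}

/*
 * First page >= target with at least 'need' bytes free within the range
 * [lo, hi] covered by 'node', or -1 if there is none.
 */
static int
search_after(int node, int lo, int hi, int target, int need)
{
    int     mid, res;

    if (tree[node] < need || hi < target)
        return -1;                  /* prune: nothing usable in this subtree */
    if (node >= NPAGES)
        return node - NPAGES;       /* leaf: this page qualifies */
    mid = (lo + hi) / 2;
    res = search_after(2 * node, lo, mid, target, need);
    if (res >= 0)
        return res;
    return search_after(2 * node + 1, mid + 1, hi, target, need);
}

int
main(void)
{
    int     free_space[NPAGES] = {50, 10, 80, 0, 30, 90, 5, 60};
    int     res;

    fsm_build(free_space);
    res = search_after(1, 0, NPAGES - 1, 3, 40);
    if (res < 0)                    /* nothing at or after target: wrap around */
        res = search_after(1, 0, NPAGES - 1, 0, 40);
    printf("%d\n", res);            /* prints 5: first page >= 3 with >= 40 free */
    return 0;
}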

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#2)
Re: New FSM allocation policy

Gregory Stark <stark@enterprisedb.com> writes:

One idea, we could scan the rest of the current page and use the first match.

Another, given the way your tree structure works, you can also descend
the tree with a "target" page. You can find the first page with enough
free space after the target page, if there are any. (Take the left
branch if it's > target and has enough free space; else take the right
branch if there's enough free space; else take the left branch.)

I think the trick here is how to also preserve the property that
different backends tend to be inserting into different pages. There may
be no very good way to do that without maintaining some shared state,
ie the last page handed out.

However, it would probably be sufficient to do that for some pretty
small number of tables, allowing a fixed-size shared memory area to be
sufficient.

regards, tom lane

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: New FSM allocation policy

Tom Lane wrote:

Gregory Stark <stark@enterprisedb.com> writes:

One idea, we could scan the rest of the current page and use the first match.

Another, given the way your tree structure works, you can also descend
the tree with a "target" page. You can find the first page with enough
free space after the target page, if there are any. (Take the left
branch if it's > target and has enough free space; else take the right
branch if there's enough free space; else take the left branch.)

I think the trick here is how to also preserve the property that
different backends tend to be inserting into different pages.

Yep. If we just always look at the next page, there's the danger of
having multiple backends compete for the same pages again.

There may
be no very good way to do that without maintaining some shared state,
ie the last page handed out.

However, it would probably be sufficient to do that for some pretty
small number of tables, allowing a fixed-size shared memory area to be
sufficient.

.. similar to how we handle synchronized seqscans. Yep, that's one
option. I wish we could get rid of the shared memory area and lock
altogether, though.

We could put an extra field on the FSM root page. Hmm, it doesn't
actually need to be a single global field, so we could have a field like
that on every FSM page, for better concurrency. So the algorithm for the
set+search operation becomes:

1. Lock FSM page for the old heap page X
2. Set the value for X
3. See if the page pointed to by (fsmpage->nextPtr++) has enough free
space. If it does, return that.
4. Otherwise, start searching from root, with random traversal.
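
To make that concrete, a rough single-backend toy of the flow above
(names made up, step 1's locking omitted; this is not the patch itself):

#define NPAGES 4096                 /* heap pages tracked by this FSM page */

static int avail[NPAGES];           /* free space per heap page, in bytes */
static int next_ptr;                /* the unlogged per-FSM-page hint */

static int
search_from_root(int need)
{
    int     i;

    /* Stand-in for the randomized tree search; here just a linear scan. */
    for (i = 0; i < NPAGES; i++)
        if (avail[i] >= need)
            return i;
    return -1;
}

static int
record_and_get_page(int old_page, int old_avail, int need)
{
    int     candidate;

    /* Step 1, locking the FSM page, is omitted in this toy. */
    avail[old_page] = old_avail;            /* step 2: record new free space */
    candidate = next_ptr;                   /* step 3: try the hinted page */
    next_ptr = (next_ptr + 1) % NPAGES;
    if (avail[candidate] >= need)
        return candidate;
    return search_from_root(need);          /* step 4: fall back to a search */
}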

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#4)
Re: New FSM allocation policy

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

There may
be no very good way to do that without maintaining some shared state,
ie the last page handed out.

We could put an extra field on the FSM root page. Hmm, it doesn't
actually need to be a single global field, so we could have a field like
that on every FSM page, for better concurrency.

Yeah, that would work, and it is only a hint so you needn't WAL-log
changing it.

regards, tom lane

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#4)
Re: New FSM allocation policy

On Fri, 2008-08-29 at 18:55 +0300, Heikki Linnakangas wrote:

Tom Lane wrote:

Gregory Stark <stark@enterprisedb.com> writes:

One idea, we could scan the rest of the current page and use the first match.

Another, given the way your tree structure works, you can also descend
the tree with a "target" page. You can find the first page with enough
free space after the target page, if there are any. (Take the left
branch if it's > target and has enough free space; else take the right
branch if there's enough free space; else take the left branch.)

I think the trick here is how to also preserve the property that
different backends tend to be inserting into different pages.

Yep. If we just always look at the next page, there's the danger of
having multiple backends compete for the same pages again.

Can the FSM hand out page ranges? That way we would be able to use the
next page logic without fear of competition.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#5)
2 attachment(s)
Re: New FSM allocation policy

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

There may
be no very good way to do that without maintaining some shared state,
ie the last page handed out.

We could put an extra field on the FSM root page. Hmm, it doesn't
actually need to be a single global field, so we could have a field like
that on every FSM page, for better concurrency.

Yeah, that would work, and it is only a hint so you needn't WAL-log
changing it.

Done. That seems to have helped a lot with the case of two concurrent
backends bulk inserting to a table.

I'm now happy with the bulk insert performance and behavior. I've been
using the attached test case to measure that. It uses pgbench to measure
the speed of bulk inserting one million rows, with these variations:
- one client, start with a freshly-created table (i.e. the FSM is empty, and
the table is extended)
- same, but with two clients
- one client, start with a pre-existing, but empty, table. So all
requests for free space are satisfied by the FSM, and the table is not extended.
- same, but with two clients.

Running that test on my laptop, with the data directory on a RAM drive to
measure just the CPU overhead and concurrency, it looks like the patched
version is now on par with the current implementation. Looking at the
pattern in which pages are filled, by polling pg_freespace('tmp') while the
test is running, the pages are being filled sequentially, with the
target pages of the two backends interleaved, which is as expected and
emulates the behavior of the current implementation. I haven't tried
larger I/O-bound tests yet, but because the access pattern is now the
same as without the patch, I would expect the performance to be comparable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fsm-lazy-2.patch.gz (application/x-gzip)
populatebench.sh (application/x-sh)
#8 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#1)
Re: New FSM allocation policy

Heikki Linnakangas wrote:

The way my rewritten FSM works at the moment is that pages are handed
out of the FSM in random order, to spread the load of multiple backends
to different pages. That's good for spreading the load, but it occurred
to me while observing a test case that inserts a lot of rows to an
almost empty table with plenty of free space, that that sucks for the
case of a single backend populating a table. Currently, FSM will hand
out pages in sequential order, so from the OS's point of view we're reading
and writing the pages sequentially. If the pages are handed out in
random order, we'll do random I/O instead.

Fortunately there's an easy fix for that. If we optimize
RecordAndGetPageWithFreeSpace so that it will always return the next
page if it has enough space, we'll be doing sequential I/O again. That's
trivial as long as the next heap page is on the same FSM page, and
probably not too hard even if it's not. If we limit this optimization to
within the same FSM page, we'll effectively be filling the table fully,
one 32MB stripe at a time.

Thoughts?

I'm running more tests, and there are more issues that need discussion,
but I'll start separate threads for each. I'll also post an updated
patch separately.

One other thing to keep in mind is that VACUUM can reduce a table's size
if the trailing blocks are empty, so there is some gain if the earlier
parts of the table are preferred for inserts.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#9 Decibel!
decibel@decibel.org
In reply to: Bruce Momjian (#8)
Re: New FSM allocation policy

On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:

Fortunately there's an easy fix for that. If we optimize
RecordAndGetPageWithFreeSpace so that it will always return the next
page if it has enough space, we'll be doing sequential I/O again.
That's trivial as long as the next heap page is on the same FSM page,
and probably not too hard even if it's not. If we limit this
optimization to within the same FSM page, we'll effectively be filling
the table fully, one 32MB stripe at a time.

Thoughts?

I'm running more tests, and there are more issues that need discussion,
but I'll start separate threads for each. I'll also post an updated
patch separately.

One other thing to keep in mind is that VACUUM can reduce a table's size
if the trailing blocks are empty, so there is some gain if the earlier
parts of the table are preferred for inserts.

Yeah; I would actually really, really like to see a mode you could
set on a table that says "I want to try and shrink this table". One
of the things that would mean is that the FSM should prefer pages at
the beginning of the heap.

Also related to this is the idea of asking the FSM for pages within a
specific range so that you can try and maintain cluster order on a
table. You would look in the clustering index for the closest value
to your key and where it is in the heap and then ask for a page in
that neighborhood. (You'd probably want to look at more than just one
index tuple, but you get the idea).
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

#10 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Decibel! (#9)
Re: New FSM allocation policy

Decibel! wrote:

On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:

One other thing to keep in mind is that VACUUM can reduce a table's size
if the trailing blocks are empty, so there is some gain if the earlier
parts of the table are preferred for inserts.

Yeah; I would actually really, really like to see a mode you could set
on a table that says "I want to try and shrink this table". One of the
things that would mean is that the FSM should prefer pages at the
beginning of the heap.

Not sure how exactly that should work, but it should be pretty easy to
do with the new FSM data structure. Perhaps VACUUM should just reset the
"next pointers" as it goes.

Also related to this is the idea of asking the FSM for pages within a
specific range so that you can try and maintain cluster order on a
table. You would look in the clustering index for the closest value to
your key and where it is in the heap and then ask for a page in that
neighborhood. (You'd probably want to look at more than just one index
tuple, but you get the idea).

The new FSM data structure should make that much easier to implement as
well, as it naturally supports the operation "give me a page with X bytes
of free space, as close as possible to page Y".
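
For illustration, the operation could look something like this toy
linear version (the real FSM would descend its tree instead of scanning;
avail[] and the names are made up):

static int
closest_page_with_space(int y, int need, const int *avail, int npages)
{
    int     d;

    /* Expand outward from page y, preferring the lower page on ties. */
    for (d = 0; d < npages; d++)
    {
        if (y - d >= 0 && avail[y - d] >= need)
            return y - d;
        if (y + d < npages && avail[y + d] >= need)
            return y + d;
    }
    return -1;                      /* no page has enough free space */
}

Breaking ties toward the lower-numbered page would also lean toward the
start of the table, which fits Bruce's point above about making it
easier for VACUUM to truncate the tail.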

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com