New FSM patch
Here's an updated FSM patch. Changes since last patch:
- Per comments and discussion with Simon, I've changed the "bubble up"
behavior so that when a bottom-level page is updated, if the amount of
free space has decreased, the change is not immediately bubbled up to the
upper page. Instead, searchers that traverse down the tree will update
the upper pages when they see that they're out of sync. This should
alleviate the worry that we need to keep a bottom-level page exclusively
locked across I/O.
- Page-level routines have been split to a separate file fsmpage.c.
While there isn't that much code in it, it makes the separation between
functions that operate on a single page and functions that operate
across pages clearer.
- Fixed some minor bugs.
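The lazy bubble-up scheme above can be sketched on a toy in-page tree. This is a standalone model, not the patch's actual code; the names and sizes are simplified for illustration:

```c
#include <assert.h>

#define LEAVES 4
#define NODES  (2 * LEAVES - 1)

static unsigned char tree[NODES];   /* heap order: children of i are 2i+1, 2i+2 */

/* Record free space for a leaf.  An increase is bubbled up immediately so
   searchers can find the new space; a decrease deliberately is NOT, which
   is what lets the bottom-level page be unlocked before touching uppers. */
static void fsm_set_avail(int leaf, unsigned char avail)
{
    int i = LEAVES - 1 + leaf;

    tree[i] = avail;
    while (i > 0)
    {
        int parent = (i - 1) / 2;

        if (tree[parent] >= avail)
            break;              /* decrease (or no change): leave parent stale */
        tree[parent] = avail;
        i = parent;
    }
}

/* Search for a leaf with at least "want" free space.  If we descend into a
   node whose children can't actually satisfy the request, the node was out
   of sync: repair it and restart, as searchers do in the patch. */
static int fsm_search_avail(unsigned char want)
{
restart:
    if (tree[0] < want)
        return -1;
    {
        int i = 0;

        while (i < LEAVES - 1)
        {
            int l = 2 * i + 1;
            int r = 2 * i + 2;

            if (tree[l] >= want)
                i = l;
            else if (tree[r] >= want)
                i = r;
            else
            {
                /* stale upper node: fix it up and retry from the top */
                tree[i] = (tree[l] > tree[r]) ? tree[l] : tree[r];
                goto restart;
            }
        }
        return i - (LEAVES - 1);    /* leaf number */
    }
}
```

The key property: after a decrease, the root still advertises the old value, and the next searcher that hits the dead end repairs the upper nodes on its way down.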
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
fsm-lazy-1.patch.gz (application/x-gzip)
On Fri, 2008-08-29 at 10:47 +0300, Heikki Linnakangas wrote:
- Per comments and discussion with Simon, I've changed the "bubble up"
behavior so that when a bottom-level page is updated, if the amount of
free space was decreased, the change is not immediately bubbled up to
upper page. Instead, searchers that traverse down the tree will update
the upper pages when they see that they're out of sync. This should
alleviate the worry that we need to keep a bottom-level page exclusively
locked across I/O.
Thanks for taking time with the new FSM.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-08-29 at 10:47 +0300, Heikki Linnakangas wrote:
Here's an updated FSM patch.
Can I check some aspects of this related to Hot Standby? Some of them
sound obvious, but worth double checking.
* There will be no need to read FSM by any normal operation of a
read-only transaction, so locking correctness considerations can
possibly be ignored during recovery. pg_freespacemap exists though:
would we need to prevent that from executing during recovery, or will
the FSM be fully readable? i.e. does redo take appropriate locks already
(I don't see any Cleanup locks being required).
* FSM will be continuously maintained during recovery, so FSM will now
be correct and immediately available when recovery completes?
* There are no cases where a screwed-up FSM will crash either recovery
(FATAL+) or halt normal operation (PANIC)?
* incomplete action cleanup is fairly cheap and doesn't rely on the FSM
being searchable to correct the error? This last is a hard one...
Do we have the concept of an invalid/corrupt FSM? What happens if the
logic goes wrong and we have a corrupt page? Will that mean we can't
complete actions against the heap?
Are there really any changes to these files?
src/include/storage/bufmgr.h
src/include/postmaster/bgwriter.h
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Thanks for the review!
Simon Riggs wrote:
Can I check some aspects of this related to Hot Standby? Some of them
sound obvious, but worth double checking.
* There will be no need to read FSM by any normal operation of a
read-only transaction, so locking correctness considerations can
possibly be ignored during recovery.
Correct.
A HOT prune operation doesn't currently update the FSM, but if it did,
that would be a case of a read-only transaction updating the FSM. But we
can't prune anyway in a hot standby.
pg_freespacemap exists though:
would we need to prevent that from executing during recovery, or will
the FSM be fully readable? i.e. does redo take appropriate locks already
(I don't see any Cleanup locks being required).
pg_freespacemap, the contrib module? Yeah, the FSM should be fully readable.
During normal operation, when a bottom level page is updated, and the
update needs to be bubbled up, the upper level page is pinned and locked
before the lock on the lower level page is released. That interlocking
is not done during WAL replay, and the lock on the lower level page is
released before locking the upper page. It's not required during WAL
replay, as there are no concurrent updates to the FSM.
* FSM will be continuously maintained during recovery, so FSM will now
be correct and immediately available when recovery completes?
Correct.
* There are no cases where a screwed-up FSM will crash either recovery
(FATAL+) or halt normal operation (PANIC)?
Hmm, there are no explicit elog(FATAL/PANIC) calls, but if the FSM is
really corrupt, you can probably get a segfault. That should be fixable
by adding more sanity checks, though.
* incomplete action cleanup is fairly cheap and doesn't rely on the FSM
being searchable to correct the error? This last is a hard one...
Correct.
Do we have the concept of a invalid/corrupt FSM? What happens if the
logic goes wrong and we have a corrupt page? Will that mean we can't
complete actions against the heap?
Some scenarios I can think of:
Scenario: The binary tree on a page is corrupt, so that the value of an
upper node is < Max(leftchild, rightchild).
Consequence: Searchers won't see the free space below that node, and
will look elsewhere.
Scenario: The binary tree on a page is corrupt, so that the value of an
upper node is > Max(leftchild, rightchild).
Consequence: Searchers will notice the corruption while trying to
traverse down that path, and throw an elog(WARNING) in search_avail().
fsm_search will retry the search, and will in worst case go into an
infinite loop. That's obviously not good. We could automatically fix the
upper nodes of the tree, but that would wipe evidence that would be
useful in debugging.
Scenario: An upper level page is corrupt, claiming that there's no free
space on a lower level page, while there actually is. (the opposite,
where an upper level page claims that there *is* free space on a lower
level page, while there actually isn't, is now normal. The searcher will
update the upper level page in that case)
Consequence: A searcher won't see that free space, and will look elsewhere.
Scenario: An upper level page is corrupt, claiming that there is free
space on a lower level page that doesn't exist.
Consequence: Searchers will elog(ERROR), trying to read the non-existent
FSM page.
The 3rd scenario would lead to heap inserts/updates failing. We could
avoid that by checking that the page exists with
RelationGetNumberOfBlocks(), but I'm not sure if it's worth the cost.
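As for adding more sanity checks: a check of the in-page invariant (upper node == Max(leftchild, rightchild)) might look like the following. A standalone sketch, not the patch's code:

```c
/* Return the index of the first upper node that violates the invariant
   value(node) == Max(leftchild, rightchild), or -1 if the page-local tree
   is consistent.  A too-small upper node merely hides free space; a
   too-large one sends searchers down dead-end paths (the WARNING-and-retry
   case discussed above). */
static int fsm_check_page(const unsigned char *nodes, int leaves)
{
    int i;

    for (i = 0; i < leaves - 1; i++)   /* upper (non-leaf) nodes only */
    {
        unsigned char l = nodes[2 * i + 1];
        unsigned char r = nodes[2 * i + 2];
        unsigned char max = (l > r) ? l : r;

        if (nodes[i] != max)
            return i;
    }
    return -1;
}
```

Such a check covers both corruption directions in the first two scenarios, at the cost of an extra pass over the node array.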
Are there really any changes to these files?
src/include/storage/bufmgr.h
src/include/postmaster/bgwriter.h
Hmm, apparently not.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-09-04 at 11:07 +0300, Heikki Linnakangas wrote:
Thanks for the review!
Not as thorough as I would have liked, I must admit.
Thanks for the other confirmations.
Scenario: The binary tree on a page is corrupt, so that the value of an
upper node is > Max(leftchild, rightchild).
Consequence: Searchers will notice the corruption while trying to
traverse down that path, and throw an elog(WARNING) in search_avail().
fsm_search will retry the search, and will in worst case go into an
infinite loop. That's obviously not good. We could automatically fix the
upper nodes of the tree, but that would wipe evidence that would be
useful in debugging.
We probably need to break out of infinite loops, especially ones that
output warning messages on each loop. :-)
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Thu, 2008-09-04 at 11:07 +0300, Heikki Linnakangas wrote:
Scenario: The binary tree on a page is corrupt, so that the value of an
upper node is > Max(leftchild, rightchild).
Consequence: Searchers will notice the corruption while trying to
traverse down that path, and throw an elog(WARNING) in search_avail().
fsm_search will retry the search, and will in worst case go into an
infinite loop. That's obviously not good. We could automatically fix the
upper nodes of the tree, but that would wipe evidence that would be
useful in debugging.
We probably need to break out of infinite loops, especially ones that
output warning messages on each loop. :-)
Yep. I turned that warning into an error in the latest patch I just
posted elsewhere in this thread.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a):
Here's an updated FSM patch. Changes since last patch:
Yesterday, I started reviewing your patch. To begin with, I have some
general questions:
1) If I understand correctly, the main goal is to improve the FSM to cover
all pages in the file, which is useful for huge databases.
2) Did you perform any benchmark? Is there any performance improvement
or penalty?
3) How does it work when the database has many active parallel connections?
Zdenek
Zdenek Kotala wrote:
Yesterday, I started reviewing your patch.
Thanks!
1) If I understand correctly, the main goal is to improve the FSM to cover
all pages in the file, which is useful for huge databases.
That's not a goal per se, though it's true that the new FSM does cover
all pages. The goals are to:
- eliminate max_fsm_pages and max_fsm_relations GUC variables, so that
there's one thing less to configure
- make the FSM immediately available and useful after recovery (eg. warm
standby)
- make it possible to update the FSM retail (piecemeal), which will be
needed for partial vacuum
2) Did you perform any benchmark? Is there any performance improvement
or penalty?
Working on it. I've benchmarked some bulk-insertion scenarios, and the
new FSM is now comparable to the current implementation in those tests.
I've also been working on a low level benchmark using a C user-defined
function that exercises just the FSM, showing the very raw CPU
performance vs. current implementation. More on that later, but ATM it
looks like the new implementation can be faster or slower than the
current one, depending on the table size.
The biggest potential performance issue, however, is the fact that the
new FSM implementation is WAL-logged. That shows up dramatically in the
raw test where there's no other activity than FSM lookups and updates,
but will be much less interesting in real life where FSM lookups are
always related to some other updates which are WAL-logged anyway.
I also ran some DBT-2 tests without think times, with a small number of
warehouses. But the results of that had such a high variability from
test to test, that any difference in FSM speed would've been lost in the
noise.
Do you still have the iGen setup available? Want to give it a shot?
3) How does it work when the database has many active parallel connections?
The new FSM should in principle scale better than the old one. However,
Simon raised a worry about the WAL-logging: WALInsertLock can already
become the bottleneck in OLTP-scenarios with very high load and many
CPUs. The FSM isn't any worse than other actions that generate WAL, but
naturally if you're bottlenecked by the WAL lock or bandwidth, any
increase in WAL traffic will show up as an overall performance loss.
I'm not too worried about that, myself, because in typical scenarios the
extra WAL traffic generated by the FSM should be insignificant in volume
compared to all the other WAL traffic. But Simon will probably demand
some hard evidence of that ;-).
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a):
Zdenek Kotala wrote:
Do you still have the iGen setup available? Want to give it a shot?
Not sure if I have it configured, need to check. But I'll try it, or I'll ask
Jignesh or Paul if they have free time. They are real benchmark gurus.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
I have a question about the FSM structure on the page. Looking at the code,
it seems to me that the tree structure is "degenerate" on the right side.
This is probably a space optimization, because a complete tree would need
BLCKSZ/2 of space on the page, and the rest would be empty.
Is my assumption correct? If so, maybe an extra note about it in
fsm_internal.h would be helpful.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Zdenek Kotala wrote:
I have a question about the FSM structure on the page. Looking at the code,
it seems to me that the tree structure is "degenerate" on the right side.
This is probably a space optimization, because a complete tree would need
BLCKSZ/2 of space on the page, and the rest would be empty.
Is my assumption correct? If so, maybe an extra note about it in
fsm_internal.h would be helpful.
Yes, that's right. I'll add a note on that if it helps.
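For the archives: the degeneration follows from heap-style node addressing (children of node i at 2i+1 and 2i+2) combined with a leaf count that isn't rounded up to a power of two. A hypothetical standalone sketch of the addressing, not the patch's code:

```c
/* With children of node i at 2*i+1 and 2*i+2, and the node array sized for
   exactly nonleaf + leaves nodes rather than a full power-of-two tree,
   nodes on the right edge can have one child or none: the tree
   "degenerates" on the right instead of wasting nearly half the page. */
static int fsm_child_count(int node, int nonleaf, int leaves)
{
    int total = nonleaf + leaves;
    int l = 2 * node + 1;
    int r = 2 * node + 2;

    if (l >= total)
        return 0;   /* right-edge upper node with no children at all */
    if (r >= total)
        return 1;   /* right-edge upper node with a single child */
    return 2;
}
```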
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Another question:
Do we need random_bool to spread the workload? It seems useless to me,
because it also makes one backend use more pages instead of using one which
is already in the buffer cache. I think it could generate a lot of extra I/O.
Don't forget the WAL full-page write for a first-time modified page.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Zdenek Kotala wrote:
Do we need random_bool to spread the workload? It seems useless to me,
because it also makes one backend use more pages instead of using one
which is already in the buffer cache.
I think it could generate a lot of extra I/O. Don't forget the WAL
full-page write for a first-time modified page.
random_bool() is gone in the latest version of the patch, in favor of a
"next pointer". You must be looking at an old version, and I must've
forgotten to update the link in the Wiki. That change was discussed in
the "New FSM allocation policy" thread.
Anyway, here's the new version for your convenience, and I also added
a paragraph to the README, mentioning that the tree is degenerate on
the right.
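For reference, the "next pointer" policy can be modeled like this. A standalone sketch; the field and function names here are made up for the example:

```c
#define SLOTS 8

typedef struct
{
    int next_slot;              /* where the next search starts */
    unsigned char avail[SLOTS]; /* free-space value per slot */
} FsmPageModel;

/* Return the first slot at or after next_slot with at least "want" free
   space, wrapping around once.  Advancing next_slot past the returned slot
   steers concurrent inserters to different pages, without the cache-hostile
   randomness of random_bool(). */
static int fsm_get_slot(FsmPageModel *p, unsigned char want)
{
    int n;

    for (n = 0; n < SLOTS; n++)
    {
        int slot = (p->next_slot + n) % SLOTS;

        if (p->avail[slot] >= want)
        {
            p->next_slot = (slot + 1) % SLOTS;
            return slot;
        }
    }
    return -1;                  /* nothing on this page satisfies "want" */
}
```

A single backend doing repeated inserts keeps filling the same page until it runs out, while concurrent backends naturally fan out across slots.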
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
I've also been working on a low level benchmark using a C user-defined
function that exercises just the FSM, showing the very raw CPU
performance vs. current implementation. More on that later, but ATM it
looks like the new implementation can be faster or slower than the
current one, depending on the table size.
Let me describe this test case first:
- The test program calls RecordAndGetPageWithFreeSpace in a tight loop,
with random values. There's no activity to the heap. In normal usage,
the time spent in RecordAndGetPageWithFreeSpace is minuscule compared to the
heap and index updates that cause RecordAndGetPageWithFreeSpace to be called.
- WAL was placed on a RAM drive. This is of course not how people set up
their database servers, but the point of this test was to measure CPU
speed and scalability. The impact of writing extra WAL is significant
and needs to be taken into account, but that's a separate test and
discussion, and needs to be considered in comparison to the WAL written
by heap and index updates.
That said, the test results are pretty interesting.
I ran the test using a custom script with pgbench. I ran it with
different table sizes, and with 1 or 2 clients, on CVS HEAD and a
patched version. The unit is "thousands of RecordAndGetPageWithFreeSpace
calls per second":
Table size     Patched            CVS HEAD
               1 clnt   2 clnts   1 clnt   2 clnts
8 kB            4.59     3.45     62.83    26.85
336 kB         13.85     6.43     41.8     16.55
3336 kB        14.96     6.3      22.45    10.55
33336 kB       14.85     6.56      5.44     4.08
333336 kB      14.48    11.04      0.79     0.74
3333336 kB     12.68    11.5       0.07     0.07
33333336 kB     7.67     5.37      0.05     0.05
The big surprise to me was that performance on CVS HEAD tanks as the
table size increases. One possible explanation is that searches for X
bytes of free space, for a very high X, will not find any matches, and
the current FSM implementation ends up scanning through the whole FSM
list for that relation.
Another surprise was how badly both implementations scale. On CVS HEAD,
I expected the performance to be roughly the same with 1 and 2 clients,
because all access to the FSM is serialized on the FreeSpaceLock. But
adding the 2nd client not only didn't help, but it actually made the
performance much worse than with a single client. Context switching or
cache line contention, perhaps? The new FSM implementation shows the
same effect, which was an even bigger surprise. At table sizes > 32 MB,
the FSM no longer fits on a single FSM page, so I expected almost a
linear speed up with bigger table sizes from using multiple clients.
That's not happening, and I don't know why. Going from 33 MB to
333 MB, the performance with 2 clients almost doubles, but it still
doesn't exceed that with 1 client.
Going from 3 GB to 33 GB, the performance of the new implementation
drops. I don't know why, I think I'll run some more tests with big table
sizes to investigate that a bit more. The performance in the old
implementation stays almost the same at that point; I believe that's
because max_fsm_pages is exceeded at that point.
All in all, this isn't a very realistic test case, but it's interesting
nevertheless. I'm happy with the performance of the new FSM on this
test, as it's in the same ballpark as the old one, even though it's not
quite what I expected.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a):
Heikki Linnakangas wrote:
I've also been working on a low level benchmark using a C user-defined
function that exercises just the FSM, showing the very raw CPU
performance vs. current implementation. More on that later, but ATM it
looks like the new implementation can be faster or slower than the
current one, depending on the table size.
Let me describe this test case first:
- The test program calls RecordAndGetPageWithFreeSpace in a tight loop,
with random values. There's no activity to the heap. In normal usage,
the time spent in RecordAndGetPageWithFreeSpace is minuscule compared to the
heap and index updates that cause RecordAndGetPageWithFreeSpace to be called.
- WAL was placed on a RAM drive. This is of course not how people set up
their database servers, but the point of this test was to measure CPU
speed and scalability. The impact of writing extra WAL is significant
and needs to be taken into account, but that's a separate test and
discussion, and needs to be considered in comparison to the WAL written
by heap and index updates.
That said, the test results are pretty interesting.
I ran the test using a custom scripts with pgbench. I ran it with
different table sizes, and with 1 or 2 clients, on CVS HEAD and a
patched version. The unit is "thousands of RecordAndGetPageWithFreeSpace
calls per second":
Table size     Patched            CVS HEAD
               1 clnt   2 clnts   1 clnt   2 clnts
8 kB            4.59     3.45     62.83    26.85
336 kB         13.85     6.43     41.8     16.55
3336 kB        14.96     6.3      22.45    10.55
33336 kB       14.85     6.56      5.44     4.08
333336 kB      14.48    11.04      0.79     0.74
3333336 kB     12.68    11.5       0.07     0.07
33333336 kB     7.67     5.37      0.05     0.05
The big surprise to me was that performance on CVS HEAD tanks as the
table size increases. One possible explanation is that searches for X
bytes of free space, for a very high X, will not find any matches, and
the current FSM implementation ends up scanning through the whole FSM
list for that relation.
Another surprise was how badly both implementations scale. On CVS HEAD,
I expected the performance to be roughly the same with 1 and 2 clients,
because all access to the FSM is serialized on the FreeSpaceLock. But
adding the 2nd client not only didn't help, but it actually made the
performance much worse than with a single client. Context switching or
cache line contention, perhaps?
The new FSM implementation shows the
same effect, which was an even bigger surprise. At table sizes > 32 MB,
the FSM no longer fits on a single FSM page, so I expected almost a
linear speed up with bigger table sizes from using multiple clients.
That's not happening, and I don't know why. Although, going from 33MB to
333 MB, the performance with 2 clients almost doubles, but it still
doesn't exceed that with 1 client.
It looks like there are a lot of lock issues on the FSM pages. When the
number of FSM pages increases, the number of collisions is lower. That is
probably why 2 clients speed up significantly between 33 MB and 333 MB. I
think it is time to bring in DTrace ;-).
Do you have any machine with DTrace support? If not, send me your test suite
and I will try to run it on my machine.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Let me describe this test case first:
- The test program calls RecordAndGetPageWithFreeSpace in a tight loop,
with random values.
What's the distribution of the random values, exactly? In particular,
how do the request sizes compare to available free space per-page?
The design intent for FSM was that we'd not bother to record pages that
have less free space than the average request size, so as to (usually)
avoid the problem of uselessly searching a lot of entries. I can't tell
whether your test case models that behavior at all. If it does then
there may be something else that needs fixing.
regards, tom lane
Zdenek Kotala wrote:
It looks like there are a lot of lock issues on the FSM pages. When the
number of FSM pages increases, the number of collisions is lower. That is
probably why 2 clients speed up significantly between 33 MB and 333 MB.
Yes, that's what I thought as well. With table sizes under 33 MB, the FSM
consists of just one (bottom-level) FSM page.
I think it is time to bring in DTrace ;-).
Do you have any machine with DTrace support?
No.
If not, send me your test
suite and I will try to run it on my machine.
Sure, here you are. tests.sh is the main script to run. You'll need to
adjust the paths there for your environment.
As it is, the tests will take many hours to run, so you'll probably want
to modify tests.sh and pgbenchtests.sh to reduce the number of
iterations. At least on my server, the variance in the numbers was very
small, so repeating the tests 4 times in tests.sh is probably overkill.
Thanks!
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Let me describe this test case first:
- The test program calls RecordAndGetPageWithFreeSpace in a tight loop,
with random values.
What's the distribution of the random values, exactly? In particular,
how do the request sizes compare to available free space per-page?
The request and "old avail" sizes are in the range 0-8100
(random() % 8100).
The design intent for FSM was that we'd not bother to record pages that
have less free space than the average request size, so as to (usually)
avoid the problem of uselessly searching a lot of entries. I can't tell
whether your test case models that behavior at all. If it does then
there may be something else that needs fixing.
Probably not. The test case starts with a table that's practically
empty, so all pages are put into the FSM.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Hi Heikki,
I'm still working on the performance tests, but I have the following
comments/questions/suggestions:
1)
#define NodesPerPage (BLCKSZ - SizeOfPageHeaderData - offsetof(FSMPageData, fp_nodes))
should be
#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - offsetof(FSMPageData, fp_nodes))
See how PageGetContents is defined.
2) I suggest renaming the following functions:
GetParentNo -> FSMGetParentPageNo
GetChildNo -> FSMGetChildPageNo
GetFSMBlockNumber -> FSMGetBlockNumber
3) I'm not too happy with the idea that the page contains data that is
"invisible": special, lower and upper are unset, so it looks like an empty
page. I know that the hash index implementation does the same, but that
could be fixed too. I suggest setting special and upper correctly (or only
upper); lower should indicate that there are no line pointers (linp).
4) I suggest creating a structure:
struct foo {
    int level;
    int logpageno;
    int slot;
};
5) I see a potential infinite recursive loop in fsm_search.
6) Does FANOUT^4 fit into an int? (by the way, what does FANOUT mean?)
And there are more comments on ReviewBoard:
pgsql/src/backend/catalog/index.c
<http://reviewdemo.postgresql.org/r/27/#comment45>
Strange comment? Looks like copy paste error.
pgsql/src/backend/catalog/index.c
<http://reviewdemo.postgresql.org/r/27/#comment47>
?RELKIND_INDEX?
pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment40>
Extra space
pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment41>
Extra space
pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment42>
Extra space
pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment43>
Extra space
pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment44>
Extra space
pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment37>
Use shift, although the compiler could optimize it anyway.
pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment38>
Why? ;-)
pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment39>
What happens if FSM_FORKNUM does not exist?
pgsql/src/include/storage/bufmgr.h
<http://reviewdemo.postgresql.org/r/27/#comment36>
Need consolidate - forknum vs blockNum, zeroPage
pgsql/src/include/storage/freespace.h
<http://reviewdemo.postgresql.org/r/27/#comment35>
Cleanup
pgsql/src/include/storage/lwlock.h
<http://reviewdemo.postgresql.org/r/27/#comment49>
Maybe better to use RESERVED to preserve lock numbers. It helps DTrace
scripts be more backward compatible.
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Zdenek Kotala wrote:
I'm still working on the performance tests, but I have the following
comments/questions/suggestions:
Thanks!
1)
#define NodesPerPage (BLCKSZ - SizeOfPageHeaderData - offsetof(FSMPageData, fp_nodes))
should be
#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - offsetof(FSMPageData, fp_nodes))
See how PageGetContents is defined.
Yes, good catch.
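To spell out why the MAXALIGN matters, here is a standalone sketch; the MAXIMUM_ALIGNOF of 8 is an assumption for the example, it is platform-dependent in PostgreSQL:

```c
#include <stddef.h>

#define MAXIMUM_ALIGNOF 8   /* assumed here; platform-dependent in PostgreSQL */
#define MAXALIGN(LEN) \
    (((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/* Page contents start at the MAXALIGN'ed end of the page header (that's
   what PageGetContents points past), so usable node space must be computed
   from the aligned header size, or the node array could overrun the page
   by up to MAXIMUM_ALIGNOF - 1 bytes. */
static size_t nodes_per_page(size_t blcksz, size_t header_size, size_t fixed)
{
    return blcksz - MAXALIGN(header_size) - fixed;
}
```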
2) I suggest renaming the following functions:
GetParentNo -> FSMGetParentPageNo
GetChildNo -> FSMGetChildPageNo
GetFSMBlockNumber -> FSMGetBlockNumber
Well, they're just static functions. But sure, why not.
3) I'm not too happy with the idea that the page contains data that is
"invisible": special, lower and upper are unset, so it looks like an
empty page. I know that the hash index implementation does the same, but
that could be fixed too.
I suggest setting special and upper correctly (or only upper); lower
should indicate that there are no line pointers (linp).
Hmm. In the B-tree metapage, pd_lower is set to point to the end of the
struct that's stored there. That's because it allows the xlog code to
skip the unused space in full-page images, but that's not applicable to
FSM pages.
I think I'll fix that so that the data is stored in the special part,
and the special part fills the whole page.
4) I suggest creating a structure:
struct foo {
int level;
int logpageno;
int slot;
}
Yes, that might be helpful.
5) I see potential infinite recursive loop in fsm_search.
Yes, that's quite subtle. The recursion should end eventually, because
whenever we reach a dead-end when we descend the tree, we fix the upper
nodes so that we shouldn't take that dead-end path on the next iteration.
That said, perhaps it would be a good idea to put a counter there and
stop after, say, a few thousand iterations, just in case. In any case, it
looks like it needs more comments.
I think I'll restructure that into a loop, instead of recursion.
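The loop form with a safety cap might look roughly like this. A standalone sketch; the cap value, names, and return-code convention are invented for illustration:

```c
/* Retry loop with a cap: each dead end repairs an upper node, so the search
   normally converges, but a corrupt page could otherwise retry forever.
   The callback returns the found page (>= 0), -1 for "dead end repaired,
   retry", or -2 for "definitely no free space anywhere". */
#define MAX_SEARCH_RESTARTS 1000

static int fsm_search_loop(int (*try_once)(void *), void *state)
{
    int restarts;

    for (restarts = 0; restarts < MAX_SEARCH_RESTARTS; restarts++)
    {
        int page = try_once(state);

        if (page >= 0)
            return page;        /* found a suitable page */
        if (page == -2)
            return -1;          /* no free space to be had */
        /* page == -1: an upper node was repaired; try again */
    }
    /* in the backend this would be elog(ERROR, ...): FSM is corrupt */
    return -1;
}

/* Demo callback: pretend the first two descents hit repaired dead ends. */
static int demo_calls;
static int demo_search(void *state)
{
    (void) state;
    return (++demo_calls < 3) ? -1 : 7;
}
```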
6) Does FANOUT^4 fit into an int? (by the way, what does FANOUT mean?)
FANOUT is just an alias for LeafNodesPerPage. It's the number of child
pages a non-leaf FSM page has (or can have).
No, FANOUT^4 doesn't fit into an int, good catch. Actually, the FANOUTPOWERS
table doesn't need to go that far, so that's just a leftover. It only
needs to have DEPTH elements. However, we have the same problem if
DEPTH==3: FANOUT^4 will not fit into an int. I put a comment there.
Ideally, the 4th element would be #iffed out, but I couldn't immediately
figure out how to do that.
And there are more comments on rcodereview:
pgsql/src/backend/catalog/index.c
<http://reviewdemo.postgresql.org/r/27/#comment45>
Strange comment? Looks like copy paste error.
That function, setNewRelfilenode(), is used for heaps as well, even
though it's in index.c. I'll phrase the comment better.
pgsql/src/backend/catalog/index.c
<http://reviewdemo.postgresql.org/r/27/#comment47>
?RELKIND_INDEX?
No, that's correct, see above. The FSM is only created for heaps there,
indexes are responsible for creating their own FSM if they need one.
Hash indexes for example don't need one.
pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment37>
Use shift, although the compiler could optimize it anyway.
I think I'll leave it to the compiler, for the sake of readability.
pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment38>
Why? ;-)
:-) Comment updated to:
/* Can't ask for more space than the highest category represents */
pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment39>
What happens if FSM_FORKNUM does not exist?
smgrnblocks() will throw an error, I believe. Users of the FSM are
responsible for creating the FSM fork if they need it.
pgsql/src/include/storage/bufmgr.h
<http://reviewdemo.postgresql.org/r/27/#comment36>
Needs consolidation - forknum vs blockNum, zeroPage
What do you mean?
pgsql/src/include/storage/lwlock.h
<http://reviewdemo.postgresql.org/r/27/#comment49>
Maybe better to use RESERVED to preserve lock numbers. It helps DTrace
scripts be more backward compatible.
Hmm, each lightweight lock uses a few bytes of shared memory, but I
guess that's insignificant. I'll do that and add a comment explaining
why that's done.
Here's a new patch, updated per your comments.
PS. ReviewBoard seems to be quite nice for pointing out small changes
like that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com