Pre-alloc ListCell's optimization

Started by Stephen Frost · almost 15 years ago · 26 messages · pgsql-hackers
#1 Stephen Frost
sfrost@snowman.net

Greetings,

Someone (*cough*Haas*cough) made a claim over beers at PGCon that it
would be very difficult to come up with a way to pre-allocate List
entries and maintain the current List API. I'll admit that it wasn't
quite as trivial as I had *hoped*, but attached is a proof-of-concept
patch which does it.

A couple of notes regarding the patch:

First, it uses ffs(), which might not be fully portable.. We could
certainly implement the same thing in userspace and use ffs() when
it's available.

Second, there are a couple of bugs (or at least, I'll characterize
them that way) where we're pfree'ing a list which has been passed to
list_concat(). That's not strictly legal as either argument passed to
list_concat() could be returned and so one might end up free'ing the
list pointer that was just returned. Those pfree's are commented out
in this patch, but really should probably just be removed or replaced
with 'right' code (if it's critical to free this stuff).

Third, COPY_NODE_CELL() wasn't used anywhere other than _copyList and
would be difficult to keep as a macro given the way allocations happen
now for lists. It's no longer being used at all and therefore should
really be removed.

Fourth, the current implementation goes ahead and allocates 8
ListCell's, which quadruples the size of a List (40 bytes to 168
bytes, assuming 64-bit ints). I don't see that as really being an
issue, but perhaps others would disagree.

Fifth, list_concat() has become more like list_concat_unique() and
friends (and hence slower). Instead of being able to just tack one
list on to the end of the other, we have to do an actual copy of the
list. This is due to having to allow callers to list_free(), which
will call pfree(), and we don't have any way to know which ListCell's
from the *old* list were pre-allocated and which weren't, which can
end up causing pfree() to crash when it's passed an invalid pointer.
In general, I'd hope that not having to palloc() for a small list
might make up for a lot of that slowdown, but you can't really argue
that anything O(n) can compete with the previous O(1) approach.

Finally, sorry it's kind of a fugly patch, it's just a proof of
concept and I'd be happy to clean it up if others feel it's worthwhile
and a reasonable approach, but I really need to get it out there and
take a break from it (I've been a bit obsessive-compulsive about it
since PGCon.. :D).

Thanks!

Stephen

Attachments:

prealloc_list_20110524.patch (text/x-diff; charset=us-ascii, +141/-84)
#2 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#1)
Re: Pre-alloc ListCell's optimization

* Stephen Frost (sfrost@snowman.net) wrote:

Finally, sorry it's kind of a fugly patch, it's just a proof of
concept and I'd be happy to clean it up if others feel it's worthwhile
and a reasonable approach, but I really need to get it out there and
take a break from it (I've been a bit obsessive-compulsive about it
since PGCon.. :D).

Erm, sorry, just to clarify, while it's a P-O-C patch, it does compile
cleanly and passes all the regression tests, so it's something that one
can play with at least. Not sure if it'd be worth benchmarking it until
we feel comfortable that this is a decent approach, but I wouldn't
complain if someone decided to...

Thanks,

Stephen

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Stephen Frost (#1)
Re: Pre-alloc ListCell's optimization

Excerpts from Stephen Frost's message of Tue May 24 22:56:21 -0400 2011:

A couple of notes regarding the patch:

First, it uses ffs(), which might not be fully portable.. We could
certainly implement the same thing in userspace and use ffs() when
it's available.

Err, see RIGHTMOST_ONE in bitmapset.c.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#4 Stephen Frost
sfrost@snowman.net
In reply to: Alvaro Herrera (#3)
Re: Pre-alloc ListCell's optimization

* Alvaro Herrera (alvherre@commandprompt.com) wrote:

Excerpts from Stephen Frost's message of Tue May 24 22:56:21 -0400 2011:

A couple of notes regarding the patch:

First, it uses ffs(), which might not be fully portable.. We could
certainly implement the same thing in userspace and use ffs() when
it's available.

Err, see RIGHTMOST_ONE in bitmapset.c.

hah, and I even looked for something first, apparently not very well.
Thanks, simple change at least. :)

Stephen

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Stephen Frost (#1)
Re: Pre-alloc ListCell's optimization

Excerpts from Stephen Frost's message of Tue May 24 22:56:21 -0400 2011:

Greetings,

Someone (*cough*Haas*cough) made a claim over beers at PGCon that it
would be very difficult to come up with a way to pre-allocate List
entries and maintain the current List API. I'll admit that it wasn't
quite as trivial as I had *hoped*, but attached is a proof-of-concept
patch which does it.

I think what this patch is mainly missing is a description of how the
new allocation is supposed to work, so that we can discuss the details
without having to reverse-engineer them from the code.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#6 Stephen Frost
sfrost@snowman.net
In reply to: Alvaro Herrera (#5)
Re: Pre-alloc ListCell's optimization

* Alvaro Herrera (alvherre@commandprompt.com) wrote:

I think what this patch is mainly missing is a description of how the
new allocation is supposed to work, so that we can discuss the details
without having to reverse-engineer them from the code.

Sure, sorry I didn't include something more descriptive previously.

Basically, I added a ListCell array into the List structure and then
added a bitmap to keep track of which positions in the array were
filled. I added it as an array simply because makeNode() assumes the
size of a List is static and doesn't call through new_list() or
anything. When a new ListCell is needed, it'll check if there's an
available spot in the array and use it if there is. If there's no
more room left, it'll just fall back to doing a palloc() for the
ListCell. On list_delete(), it'll free up the spot that was used by
that cell. One caveat is that it won't try to clean up the used spots
on a list_truncate (since you'd have to traverse the entire list to
figure out if anything getting truncated off is using a spot in the
array). Of course, if you list_truncate to zero, you'll just get NIL
back and the next round through will generate a whole new/empty List
structure for you.

An alternative approach that I was already considering would be to
just allocate ListCell's in bulk (kind of a poor-man's slab allocator, I
believe). To do that we'd have to make the bitmap be a variable length
array of bitmaps and then have a list of pointers to the ListCell block
allocations. Seems like that's probably overkill for this, however.
The idea for doing this was to address the case of small lists having to
go through the palloc() process over and over. We'd be penalizing those
again if we add a lot of complexity so that very large lists don't have
to palloc() as much.

Thanks

Stephen

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#6)
Re: Pre-alloc ListCell's optimization

Stephen Frost <sfrost@snowman.net> writes:

Basically, I added a ListCell array into the List structure and then
added a bitmap to keep track of which positions in the array were
filled.

Hm. I've gotten the impression from previous testing that there are an
awful lot of extremely short lists (1 or 2 elements) running around in a
typical query. (One source for those is the argument lists for
operators and functions.) I'm worried that this type of approach would
bloat the storage required in those cases to a degree that would make
the patch unattractive. ISTM the first thing we'd need to have before
we could think about this rationally is some measurements about the
frequencies of different List lengths in a typical workload.

When Neil redid the List infrastructure a few years ago, there was some
discussion of special-casing the very first ListCell, and allocating
just that cell along with the List header. That'd be sort of the
minimal version of what you've done here, and would be guaranteed to
never eat any wasted space (since a list that has a header isn't empty).
We should probably compare the behavior of that minimalistic version to
versions with different sizes of preallocated arrays.

An alternative approach that I was already considering would be to
just allocate ListCell's in bulk (kind of a poor-man's slab allocator, I
believe). To do that we'd have to make the bitmap be a variable length
array of bitmaps and then have a list of pointers to the ListCell block
allocations. Seems like that's probably overkill for this, however.

That would be pointing in the direction of trying to save space for very
long Lists, which is a use-case that I'm not sure occurs often enough
for us to be worth spending effort on, and in any case is a distinct
issue from that of saving palloc time for very short Lists. Again, some
statistics about actual list lengths would be really nice to have ...

regards, tom lane

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#1)
Re: Pre-alloc ListCell's optimization

On Tue, May 24, 2011 at 10:56 PM, Stephen Frost <sfrost@snowman.net> wrote:

 Someone (*cough*Haas*cough) made a claim over beers at PGCon that it
 would be very difficult to come up with a way to pre-allocate List
 entries and maintain the current List API.  I'll admit that it wasn't
 quite as trivial as I had *hoped*, but attached is a proof-of-concept
 patch which does it.

[ various points ]

So I guess the first question here is - does it improve performance?

Because if it does, then it's worth pursuing ... if not, that's the
first thing to fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#7)
Re: Pre-alloc ListCell's optimization

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

I'm worried that this type of approach would
bloat the storage required in those cases to a degree that would make
the patch unattractive.

While I agree that there is some bloat that'll happen with this
approach, we could reduce it by just having a 4-entry cache instead of
an 8-entry cache. I'm not really sure that saving those 64 bytes per
list is really worth it though. The cost of allocating the memory
doesn't seem like it changes a lot between those and I don't think it's
terribly common for us to copy lists around (copyList doesn't memcpy()
them).

ISTM the first thing we'd need to have before
we could think about this rationally is some measurements about the
frequencies of different List lengths in a typical workload.

I agree, that'd be a good thing to have. I'll look into measuring that.

When Neil redid the List infrastructure a few years ago, there was some
discussion of special-casing the very first ListCell, and allocating
just that cell along with the List header.

Well, we do allocate the first cell when we create a list in new_list(),
but it's a separate palloc() call. One of the annoying things that I
ran into with this patch is trying to keep track of if something could
be free'd with pfree() or not. Can't allow pfree() of something inside
the array, etc. Handling the 1-entry case would likely be pretty
straight-forward, but you need book-keeping as soon as you go to two,
and all that book-keeping feels like overkill for just a 2-entry cache
to me.

I'll try to collect some info on list lengths and whatnot though and get
a feel for just how much this is likely to help. Of course, if someone
else has time to help with that, I wouldn't complain. :)

Thanks,

Stephen

#10 Greg Stark
gsstark@mit.edu
In reply to: Stephen Frost (#9)
Re: Pre-alloc ListCell's optimization

On Thu, May 26, 2011 at 11:57 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

I'm worried that this type of approach would
bloat the storage required in those cases to a degree that would make
the patch unattractive.

While I agree that there is some bloat that'll happen with this
approach, we could reduce it by just having a 4-entry cache instead of
an 8-entry cache.  I'm not really sure that saving those 64 bytes per
list is really worth it though.

First off this whole direction seems a bit weird to me. It sounds like
you're just reimplementing palloc inside the List data structure with
its allocator and everything. Why not just improve the memory
allocator in palloc instead of layering a second one on top of it?

But assuming there's an advantage I've missed there's another
possibility here: Are most of these small lists constructed with
list_makeN? In which case maybe the trick would be to special case the
initial contents by hard coding a variable sized array which
represents the first N elements and is only constructed when the list
is first constructed with its initial values. So a list made with
list_make3() would have a 3-element array and then any further
elements added would be in the added cons cells. If any of those were
removed we would decrement the count but leave the array in place.

This would reduce the overhead of any small static lists that aren't
modified much which is probably the real case we're talking about.
Things like operator arguments or things constructed in the parse
tree.

The cost would be the risk of bugs that only occur when something is
passed a 2-element list that was made with list_make2() but not one
made by list_make1() + list_append() or vice versa.

This has the side benefit of allowing an arbitrarily large initial
array (well, as large as the length field for the array size allows)
if we wanted to have something like list_copy_static() which made a
list that was expected not to be modified a lot subsequently and might
as well be stored in a single large array.

But all this seems odd to me. The only reason for any of this is for
api convenience so we can pass around lists instead of passing arrays.
If the next links are really a big source of overhead we should just
fix our apis to take arrays of the right length or arrays with a
separate length argument.

Or if it's just palloc we should fix our memory allocator to behave
the way the callers need it to. Heikki long ago suggested adding a
stack allocator for the parser to use for its memory context to reduce
overhead of small allocations which won't be freed until the context
is freed for example.

--
greg

#11 Greg Stark
gsstark@mit.edu
In reply to: Stephen Frost (#9)
Re: Pre-alloc ListCell's optimization

On Thu, May 26, 2011 at 11:57 AM, Stephen Frost <sfrost@snowman.net> wrote:

Handling the 1-entry case would likely be pretty
straight-forward, but you need book-keeping as soon as you go to two,
and all that book-keeping feels like overkill for just a 2-entry cache
to me.

Incidentally what if I call nconc and pass a second arg of a list that
has the first few elements stashed in an array. Do you copy those
elements into cells before doing the nconc? Does our nconc support
having lists share cells? I suspect it doesn't actually so perhaps
that's good enough.

--
greg

#12 Stephen Frost
sfrost@snowman.net
In reply to: Greg Stark (#11)
Re: Pre-alloc ListCell's optimization

* Greg Stark (gsstark@mit.edu) wrote:

On Thu, May 26, 2011 at 11:57 AM, Stephen Frost <sfrost@snowman.net> wrote:

Handling the 1-entry case would likely be pretty
straight-forward, but you need book-keeping as soon as you go to two,
and all that book-keeping feels like overkill for just a 2-entry cache
to me.

Incidentally what if I call nconc and pass a second arg of a list that
has the first few elements stashed in an array. Do you copy those
elements into cells before doing the nconc? Does our nconc support
having lists share cells? I suspect it doesn't actually so perhaps
that's good enough.

nconc() turns into list_concat() which turns into adding list2 on to the
end of list1 using the other normal lappend() routines which will
utilize space in the cache of list1 if there is space available. Trying
to use the old list2 for storage or much of anything turned into a real
pain, unfortunately. list_concat() does explicitly say that cells will
be shared afterwards and that you can't pfree() either list (note that
there's actually a couple cases currently that I discovered which were
also addressed in the original patch where I commented out those
pfree()'s).

Thanks,

Stephen

#13 Greg Stark
gsstark@mit.edu
In reply to: Stephen Frost (#12)
Re: Pre-alloc ListCell's optimization

On Thu, May 26, 2011 at 8:52 PM, Stephen Frost <sfrost@snowman.net> wrote:

 list_concat() does explicitly say that cells will
be shared afterwards and that you can't pfree() either list (note that
there's actually a couple cases currently that I discovered which were
also addressed in the original patch where I commented out those
pfree()'s).

So in traditional list it would splice the second argument onto the
end of the first list. This has a few effects that it sounds like you
haven't preserved. For example if I insert an element anywhere in
list2 -- including in the first few elements -- it's also inserted
into list1.

I'm not really sure we care about these semantics with our lists
though. It's not like they're supposed to be a full-featured lisp
emulator and it's not like the C code pulls any particularly clever
tricks with lists. I suspect we may have already broken these
semantics long ago but I haven't looked to see if that's the case.

--
greg

#14 Stephen Frost
sfrost@snowman.net
In reply to: Greg Stark (#10)
Re: Pre-alloc ListCell's optimization

* Greg Stark (gsstark@mit.edu) wrote:

On Thu, May 26, 2011 at 11:57 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
While I agree that there is some bloat that'll happen with this
approach, we could reduce it by just having a 4-entry cache instead of
an 8-entry cache.  I'm not really sure that saving those 64 bytes per
list is really worth it though.

First off this whole direction seems a bit weird to me. It sounds like
you're just reimplementing palloc inside the List data structure with
its allocator and everything. Why not just improve the memory
allocator in palloc instead of layering a second one on top of it?

I do think it'd be great to improve palloc(), but having looked through
that code, figuring out how to improve it for the small case (such as
with the lists) while keeping it working well for larger and other cases
doesn't exactly look trivial.

But assuming there's an advantage I've missed there's another
possibility here: Are most of these small lists constructed with
list_makeN?

Looks like we've got 306 cases of list_make1(), 82 cases of list_makeN()
(where N > 1), but that said, one can make a list w/ just lappend(), and
that seems to happen with some regularity.

But all this seems odd to me. The only reason for any of this is for
api convenience so we can pass around lists instead of passing arrays.
If the next links are really a big source of overhead we should just
fix our apis to take arrays of the right length or arrays with a
separate length argument.

I'm not really sure I agree with this.. Lists are pretty useful and
easier to manage when you don't know the size. I expect quite a few of
these lists are small for simple queries and can get pretty large for
complex queries. Also, in many cases it's natural to step through the
list and not need random access into it, which at least reduces the
reasons to go to the effort of having a variable length array.

Or if it's just palloc we should fix our memory allocator to behave
the way the callers need it to. Heikki long ago suggested adding a
stack allocator for the parser to use for its memory context to reduce
overhead of small allocations which won't be freed until the context
is freed for example.

Much of this originated from Greg's oprofile and Tom's further
commentary on it here:

http://archives.postgresql.org/pgsql-hackers/2011-04/msg00714.php

Thanks,

Stephen

#15 Stephen Frost
sfrost@snowman.net
In reply to: Greg Stark (#13)
Re: Pre-alloc ListCell's optimization

Greg,

* Greg Stark (gsstark@mit.edu) wrote:

On Thu, May 26, 2011 at 8:52 PM, Stephen Frost <sfrost@snowman.net> wrote:

 list_concat() does explicitly say that cells will
be shared afterwards and that you can't pfree() either list (note that
there's actually a couple cases currently that I discovered which were
also addressed in the original patch where I commented out those
pfree()'s).

So in traditional list it would splice the second argument onto the
end of the first list. This has a few effects that it sounds like you
haven't preserved. For example if I insert an element anywhere in
list2 -- including in the first few elements -- it's also inserted
into list1.

Reading through the comments, it doesn't look like we expressly forbid
that, but it seems pretty unlikely that it's done. In any case, it
wouldn't be difficult to fix, to be honest.. All we'd have to do is
modify list2's head pointer to point to the new location. We do say
that list1 is destructively changed and that the returned pointer must
be used going forward.

I'm not really sure we care about these semantics with our lists
though. It's not like they're supposed to be a full-featured lisp
emulator and it's not like the C code pulls any particularly clever
tricks with lists. I suspect we may have already broken these
semantics long ago but I haven't looked to see if that's the case.

It doesn't look like it was broken previously, but at the same time, it
doesn't look like those semantics are depended upon (or at least,
they're not tested through the regressions :).

Thanks,

Stephen

#16 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#9)
Re: Pre-alloc ListCell's optimization

* Stephen Frost (sfrost@snowman.net) wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

ISTM the first thing we'd need to have before
we could think about this rationally is some measurements about the
frequencies of different List lengths in a typical workload.

I agree, that'd be a good thing to have. I'll look into measuring that.

Ok, it took me, uh, a little while to get around to this, but:

http://tamriel.snowman.net/~sfrost/list_histgram.svg

Is what our list lengths look like for the regression tests. We could
do a pg_bench run, but it looks like Tom's right here- the vast majority
of our lists are small. Highlights:

63% are 1-element
25% are 2-element

Lists of 4 or fewer elements are 97%
Lists of 8 or fewer elements are 99%

So, when it comes to palloc() reduction, this patch would eliminate 99%
of palloc's due to lists. For the regression tests, we're talking about
reducing 893,206 palloc calls to only 1.

Thanks,

Stephen

#17 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#16)
Re: Pre-alloc ListCell's optimization

* Stephen Frost (sfrost@snowman.net) wrote:

So, when it comes to palloc() reduction, this patch would eliminate 99%
of palloc's due to lists. For the regression tests, we're talking about
reducing 893,206 palloc calls to only 1.

Apologies, that wasn't quite right- it'd reduce it to 1 palloc call for
each of the 893,206 lists. In terms of actual comparison of palloc
calls, we have to look at how many are done today for those lists:

palloc calls for:

1-node lists: 1143794
2-node lists: 673071
3-node lists: 205364
4-node lists: 133650
5-node lists: 60552
6-node lists: 28896
7-node lists: 13248
8-node lists: 27045

Total: 2,285,620

Instead, we'd have 1 palloc call for each list, or 893,206, so we'd be
removing ~1.4M calls across the regression suite.

I'll work on cleaning up the patch to be committable early in the 9.3
dev cycle, so it can get a lot of testing prior to the next release.

Thanks,

Stephen

#18 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#1)
Re: Pre-alloc ListCell's optimization

Tom,

* Stephen Frost (sfrost@snowman.net) wrote:

Second, there are a couple of bugs (or at least, I'll characterize
them that way) where we're pfree'ing a list which has been passed to
list_concat(). That's not strictly legal as either argument passed to
list_concat() could be returned and so one might end up free'ing the
list pointer that was just returned. Those pfree's are commented out
in this patch, but really should probably just be removed or replaced
with 'right' code (if it's critical to free this stuff).

A couple of these pfree's were added to simplify_or_arguments() in
commit 56c88772911b4e4c8fbb86d8687d95c3acd38a4f and the other was added
to transformFromClauseItem() in a4996a895399a4b0363c7dace71fc6ce8acbc196.

The two cases in clauses.c are pretty specific and documented:

    List       *subargs = list_copy(((BoolExpr *) arg)->args);

    /* overly tense code to avoid leaking unused list header */
    if (!unprocessed_args)
        unprocessed_args = subargs;
    else
    {
        List       *oldhdr = unprocessed_args;

        unprocessed_args = list_concat(subargs, unprocessed_args);
        pfree(oldhdr);
    }

which really makes me wonder if it's a pretty important cleanup due to
how this call can end up getting nested pretty deeply and the number of
potential loops. On the surface, it looks like this whole chunk could
be reduced to a single list_concat() call. With the new list
implementation, we'll probably want to review this area and make sure
that we don't end up leaking anything through the list_copy/list_concat
usage pattern above. Other thoughts?

In transformFromClauseItem(), the pfree is:

pfree(r_relnamespace); /* free unneeded list header */

which appears to be more general-purpose cleanup that could be safely
removed.

Thanks,

Stephen

#19 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#18)
Re: Pre-alloc ListCell's optimization

* Stephen Frost (sfrost@snowman.net) wrote:

The two cases in clauses.c are pretty specific and documented:

    List       *subargs = list_copy(((BoolExpr *) arg)->args);

    /* overly tense code to avoid leaking unused list header */
    if (!unprocessed_args)
        unprocessed_args = subargs;
    else
    {
        List       *oldhdr = unprocessed_args;

        unprocessed_args = list_concat(subargs, unprocessed_args);
        pfree(oldhdr);
    }

Having thought through this a bit more, with the new list
implementation, it's possible to just replace the pfree(oldhdr); with
list_free(oldhdr); and we may want to document or perhaps encourage that
users of list_concat() consider list_free()'ing the 2nd list if memory
is a concern under the new implementation.

The list_copy() will allocate a new list into subargs. list_concat()
with the new list implementation will just append unprocessed_args on to
the end of subargs, and with unprocessed_args being replaced and that
list only being referenced by oldhdr, we can go ahead and do a shallow
free of that list.

We may want to go back and consider other uses of list_concat() under
the new list implementation, since the amount of memory leaked now is
larger than just the list header, it's the list header and the array of
8 initial list element slots. Still, these were the only places where
we were worried about free'ing the list header, so perhaps the other
list_concat() uses aren't in highly called areas or are in situations
which get cleaned up by the memory context system sufficiently.

In transformFromClauseItem(), the pfree is:

pfree(r_relnamespace); /* free unneeded list header */

Same here- this can be replaced with a list_free() in place of the
pfree() under the new list implementation.

These changes pass all regression tests, though I don't know how
heavily these areas are tested in the regression suite.

Thanks,

Stephen

#20 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#1)
Re: Pre-alloc ListCell's optimization

All,

Here's an updated version of the patch which cleans up a couple of the
previous issues, including addressing some of the free'ing issues.

Looking forward to comments.

Thanks,

Stephen

Attachments:

llist_opt_20120516.patch (text/x-diff; charset=us-ascii, +176/-124)
#21 Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#8)
#22 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#21)
#23 Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#22)
#24 Bruce Momjian
bruce@momjian.us
In reply to: Stephen Frost (#23)
#25 Stephen Frost
sfrost@snowman.net
In reply to: Bruce Momjian (#24)
#26 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Stephen Frost (#25)