On Scalability
Hi all.
I'm wondering about PGSQL scalability.
In particular I have two main topics in my mind:
1. What'd be the behavior of the query planner in the case I have
a single huge table with hundreds or thousands of partial indexes
(just differing by the WHERE clause).
This is an idea of mine to make index-partitioning instead of
table-partitioning.
2. What'd be the behavior of the query planner in the case I have
hundreds or thousands of child tables, possibly in a multilevel hierarchy
(let's say, partitioning by year, month and company).
I fear that linear selection algorithms in these two cases
would kill my design.
Is there any insight about these two points?
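[Editorial sketch of the two designs above, with hypothetical table and column names, to make the question concrete:]

```sql
-- Design 1: one big table, many partial indexes differing only in WHERE
CREATE TABLE sales (
  id      bigserial PRIMARY KEY,
  company text      NOT NULL,
  sold_at timestamp NOT NULL,
  amount  numeric   NOT NULL
);
CREATE INDEX sales_acme_idx ON sales (sold_at) WHERE company = 'ACME';
CREATE INDEX sales_init_idx ON sales (sold_at) WHERE company = 'INIT';

-- Design 2: multilevel inheritance, e.g. year -> month
CREATE TABLE sales_2010 (
  CHECK (sold_at >= '2010-01-01' AND sold_at < '2011-01-01')
) INHERITS (sales);
CREATE TABLE sales_2010_07 (
  CHECK (sold_at >= '2010-07-01' AND sold_at < '2010-08-01')
) INHERITS (sales_2010);
```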
--
NotOrAnd Information Technologies
Vincenzo Romano
--
NON QVIETIS MARIBVS NAVTA PERITVS
On Thu, 2010-07-29 at 19:08 +0200, Vincenzo Romano wrote:
Hi all.
I'm wondering about PGSQL scalability.
In particular I have two main topics in my mind:
1. What'd be the behavior of the query planner in the case I have
a single huge table with hundreds or thousands of partial indexes
(just differing by the WHERE clause).
This is an idea of mine to make index-partitioning instead of
table-partitioning.
Well the planner is not going to care about the partial indexes that
don't match the where clause but what you are suggesting is going to
make writes and maintenance extremely expensive. It will also increase
planning time as the optimizer at a minimum has to discard the use of
those indexes.
2. What'd be the behavior of the query planner in the case I have
hundreds or thousands of child tables, possibly in a multilevel hierarchy
(let's say, partitioning by year, month and company).
Again, test it. Generally speaking the number of child tables directly
correlates to planning time. Most experience that 60-100 tables is
really the highest you can go.
It all depends on actual implementation and business requirements
however.
Sincerely,
Joshua D. Drake
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
2010/7/29 Joshua D. Drake <jd@commandprompt.com>:
On Thu, 2010-07-29 at 19:08 +0200, Vincenzo Romano wrote:
Hi all.
I'm wondering about PGSQL scalability.
In particular I have two main topics in my mind:
1. What'd be the behavior of the query planner in the case I have
a single huge table with hundreds or thousands of partial indexes
(just differing by the WHERE clause).
This is an idea of mine to make index-partitioning instead of
table-partitioning.

Well the planner is not going to care about the partial indexes that
don't match the where clause but what you are suggesting is going to
make writes and maintenance extremely expensive. It will also increase
planning time as the optimizer at a minimum has to discard the use of
those indexes.

2. What'd be the behavior of the query planner in the case I have
hundreds or thousands of child tables, possibly in a multilevel hierarchy
(let's say, partitioning by year, month and company).

Again, test it. Generally speaking the number of child tables directly
correlates to planning time. Most experience that 60-100 tables is
really the highest you can go.

It all depends on actual implementation and business requirements
however.

Sincerely,
Joshua D. Drake
I expect that a more complex schema will imply higher workloads
on the query planner. What I don't know is how the increase in the
workload will happen: linearly, sublinearly, polynomially or what?
Significant testing would require a prototype implementation with
an almost complete feed of data from the current solution.
But I'm at the feasibility study stage and have not enough resources
for that.
Thanks anyway for the insights, Joshua.
Does the 60-100 tables limit apply to a single level
of inheritance? Or is it more general?
--
NotOrAnd Information Technologies
Vincenzo Romano
--
NON QVIETIS MARIBVS NAVTA PERITVS
On Thu, 2010-07-29 at 19:34 +0200, Vincenzo Romano wrote:
I expect that a more complex schema will imply higher workloads
on the query planner. What I don't know is how the increase in the
workload will happen: linearly, sublinearly, polynomially or what?

Significant testing would require a prototype implementation with
an almost complete feed of data from the current solution.
But I'm at the feasibility study stage and have not enough resources
for that.

Thanks anyway for the insights, Joshua.
Does the 60-100 tables limit apply to a single level
of inheritance? Or is it more general?
I do not currently have experience (except that it is possible) with
multi-level inheritance and postgresql.
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
2010/7/29 Joshua D. Drake <jd@commandprompt.com>:
On Thu, 2010-07-29 at 19:34 +0200, Vincenzo Romano wrote:
I expect that a more complex schema will imply higher workloads
on the query planner. What I don't know is how the increase in the
workload will happen: linearly, sublinearly, polynomially or what?
Do you think I should ask somewhere else?
Any hint?
Thanks anyway for the insights, Joshua.
Does the 60-100 tables limit apply to a single level
of inheritance? Or is it more general?

I do not currently have experience (except that it is possible) with
multi-level inheritance and postgresql.
Thanks anyway.
--
Vincenzo Romano at NotOrAnd Information Technologies
Software Hardware Networking Training Support Security
--
NON QVIETIS MARIBVS NAVTA PERITVS
On Thu, 2010-07-29 at 19:52 +0200, Vincenzo Romano wrote:
2010/7/29 Joshua D. Drake <jd@commandprompt.com>:
On Thu, 2010-07-29 at 19:34 +0200, Vincenzo Romano wrote:
I expect that a more complex schema will imply higher workloads
on the query planner. What I don't know is how the increase in the
workload will happen: linearly, sublinearly, polynomially or what?

Do you think I should ask somewhere else?
Any hint?
The two people that would likely know the best are on vacation, TGL and
Heikki. You may have to wait a bit.
Sincerely,
Joshua D. Drake
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Do you think I should ask somewhere else?
Any hint?
I might suggest asking on the pgsql-performance mailing list instead.
You'll get *lots* more speculation there. However, the only way you're
really going to know is to test.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
2010/7/29 Josh Berkus <josh@agliodbs.com>:
Do you think I should ask somewhere else?
Any hint?

I might suggest asking on the pgsql-performance mailing list instead.
You'll get *lots* more speculation there. However, the only way you're
really going to know is to test.
Or maybe checking against the source code and its documentation, if any.
--
Vincenzo Romano at NotOrAnd Information Technologies
Software Hardware Networking Training Support Security
--
NON QVIETIS MARIBVS NAVTA PERITVS
2010/7/29 Josh Berkus <josh@agliodbs.com>:
Or maybe checking against the source code and its documentation, if any.
No, not really. What you really want to know is: what's the real
planner overhead of having dozens/hundreds of partial indexes? What's
the write overhead? There's no way you can derive that from the source
code faster than you can test it.
Again, such a test would be rather killing for my group at this stage.
I think that knowing whether certain parts have been implemented
with linear or sub-linear (or whatever else) algorithms would
give good insights about scalability.
At a first glance it seems that for inheritance some bottleneck is
hindering a full exploitation of table partitioning.
Is there anyone who knows whether those algorithms are linear or not?
And of course, I agree that real tests on real data will provide the real thing.
--
Vincenzo Romano at NotOrAnd Information Technologies
Software Hardware Networking Training Support Security
--
NON QVIETIS MARIBVS NAVTA PERITVS
On Fri, Jul 30, 2010 at 11:24 AM, Vincenzo Romano
<vincenzo.romano@notorand.it> wrote:
At a first glance it seems that for inheritance some bottleneck is
hindering a full exploitation of table partitioning.
There have been lengthy discussions of how to implement partitioning
to fix these precise problems, yes.
Is there anyone who knows whether those algorithms are linear or not?
They're linear in both cases. But they happen at plan time rather than
query execution time. So if your application prepares all its queries
and then uses them many times it would not slow down query execution
but would slow down the query planning time. In some applications this
is much better but in others unpredictable run-times is as bad as long
run-times.
Also in the case of having many partial indexes it would slow down
inserts and updates as well, though to a lesser degree, and that would
happen at execution time.
--
greg
2010/7/30 Greg Stark <gsstark@mit.edu>:
On Fri, Jul 30, 2010 at 11:24 AM, Vincenzo Romano
<vincenzo.romano@notorand.it> wrote:

At a first glance it seems that for inheritance some bottleneck is
hindering a full exploitation of table partitioning.

There have been lengthy discussions of how to implement partitioning
to fix these precise problems, yes.
Any reference?
Is there anyone who knows whether those algorithms are linear or not?
They're linear in both cases. But they happen at plan time rather than
query execution time. So if your application prepares all its queries
and then uses them many times it would not slow down query execution
but would slow down the query planning time. In some applications this
is much better but in others unpredictable run-times is as bad as long
run-times.
Hmmm ... maybe I'm missing the inner meaning of your remarks, Greg.
By using PREPARE I run the query planner sooner and I should reuse
the plan at later executions.
You can bet that some of the PREPAREd query variables will
pertain to either the child tables' CHECK constraints (for table partitioning)
or to the partial indexes' WHERE conditions (for index partitioning).
It's exactly at this point (execution time) that the "linearity" will
kill the query over a largely partitioned table.
Is this what you meant? :-)
Also in the case of having many partial indexes it would slow down
inserts and updates as well, though to a lesser degree, and that would
happen at execution time.
This makes fully sense to me.
--
Vincenzo Romano at NotOrAnd Information Technologies
Software Hardware Networking Training Support Security
--
NON QVIETIS MARIBVS NAVTA PERITVS
Is there anyone who knows whether those algorithms are linear or not?
Read the code? It's really very accessible, and there's lots and lots
of comments. While the person who wrote the code is around, isn't it
better to see the real implementation?
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
2010/7/30 Josh Berkus <josh@agliodbs.com>:
Is there anyone who knows whether those algorithms are linear or not?
Read the code? It's really very accessible, and there's lots and lots
of comments. While the person who wrote the code is around, isn't it
better to see the real implementation?
If the programmer(s) who wrote that part are around, a simple hint would suffice.
Even a hint about where to look in the code would be very appreciated: the query
planner is not as simple as the "ls" command (which is not that simple any
more, though).
It looks like I need to go the hard way ...
Starting from postgresql-8.4.4/src/backend/optimizer
--
Vincenzo Romano at NotOrAnd Information Technologies
Software Hardware Networking Training Support Security
--
cel +393398083886 fix +390823454163 fax +3902700506964
gtalk. vincenzo.romano@notorand.it skype. notorand.it
--
NON QVIETIS MARIBVS NAVTA PERITVS
Vincenzo Romano wrote:
By using PREPARE I run the query planner sooner and I should reuse
the plan at later executions.
You can bet that some of the PREPAREd query variables will
pertain to either the child tables' CHECK constraints (for table partitioning)
or to the partial indexes' WHERE conditions (for index partitioning).
Prepared statements are not necessarily a cure for long query planning
time, because the sort of planning decisions made with partitioned child
tables and index selection can need to know the parameter values to
execute well; that's usually the situation rather than the exception
with partitions. You run the risk that the generic prepared plan will
end up looking at all the partitions, because at preparation time
it can't figure out which can be excluded. For some types of queries
it can only figure that out once the actual values are supplied.
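[This risk is easy to see directly; a minimal sketch, assuming a partitioned table named sales with a sold_at column:]

```sql
-- Planned when prepared: the parameter value is unknown, so the generic
-- plan may keep every child table in the resulting Append node.
PREPARE p(timestamp) AS
  SELECT count(*) FROM sales WHERE sold_at = $1;
EXPLAIN EXECUTE p('2010-07-29');

-- Planned with a literal value: constraint exclusion can prune the
-- children whose CHECK constraints rule the value out.
EXPLAIN SELECT count(*) FROM sales WHERE sold_at = '2010-07-29';
```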
I think you aren't quite lined up with the people suggesting "test it"
in terms of what that means. The idea is not that you should build a
full on application test case yet, which can be very expensive. The
idea is that you might explore things like "when I partition this way
increasing the partitions from 1 to n, does query time go up linearly?"
by measuring with fake data and a machine-generated schema. What's
happened in some of these cases is that, despite the theory, some
constant or external overhead ends up dominating behavior for lower
numbers. As an example, it was recognized that the amount of statistics
for a table collected with default_statistics_target had a quadratic
impact on some aspects of performance. But it turned out that for the
range of interesting values to most people, the measured runtime did not
go up with the square as feared. The only way that was sorted out was to
build a simple simulation.
Here's a full example from that discussion that shows the sort of tests
you probably want to try, and comments on the perils of guessing based
on theory rather than testing:
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00601.php
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00687.php
generate_series can be very helpful here, and you can even use that to
generate timestamps if you need them in the data set.
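[For instance, a fake data set with timestamps can be built in one statement; table and column names here are just placeholders:]

```sql
-- One row per hour over three years, with a random company code
-- and a random amount
INSERT INTO sales (company, sold_at, amount)
SELECT 'C' || (random() * 99)::int,
       t,
       random() * 1000
FROM generate_series('2008-01-01'::timestamp,
                     '2010-12-31'::timestamp,
                     '1 hour') AS t;
```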
That said, anecdotally everyone agrees that partitions don't scale well
into even the very low hundreds for most people, and doing multi-level
ones won't necessarily normally drop query planning time--just the cost
of maintaining the underlying tables and indexes. My opinion is that
building a simple partitioned case and watching how the EXPLAIN plans
change as you adjust things will be more instructive for you than either
asking about it or reading the source. Vary the parameters, watch the
plans, measure things and graph them if you want to visualize the
behavior better. Same thing goes for large numbers of partial indexes,
which have a similar query planning impact, but unlike partitions I
haven't seen anyone analyze them via benchmarks. I'm sure you could get
help here (probably the performance list is a better spot though) with
getting your test case right if you wanted to try and nail that down.
--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Fri, Jul 30, 2010 at 3:50 PM, Vincenzo Romano
<vincenzo.romano@notorand.it> wrote:
2010/7/30 Josh Berkus <josh@agliodbs.com>:
Is there anyone who knows whether those algorithms are linear or not?
Read the code? It's really very accessible, and there's lots and lots
of comments. While the person who wrote the code is around, isn't it
better to see the real implementation?

If the programmer(s) who wrote that part are around, a simple hint would suffice.
Even a hint about where to look in the code would be very appreciated: the query
planner is not as simple as the "ls" command (which is not that simple any
more, though).

It looks like I need to go the hard way ...
Starting from postgresql-8.4.4/src/backend/optimizer
I think you're approaching this in the wrong way. You've repeatedly
said you don't want to do all the work of setting up a test, but
trying to search the code for algorithms that might not be linear is
not going to be easier. I've been reading this thread and I'm fairly
familiar with this code, and I even understand the algorithms pretty
well, and I don't know whether they're going to be linear for what you
want to do or not. Certainly, the overall task of join planning is
exponential-time in the number of *tables*, but if you're just doing
SELECT statements on a single table, will that be linear? Tough to
say. Certainly, there are a lot of linked lists in there, so if we
have any place where we have two nested loops over the list of
indices, it won't be linear. I can't think of a place where we do
that, but that doesn't mean there isn't one. And even if they are
linear, or n log n or something, the coefficients could still be
lousy. Theoretical computer science is one of my favorite subjects,
but, it doesn't always tell you what you want to know about the real
world.
It doesn't seem like it should be very hard to figure this out
empirically. Just create a big database full of random data. Maybe
you could populate one of the columns with something like (random() *
1000)::int. Then you could create partial indices ON
(some_other_column) WHERE that_column = <blat> for <blat> in 0..999.
Then you could run some test queries and see how you make out.
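[That recipe might look like the following sketch; all table, column, and function names are made up, and on 8.4 the index creation has to be driven from a plpgsql function (or an external script), since DO blocks arrived in 9.0:]

```sql
-- Random test data: one million rows, bucket values in roughly 0..1000
CREATE TABLE t (bucket int, payload int);
INSERT INTO t
SELECT (random() * 1000)::int, (random() * 100000)::int
FROM generate_series(1, 1000000);

-- One partial index per bucket value
CREATE FUNCTION make_partial_indexes() RETURNS void AS $$
DECLARE
  i int;
BEGIN
  FOR i IN 0..999 LOOP
    EXECUTE 'CREATE INDEX t_payload_b' || i ||
            '_idx ON t (payload) WHERE bucket = ' || i;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT make_partial_indexes();

-- Then, with \timing on in psql, watch how planning time changes
-- as the number of indexes varies
EXPLAIN SELECT * FROM t WHERE bucket = 123 AND payload < 500;
```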
Or, I mean, you can read the source code. That's fine, too. It's
just... I've already read the source code quite a few times, and I
still don't know the answer. Admittedly, I wasn't trying to answer
this specific question, but still - I don't think it's an easy
question to answer that way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Hi all.
I already posted this a couple of months ago on -hackers:
http://archives.postgresql.org/pgsql-hackers/2010-07/msg01519.php
I've also been directed to ask here for better and deeper details.
What came out is that the management of both the inheritance hierarchy and
partial indexes doesn't scale up well, as it'd have a linear algorithm
deep in its bowels.
What's the "real" story, then?
Thanks in advance.
2010/7/29 Vincenzo Romano <vincenzo.romano@notorand.it>:
Hi all.
I'm wondering about PGSQL scalability.
In particular I have two main topics in my mind:
1. What'd be the behavior of the query planner in the case I have
a single huge table with hundreds or thousands of partial indexes
(just differing by the WHERE clause).
This is an idea of mine to make index-partitioning instead of
table-partitioning.
2. What'd be the behavior of the query planner in the case I have
hundreds or thousands of child tables, possibly in a multilevel hierarchy
(let's say, partitioning by year, month and company).
I fear that linear selection algorithms in these two cases
would kill my design.
Is there any insight about these two points?
--
NotOrAnd Information Technologies
Vincenzo Romano
--
NON QVIETIS MARIBVS NAVTA PERITVS
--
Vincenzo Romano at NotOrAnd Information Technologies
Software Hardware Networking Training Support Security
--
NON QVIETIS MARIBVS NAVTA PERITVS
Any feedback from TGL and Heikki, then?
2010/7/29 Joshua D. Drake <jd@commandprompt.com>:
On Thu, 2010-07-29 at 19:52 +0200, Vincenzo Romano wrote:
2010/7/29 Joshua D. Drake <jd@commandprompt.com>:
On Thu, 2010-07-29 at 19:34 +0200, Vincenzo Romano wrote:
I expect that a more complex schema will imply higher workloads
on the query planner. What I don't know is how the increase in the
workload will happen: linearly, sublinearly, polynomially or what?
Do you think I should ask somewhere else?
Any hint?

The two people that would likely know the best are on vacation, TGL and
Heikki. You may have to wait a bit.

Sincerely,
Joshua D. Drake
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
--
Vincenzo Romano at NotOrAnd Information Technologies
Software Hardware Networking Training Support Security
--
cel +393398083886 fix +390823454163 fax +3902700506964
gtalk. vincenzo.romano@notorand.it skype. notorand.it
--
NON QVIETIS MARIBVS NAVTA PERITVS
On 07.10.2010 10:09, Vincenzo Romano wrote:
Any feedbacks from TGL and Heikki, then?
I don't have anything to add to what others said already. Your best
advice is to test it yourself.
I would expect the plan time to be linear relative to the number of
partial indexes or child tables involved, except that constraint
exclusion of CHECK constraints on the partitions is exponential. But I
also wouldn't be surprised if there's some other non-linear aspect there
that shows its head with thousands of partitions.
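[The linear behavior of constraint exclusion can be watched directly; a minimal sketch, with illustrative names:]

```sql
SET constraint_exclusion = on;

CREATE TABLE log (ts date);
CREATE TABLE log_2010_06 (CHECK (ts >= '2010-06-01' AND ts < '2010-07-01'))
  INHERITS (log);
CREATE TABLE log_2010_07 (CHECK (ts >= '2010-07-01' AND ts < '2010-08-01'))
  INHERITS (log);

-- The planner proves each child's CHECK constraint against the WHERE
-- clause one child at a time, hence the linear cost in partition count;
-- only log_2010_07 should remain in this plan.
EXPLAIN SELECT * FROM log WHERE ts = '2010-07-15';
```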
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-10-07 at 10:28 +0300, Heikki Linnakangas wrote:
constraint exclusion of CHECK constraints on the partitions is
exponential
Constraint exclusion is linear with respect to number of partitions.
Why do you say exponential?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On 07.10.2010 10:41, Simon Riggs wrote:
On Thu, 2010-10-07 at 10:28 +0300, Heikki Linnakangas wrote:
constraint exclusion of CHECK constraints on the partitions is
exponential

Constraint exclusion is linear with respect to number of partitions.
Why do you say exponential?
For some reason I thought the planner needs to check the constraints of
the partitions against each other, but you're right, clearly that's not
the case. Linear it is.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com