performance expectations for table(s) with 2B recs

Started by David Gauthier over 4 years ago · 6 messages · general
#1David Gauthier
davegauthierpg@gmail.com

PostgreSQL 11.5 on Linux
server = VM provided by our IT dept (IOW, can be grown if needed)
DB itself is on NFS

So far, the tables I have in my DB have relatively low numbers of records
(most are < 10K, all are < 10M). Things have been running great in terms
of performance. But a project is being brainstormed which may require some
tables to contain a couple billion records.

I'm familiar with the need to properly index columns in these tables which
will be used for table joining and query performance (candidates are
columns used in query predicate, etc...). Let's assume that's done right.
And let's assume that the biggest table will have 20 varchars (<= 16 chars
per) and 20 ints. No blobs or jsonb or anything like that.
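For concreteness, a table of that shape might look like this (all names here are hypothetical, not from the actual project):

```sql
-- Back-of-the-envelope sizing: ~23-byte tuple header + 20 ints (80 bytes)
-- + 20 varchars of <= 16 chars (~17 bytes each incl. length byte, ~340 bytes)
-- comes to roughly 450 bytes per row, so 2 billion rows is on the order of
-- 1 TB of heap before any indexes.
CREATE TABLE big_table (
    id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- bigint, not int:
                                                          -- int4 overflows at ~2.1B
    c01 varchar(16),
    c02 varchar(16),
    -- ... c03 through c20 ...
    n01 integer,
    n02 integer
    -- ... n03 through n20 ...
);

-- Index only the columns actually used in joins and predicates:
CREATE INDEX ON big_table (c01);
```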

What else should I be worried about ?

I suspect that part of why things are running really well so far is that
the relatively small amounts of data in these tables end up in the DB
cache and disk I/O is kept to a minimum. Will that no longer be the case
once queries start running on these big tables ?

What about DBA stuff... vacuum and fragmentation and index maintenance,
etc... ?

I don't want to step into this completely blind. Any warnings/insights
would be appreciated.

#2Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: David Gauthier (#1)
Re: performance expectations for table(s) with 2B recs

On 2021-12-08 14:44:47 -0500, David Gauthier wrote:

So far, the tables I have in my DB have relatively low numbers of records (most
are < 10K, all are < 10M).  Things have been running great in terms of
performance.  But a project is being brainstormed which may require some tables
to contain a couple billion records.

[...]

What else should I be worried about ?

I suspect that part of why things are running really well so far is that the
relatively small amounts of data in these tables ends up in the DB cache and
disk I/O is kept at a minimum.  Will that no longer be the case once queries
start running on these big tables ?

Depends a lot on how good the locality of your queries is. If most
queries read only the same parts of the same indexes, those parts will
still be in the cache. If they are all over the place, or if you have
queries which need to read large parts of your tables, cache misses will
make your performance a lot less predictable, yes. That is also hard to
test, because when you run a query twice in a row, the second time it
will likely hit the cache and be quite fast.
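One way to watch this effect (a sketch; the table and column names are made up) is EXPLAIN with the BUFFERS option, which separates buffer-cache hits from actual reads:

```sql
-- "shared hit" blocks came from PostgreSQL's buffer cache;
-- "read" blocks had to be fetched from the OS / disk.
-- Running the same query twice usually shows far more hits the second time.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table WHERE some_col = 'some_value';

-- Aggregate cache hit ratio per table, from the statistics collector:
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC;
```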

But in my experience the biggest problem with large tables is unstable
execution plans - for most parameter values the optimizer will choose
to use an index, but for some it will erroneously think that a full
table scan is faster. That can lead to a situation where a query
normally takes less than a second, but sometimes (seemingly at random)
takes several minutes - users will understandably be upset about such
behaviour. It is in any case a good idea to monitor execution times to
find such problems (ideally before users complain), but each one needs
to be treated individually, and sometimes there seems to be no good
solution.
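One way to monitor for such outliers (a sketch, assuming the pg_stat_statements extension is available) is to compare each statement's typical and worst-case times:

```sql
-- Needs shared_preload_libraries = 'pg_stat_statements', then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Statements whose worst run is far slower than their typical run are
-- candidates for unstable plans. (Before PostgreSQL 13 the columns are
-- named mean_time / max_time instead of mean_exec_time / max_exec_time.)
SELECT query, calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(max_exec_time::numeric, 1)  AS max_ms
FROM pg_stat_statements
WHERE calls > 100
  AND max_exec_time > 50 * mean_exec_time
ORDER BY max_exec_time DESC;
```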

hp

--
Peter J. Holzer | hjp@hjp.at | http://www.hjp.at/
"Story must make more sense than reality." -- Charles Stross, "Creative writing challenge!"

#3Marc Millas
marc.millas@mokadb.com
In reply to: David Gauthier (#1)
Re: performance expectations for table(s) with 2B recs

Partitioning?
If you have some idea of how this data will be accessed, splitting those
big tables into partitions may help:
- vacuum is done at the partition level,
- indexes are partitioned too, so they are much easier to manage (REINDEX is
local to a given partition, so it is easy to reindex the whole thing one
partition at a time).

It is great for performance, too, IF the partition key is in the WHERE
clause. If not, Postgres will need to aggregate and sort the results
gathered from each partition.
That can still be efficient if it allows parallel execution.
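A sketch of what that could look like (hypothetical names; assumes a date column is the natural partition key):

```sql
-- Declarative range partitioning (available since PostgreSQL 10):
CREATE TABLE big_table (
    created date NOT NULL,
    c01     varchar(16),
    n01     integer
) PARTITION BY RANGE (created);

CREATE TABLE big_table_2021 PARTITION OF big_table
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

-- Since PostgreSQL 11, an index on the parent is created on each partition:
CREATE INDEX ON big_table (c01);

-- Maintenance can then be done one partition at a time:
REINDEX TABLE big_table_2021;
VACUUM (VERBOSE) big_table_2021;

-- Partition pruning only applies when the key is in the WHERE clause:
SELECT count(*) FROM big_table
WHERE created >= '2021-06-01' AND created < '2021-07-01';
```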

my 2 cents

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com


#4Godfrin, Philippe E
Philippe.Godfrin@nov.com
In reply to: Peter J. Holzer (#2)
RE: [EXTERNAL] Re: performance expectations for table(s) with 2B recs


But in my experience the biggest problem with large tables are unstable
execution plans - for most of the parameters the optimizer will choose
to use an index, but for some it will erroneously think that a full
table scan is faster. That can lead to a situation where a query
normally takes less than a second, but sometimes (seemingly at random)
it takes several minutes - users will understandably be upset about such
behaviour. It is in any case a good idea to monitor execution times to
find such problems (ideally before users complain), but each needs to be
treated on an individual basis, and sometimes there seems to be no good
solution.

To the OP: that is a tall order to answer - basically, that's why DBAs still
have jobs...

For Peter, I have a question: what exactly causes 'unstable execution plans'??

Besides not using bind variables and bad statistics, would you elaborate on
what else would contribute to that instability?

#5Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Godfrin, Philippe E (#4)
Re: [EXTERNAL] Re: performance expectations for table(s) with 2B recs

On 2021-12-10 18:04:07 +0000, Godfrin, Philippe E wrote:

But in my experience the biggest problem with large tables are unstable
execution plans - for most of the parameters the optimizer will choose
to use an index, but for some it will erroneously think that a full
table scan is faster. That can lead to a situation where a query
normally takes less than a second, but sometimes (seemingly at random)
it takes several minutes

[...]

For Peter I have a question. What exactly causes ‘unstable execution plans’ ??

Besides not using bind variables, bad statistics, would you elaborate
in what would contribute to that instability?

Not using bind variables and bad statistics are certainly big factors:

On the one hand, not using bind variables gives a lot more information to
the optimizer, so it can choose a better plan at run time. On the other
hand, that makes it hard to predict which plan it will choose.

Bad statistics come in many flavours. They might simply be wrong; that's
usually easy to fix. More problematic are statistics which just don't
describe reality very well - they may not show a correlation, causing
the optimizer to assume that two distributions are independent when they
really aren't (since PostgreSQL 10 you can create statistics on multiple
columns, which helps in many but not all cases), or they may not show
some other peculiarity of the data. Or they may be just so close to a
tipping point that a small change causes the optimizer to choose a
wildly different plan.
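The multi-column statistics mentioned above look like this (the table and columns here are a made-up example of correlated data):

```sql
-- city and zip_code are strongly correlated; without extended statistics
-- the planner multiplies their selectivities as if they were independent
-- and can badly underestimate the row count.
CREATE STATISTICS addr_city_zip (dependencies)
    ON city, zip_code FROM addresses;

ANALYZE addresses;  -- the new statistics take effect after the next ANALYZE
```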

Another source is dynamically generated SQL. Your application may put
together SQL from fragments, or it might use something like SQLAlchemy
or an ORM. In any of these cases, what looks like one query from the
user's perspective may really be a whole family of related queries - and
PostgreSQL will try to find the optimal plan for each of them. Which is
generally a good thing, but it adds opportunities to mess up.
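For example, an application that appends filters conditionally generates a whole family of statements, each of which PostgreSQL plans separately (a made-up sketch):

```sql
-- To the user these are all "the order search", but each distinct SQL text
-- gets its own plan, and each plan can independently flip between an index
-- scan and a sequential scan as the estimates change:
SELECT * FROM orders WHERE status = 'open';
SELECT * FROM orders WHERE status = 'open' AND customer_id = 42;
SELECT * FROM orders WHERE status = 'open'
  AND created > now() - interval '7 days';
```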

hp


#6Godfrin, Philippe E
Philippe.Godfrin@nov.com
In reply to: Peter J. Holzer (#5)
RE: [EXTERNAL] Re: performance expectations for table(s) with 2B recs


Good answer, Peter, I agree wholeheartedly. I was curious whether there was something specific to PostgreSQL.
phil