Re: Large DB

Started by Mooney, Ryan · about 22 years ago · 19 messages · hackers, general
#1 Mooney, Ryan
ryan.mooney@pnl.gov
hackers, general

Thanks for the excellent feedback (all)!

Good point on the excess bytes/row, not sure how to explain that. There
have never been any deletes or updates on this table, and all inserts
are just simple inserts (no transactions or anything funky), so there
shouldn't be many aborted transactions (I could see the daemons that do
the inserts dying part way through a few times, but nothing to explain
the variance).

I haven't run ANALYZE on this table in a while. After about 50-60M rows
it didn't seem to change the query plan at all and since there were
never any deletes/updates it seemed like it wasn't making much/any
difference (should have been no pages to reclaim). That may be an
invalid assumption though.

I'll try the other suggestions over the next couple of days and see how
it goes. Thanks again.

Here is an explain on the query:

=> explain select point, avg(pvalue) as avg from tp3 where host in
('m563', 'm562', 'm561', 'm560', 'm559', 'm558', 'm557', 'm538', 'm755',
'm754', 'm753', 'm752', 'm751', 'm750', 'm749', 'm748') and starttime
between '2004-03-27 07:37:43' and '2004-03-30 07:40:08' group by point;

HashAggregate (cost=96.90..96.90 rows=1 width=25)
-> Index Scan using tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime on tp3 (cost=0.00..96.90 rows=1
width=25)
Index Cond: (((host = 'm563'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm562'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm561'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm560'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm559'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm558'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm557'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm538'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm755'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm754'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm753'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm752'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm751'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm750'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm749'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm748'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)))
Filter: (((host = 'm563'::bpchar) OR (host = 'm562'::bpchar) OR
(host = 'm561'::bpchar) OR (host = 'm560'::bpchar) OR (host =
'm559'::bpchar) OR (host = 'm558'::bpchar) OR (host = 'm557'::bpchar) OR
(host = 'm538'::bpchar) OR (host = 'm755'::bpchar) OR (host =
'm754'::bpchar) OR (host = 'm753'::bpchar) OR (host = 'm752'::bpchar) OR
(host = 'm751'::bpchar) OR (host = 'm750'::bpchar) OR (host =
'm749'::bpchar) OR (host = 'm748'::bpchar)) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone))
(4 rows)


-----Original Message-----
From: Manfred Koizar [mailto:mkoi-pg@aon.at]
Sent: Wednesday, March 31, 2004 1:18 AM
To: Mooney, Ryan
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Large DB

On Tue, 30 Mar 2004 17:48:14 -0800, "Mooney, Ryan"
<ryan.mooney@pnl.gov>
wrote:

I have a single table that just went over 234GB in size with about
290M+ rows.

That would mean ~ 800 bytes/row which, given your schema, is
hard to believe unless there are lots of dead tuples lying around.

queries use the indexes fairly well, although I suspect that the order
of host/starttime is suboptimal (fewer hosts than starttime, and the
table is naturally in starttime order). I'm going to try adding an
index on just starttime (and later just host) and see if I can tune the
queries on that more.

Yes, if you are ready to switch OS for a 10% performance
gain, getting your indices right should be no question.

I never delete rows from the table, only do
inserts (up to around 11,000/minute mostly in one big burst every
minute, this is anticipated to go up some over time).

How often do you ANALYSE?

Have there been DELETEs or UPDATEs or aborted transactions in
the past? Did you VACUUM or VACUUM FULL since then?

I'm only doing sub 15MB from the disk array (from iostat) and I know
it can do in the 40-60MB range when we tested the raw speed,

Sounds plausible for nonsequential I/O.

However the two indexes are also large (which may be part of the
problem, which is why I'm trying just starttime for an index; they are
currently in the 140-150G range).

This would be extreme index bloat which is only possible
after massive DELETEs/UPDATEs.

stats=> explain select count(*) from tp3;
   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)

The planner thinks that the table size is 4M pages, 32GB.
The average tuple size of ~110 bytes (including tuple header)
suits your schema quite nicely.

Table "public.tp3"
Column | Type | Modifiers
-------------+-----------------------------+-----------
host | character(4) |
point | character varying(64) |
type | character(1) |
cooked | character(1) |
starttime | timestamp without time zone |
intervallen | interval |
arrivetime | timestamp without time zone |
pvalue | numeric |
Indexes:
"tp3_host_starttime" btree (host, starttime, cooked)
"tp3_point_starttime" btree (point, starttime, cooked)

In my experience any reduction in average tuple size results
directly in a proportional increase of throughput for large
tables. So here are some random thoughts:

You said there are only a few hosts. So moving the hosts
into a separate table with an integer primary key would save
4 bytes per row.

Datatype "char" (with quotes) needs only 1 byte, char(1)
needs 5 bytes, both before padding. Changing type and cooked
from char(1) to "char" would save 12 bytes.

And if you want to push it, you can change hostid to smallint and
rearrange the fields, saving 4 more padding bytes:
hostid | smallint
type | "char"
cooked | "char"

What about point? If there is a known small number of
different values, move it into its own table.

I'm not sure about the storage needs of numeric, might be at
least 8 bytes. Consider using bigint. Someone please correct
me if I'm wrong.

Did you CREATE TABLE tp3 (...) WITHOUT OIDS?
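
To make those suggestions concrete, here is a rough sketch of what the
narrower layout could look like (the table and column names are only
illustrative, and bigint for pvalue assumes the stored values are
integral):

CREATE TABLE hosts (
    hostid smallint PRIMARY KEY,
    host   char(4) NOT NULL UNIQUE
) WITHOUT OIDS;

CREATE TABLE tp3_narrow (
    hostid      smallint,                    -- joins to hosts; 2 bytes vs 8 for char(4)
    type        "char",                      -- 1 byte vs 5 for char(1)
    cooked      "char",
    starttime   timestamp without time zone,
    intervallen interval,
    arrivetime  timestamp without time zone,
    pvalue      bigint,                      -- only if the values fit; numeric is wider
    point       character varying(64)        -- variable-width column kept last
) WITHOUT OIDS;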

Sample data mining query:
----------------------------
select point, avg(pvalue) as avg from tp3 where host in ('node',
'node',
....) and starttime between 'timestamp' and 'timestamp'

group by point

Show us EXPLAIN ANALYSE, please.

shared_buffers = 60800

Looks a bit large to me. But if your tests have shown it to
be the best value, it should be ok.

sort_mem = 1286720 # min 64, size in KB

This is more than 1GB, I think this is too high.

fsync=false # Play fast and loose - whee

How much did this help?

effective_cache_size = 160000

Try more, say 320000 or even 400000.
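
Both settings can be tried per session before touching postgresql.conf;
the values below are only illustrative (effective_cache_size is measured
in 8 kB pages, sort_mem in kB):

SET effective_cache_size = 320000;  -- assume ~2.5 GB of OS buffer cache
SET sort_mem = 65536;               -- 64 MB per sort instead of >1 GB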

Servus
Manfred

#2 Manfred Koizar
mkoi-pg@aon.at
In reply to: Mooney, Ryan (#1)
hackers, general

On Wed, 31 Mar 2004 10:08:26 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov>
wrote:

I haven't run ANALYZE on this table in a while. After about 50-60M rows
it didn't seem to change the query plan at all and since there were
never any deletes/updates it seemed like it wasn't making much/any
difference (should have been no pages to reclaim).

Reclaiming pages is not the job of ANALYSE, VACUUM does this.

That may be an
invalid assumption though.

Might be a valid assumption as well -- if you're lucky. But do you want
to depend on luck? E.g. 75% of today's rows contain timestamps that
are greater than what the planner believes to be the maximum.

No VACUUM, no ANALYSE, no REINDEX. This explains why the planner thinks
there are only 4M pages, which gives 640 bytes/row if there were 50M
rows at that time. OTOH the EXPLAIN shows 290M rows for the seq scan.
Something doesn't fit together here.

Hackers, what could update reltuples, but not relpages?

Or, Ryan, is it possible that you already had 290M rows when you ran
ANALYSE and you have more than 1G rows today?

BTW, ANALYSE is basically a constant time operation.

Here is an explain on the query:

=> explain select point, avg(pvalue) as avg from tp3 where host in

This tells us one half of the story.

EXPLAIN ANALYSE SELECT ...

would tell us the other half, too.
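
For the query above that would be:

EXPLAIN ANALYSE
SELECT point, avg(pvalue) AS avg
FROM tp3
WHERE host IN ('m563', 'm562', 'm561', 'm560', 'm559', 'm558', 'm557',
               'm538', 'm755', 'm754', 'm753', 'm752', 'm751', 'm750',
               'm749', 'm748')
  AND starttime BETWEEN '2004-03-27 07:37:43' AND '2004-03-30 07:40:08'
GROUP BY point;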

Servus
Manfred

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#2)
hackers, general

Manfred Koizar <mkoi-pg@aon.at> writes:

Hackers, what could update reltuples, but not relpages?

Nothing --- they are always updated together. One possibility is that
the 4M pages and 290M rows numbers really do go together (for about 112
bytes/row) and that the table has since grown, or perhaps merely bloated
due to lack of vacuuming of updated rows.

A different line of thought is that they were updated together, but the
relpages estimate was accurate while reltuples was not. ANALYZE knows
the actual table size in pages (because it asks the OS) but reltuples is
extrapolated from an average of the number of live tuples on the pages
ANALYZE looks at. It is possible for ANALYZE to be fooled badly if, for
instance, there are lots and lots of dead rows near the start of the
table. (Lack of regular vacuuming would certainly improve the odds of
this happening...)

Note that VACUUM is not subject to this error because it has to grovel
over every page anyway. So either "VACUUM" or "VACUUM ANALYZE" will
give you a known-good reltuples, it's only standalone "ANALYZE" that
has a risk of estimation error.

regards, tom lane

#4 Manfred Koizar
mkoi-pg@aon.at
In reply to: Manfred Koizar (#2)
hackers, general

On Thu, 01 Apr 2004 12:22:58 +0200, I wrote:

BTW, ANALYSE is basically a constant time operation.

On closer inspection, this is not the whole truth. ANALY[SZ]E is a
two-stage process: first it collects a sample of rows, then these rows
are examined to produce various statistics.

The cost of the latter depends on the sample size, which itself depends
on the default or column-specific statistics target, and the number (and
types) of columns, so it *should* take more or less constant time.

The first step, however, (acquire_sample_rows() in analyze.c) has to
read more rows than finally end up in the sample. It visits less than
O(nblocks) pages but certainly more than O(1).

A vague feeling tries to tell me that the number of page reads is
somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
grow like O(ln(n)).

I have an idea how this could be done with O(1) page reads. If I'm able
to translate it into C, I'll send a patch ...

Servus
Manfred

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#4)
hackers, general

Manfred Koizar <mkoi-pg@aon.at> writes:

The first step, however, (acquire_sample_rows() in analyze.c) has to
read more rows than finally end up in the sample. It visits less than
O(nblocks) pages but certainly more than O(1).

A vague feeling tries to tell me that the number of page reads is
somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
grow like O(ln(n)).

Good guess. Vitter's paper says the expected time to sample n rows from
a table of size N is O(n * (1 + log(N/n))).

I have an idea how this could be done with O(1) page reads.

The hard part is getting a genuinely random sample when we don't know N
in advance. We do however know the table size in blocks, so if you're
willing to make assumptions about constant tuple density you could do
something different. (But the tuple density assumption is exactly the
weak spot of what we've got, so I'm unconvinced that would be a big step
forward.)

regards, tom lane

#6 Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#5)
hackers, general

[time to move this to -hackers]

On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Manfred Koizar <mkoi-pg@aon.at> writes:

The first step, however, (acquire_sample_rows() in analyze.c) has to
read more rows than finally end up in the sample. It visits less than
O(nblocks) pages but certainly more than O(1).

A vague feeling tries to tell me that the number of page reads is
somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
grow like O(ln(n)).

Good guess. Vitter's paper says the expected time to sample n rows from
a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out my wild guess seems to be wrong.

I don't doubt that Vitter's formula is correct, but it assumes that
access to any tuple has the same cost. This does not apply to our
problem, however. With 100 tuples per page, we access the first
sample_size tuples at a cost of 0.01 sequential page reads per tuple.
Later we use fewer and fewer tuples per page, which results in higher
per-tuple-cost. Near the end of a large relation we can expect to
access only one tuple per page and more and more pages are skipped, so
that prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample
size of 3000) I got:

  rel |  page
 size | reads
------+-------
   30 |    30
  300 |   300    (expectation is something like 299.9995)
  500 |   499
   1K |   990
   3K |  2.6K
  30K |    8K
 100K |   12K
   1M |   19K
  10M |   26K
 100M |   33K

This growth rate is steeper than O(log(nblocks)).

I have an idea how this could be done with O(1) page reads.

What I have in mind is a kind of "Double Vitter" algorithm. Whatever we
do to get our sample of rows, in the end the sampled rows come from no
more than sample_size different blocks. So my idea is to first create a
random sample of sample_size block numbers, and then to sample the rows
out of this pool of blocks.

I have to think harder though, what to do about those 400 pages that are
not accessed when the sample size is 3000 ...

The hard part is getting a genuinely random sample when we don't know N
in advance. We do however know the table size in blocks, so if you're
willing to make assumptions about constant tuple density you could do
something different. (But the tuple density assumption is exactly the
weak spot of what we've got, so I'm unconvinced that would be a big step
forward.)

Starting the scan at some random blocks should help against the common
case of unusual distribution of dead tuples near the start of the
relation. And I plan to factor information about dead tuple hits into
an increasingly better estimation of dead/live tuple ratio.

Servus
Manfred

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#6)
hackers, general
Re: [GENERAL] Large DB

Manfred Koizar <mkoi-pg@aon.at> writes:

What I have in mind is a kind of "Double Vitter" algorithm. Whatever we
do to get our sample of rows, in the end the sampled rows come from no
more than sample_size different blocks. So my idea is to first create a
random sample of sample_size block numbers, and then to sample the rows
out of this pool of blocks.

That assumption is faulty, though --- consider wholly-empty pages.

A bigger problem is that this makes the sampling quite nonuniform,
because rows that are on relatively low-density pages would be more
likely to become part of the final sample than rows that are on pages
with lots of tuples. Thus for example your sample would tend to favor
rows with wide values of variable-width columns and exclude narrower
values. (I am not certain that the existing algorithm completely avoids
this trap, but at least it tries.)

regards, tom lane

#8 Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#7)
hackers, general
Re: [GENERAL] Large DB

On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Manfred Koizar <mkoi-pg@aon.at> writes:

What I have in mind is a kind of "Double Vitter" algorithm. [...]
random sample of sample_size block numbers, and then to sample the rows
out of this pool of blocks.

That assumption is faulty, though --- consider wholly-empty pages.

A bigger problem is that this makes the sampling quite nonuniform,
because rows that are on relatively low-density pages would be more
likely to become part of the final sample than rows that are on pages
with lots of tuples.

This sounds like you are assuming that I want to take exactly one tuple
out of each block of the block sample. This is not the case. In the
second round I plan to apply the same (or a better) Vitter method as it
is done now. The main difference is that blocks will be addressed
indirectly through the array of block numbers obtained in the first
round.

Thus for example your sample would tend to favor
rows with wide values of variable-width columns and exclude narrower
values. (I am not certain that the existing algorithm completely avoids
this trap, but at least it tries.)

I'm reading 7.4 source code and I fail to see how it does this. If the
relation starts with an atypical distribution of wide/narrow or
dead/alive tuples, a wrong value for tuplesperpage is used for the rest
of the sampling.

Tuples immediately following one or more dead tuples have a better
chance of being selected. This may be called as random as anything else
and not favouring a special property. OTOH after long runs of dead
tuples consecutive tuples are likely to be selected.

Your comment about nonuniformity above exactly describes the current
algorithm: Once the initial sample is fetched and tuplesperpage is
determined, targpos is computed without any further feedback. If
targpos points to a sparsely populated area (with wide tuples or with
many dead tuples) tuples in this area are more likely to get into the
sample than tuples in densely populated areas (with many small active
tuples).

I think that cutting down the number of blocks to be looked at does not
affect these problems.

Servus
Manfred

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#8)
hackers, general
Re: [GENERAL] Large DB

Manfred Koizar <mkoi-pg@aon.at> writes:

On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

A bigger problem is that this makes the sampling quite nonuniform,
because rows that are on relatively low-density pages would be more
likely to become part of the final sample than rows that are on pages
with lots of tuples.

This sounds like you are assuming that I want to take exactly one tuple
out of each block of the block sample. This is not the case.

No, I understood that you wanted to resample, but [ ... thinks for
awhile ... ] hmm, now I can't construct a failure case either. I must
have done the math wrong before.

There's still a risk of not being able to collect N rows out of N
blocks, if you are unfortunate enough to select a lot of wholly-empty
pages. But that seems like a low-probability scenario; besides such a
table would be so desperately in need of VACUUM FULL that the possible
low quality of the stats hardly matters.

You should not need to use the Vitter algorithm for the block-level
selection, since you can know the number of blocks in the table in
advance. You can just use the traditional method of choosing each block
or not with probability (k/K), where k = number of sample blocks still
needed, K = number of blocks from here to the end. You'd run the Vitter
algorithm separately to decide whether to keep or discard each live row
you find in the blocks you read.

I do like this, since it eliminates the current method's bias towards
estimating the number of live rows from the density found near the start
of the table only. At the end you'd know the total number of live rows
on all the pages you read, and it's reasonable to extrapolate that total
to the full table size.

Question: if the table size is less than N blocks, are you going to read
every block or try to reduce the number of blocks sampled? If you don't
adjust the sample size then I think this would perform worse for
intermediate-size tables than the current method does ... perhaps not so
much at sample size = 3000, but at larger sizes it would hurt. A lot of
people are setting the stats target to 100 which means a sample size of
30000 --- how do the page-access counts look in that case?

regards, tom lane

#10 Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#9)
hackers, general
Re: [GENERAL] Large DB

On Fri, 02 Apr 2004 18:06:12 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

You should not need to use the Vitter algorithm for the block-level
selection, since you can know the number of blocks in the table in
advance. You can just use the traditional method of choosing each block
or not with probability (k/K), where k = number of sample blocks still
needed, K = number of blocks from here to the end.

Sounds reasonable. I have to play around a bit more to get a feeling
where the Vitter method gets more efficient.

You'd run the Vitter
algorithm separately to decide whether to keep or discard each live row
you find in the blocks you read.

You mean once a block is sampled we inspect it in any case? This was
not the way I had planned to do it, but I'll keep this idea in mind.

Question: if the table size is less than N blocks, are you going to read
every block or try to reduce the number of blocks sampled?

Don't know yet.

people are setting the stats target to 100 which means a sample size of
30000 --- how do the page-access counts look in that case?

  rel |   page
 size |  reads
------+--------
  300 |    300
 3000 |   3000
 5000 |   4999
  10K |   9.9K
  30K |  25.8K
 300K |    85K
   1M |   120K
  10M |   190K
 100M |   260K
   1G |   330K

This is exactly the table I posted before (for sample size 3000) with
every entry multiplied by 10. Well, not quite exactly, but the
differences are far behind the decimal point. So for our purposes, for
a given relation size the number of pages accessed is proportional to
the sample size.

Servus
Manfred

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#10)
hackers, general
Re: [GENERAL] Large DB

Manfred Koizar <mkoi-pg@aon.at> writes:

You'd run the Vitter
algorithm separately to decide whether to keep or discard each live row
you find in the blocks you read.

You mean once a block is sampled we inspect it in any case? This was
not the way I had planned to do it, but I'll keep this idea in mind.

Well, once we've gone to the trouble of reading in a block we
definitely want to count the tuples in it, for the purposes of
extrapolating the total number of tuples in the relation. Given
that, I think the most painless route is simply to use the Vitter
algorithm with the number-of-tuples-scanned as the count variable.
You could dump the logic in acquire_sample_rows that tries to estimate
where to read the N'th tuple from.

If you like I can send you the Vitter paper off-list (I have a PDF of
it). The comments in the code are not really intended to teach someone
what it's good for ...

regards, tom lane

#12 Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#11)
hackers, general
Re: [GENERAL] Large DB

On Fri, 02 Apr 2004 19:57:47 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you like I can send you the Vitter paper off-list (I have a PDF of
it). The comments in the code are not really intended to teach someone
what it's good for ...

Yes, please. [Would have sent this off-list. But I'm blacklisted.]

Servus
Manfred

#13 Mooney, Ryan
ryan.mooney@pnl.gov
In reply to: Manfred Koizar (#12)
general

On Wed, 31 Mar 2004 10:08:26 -0800, "Mooney, Ryan"
<ryan.mooney@pnl.gov>
wrote:

I haven't run ANALYZE on this table in a while. After about 50-60M
rows it didn't seem to change the query plan at all and since there
were never any deletes/updates it seemed like it wasn't making much/any
difference (should have been no pages to reclaim).

Reclaiming pages is not the job of ANALYSE, VACUUM does this.

Sorry, yes that's what I meant - lack of sleep :>

Might be a valid assumption as well -- if you're lucky. But
do you want to depend on luck? Eg. 75% of the today's rows
contain timestamps that are greater than what the planner
believes to be the maximum.

Point taken. See vacuum info below...

No VACUUM, no ANALYSE, no REINDEX. This explains why the
planner thinks there are only 4M pages, which gives 640
bytes/row if there were 50M rows at that time. OTOH the
EXPLAIN shows 290M rows for the seq scan. Something doesn't
fit together here.

Ok, so I ran a vacuum analyse. It took ~1.7 days to finish. I'm running
another one now to see how it does. I had turned off data updates during
the vacuum below; I've turned them back on for the current run to get
some idea about how it's going to work ongoing. This means that the
machine was basically idle during the vacuum, so that's a pretty good
picture of how long just the vacuum took.

The invalid page block was caused when I tried the 2.6 kernel
(for other reasons than DB performance); it's been there for a while,
and I can deal w/ the data loss, so I just turned on
zero_damaged_pages = true (a nice feature would be to be able to set
this by table - I didn't see any way to do that). I did try doing
the "select ctid from tp3 offset 10257032 limit 20;" and deleting
rows with those ctid's, but even after deleting several hundred rows
it was still complaining, so I gave up and moved on...

INFO: vacuuming "public.tp3"
INFO: index "tp3_host_starttime" now contains 1401476187 tuples in
19778677 pages
DETAIL: 11657230 index tuples were removed.
108729 index pages have been deleted, 1 are currently reusable.
CPU 892.96s/602.59u sec elapsed 149710.57 sec.
ERROR: invalid page header in block 10257032 of "tp3_point_starttime"

I had thought that there had never been any deletes/updates, but my
co-worker pointed out that we did do some updates on a couple million
rows a couple months back, so that explains the removed index tuples.

Here is the explain analyse; you can see why I think that an index on
just host might be better - hosts are a small set, starttime is a large
set, so the index should be more efficient. At the very least it should
be (starttime, host), not (host, starttime), unless the indexing engine
is smart enough to make that not matter (I suspect it's not???).

Db => explain analyse select point, avg(pvalue) as avg from tp3 where
host in ('m480', 'm456', 'm455', 'm453', 'm452', 'm451', 'm450', 'm449',
'm368', 'm367', 'm366', 'm365', 'm364', 'm363', 'm362', 'm361', 'm783',
'm782', 'm781', 'm779', 'm778', 'm776', 'm774', 'm739', 'm738', 'm737',
'm736', 'm735', 'm732', 'm731', 'm730', 'm727') and starttime between
'2004-03-30 07:28:25' and '2004-03-30 08:37:33' group by point;

HashAggregate (cost=195.08..195.08 rows=1 width=25) (actual
time=98667.32..98667.37 rows=12 loops=1)
-> Index Scan using tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime on tp3
(cost=0.00..195.07 rows=1 width=25) (actual time=59.42..96366.43
rows=206238 loops=1)
Index Cond: (((host = 'm480'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm456'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm455'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm453'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm452'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm451'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm450'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm449'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm368'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm367'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm366'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm365'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm364'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm363'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm362'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm361'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm783'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm782'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm781'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm779'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm778'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm776'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm774'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm739'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm738'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm737'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm736'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm735'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm732'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm731'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)) OR ((host = 'm730'::bpchar) AND (starttime >=
'2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <=
'2004-03-30 08:37:33'::timestamp without time zone)) OR ((host =
'm727'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp
without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp
without time zone)))
Filter: (((host = 'm480'::bpchar) OR (host = 'm456'::bpchar) OR
(host = 'm455'::bpchar) OR (host = 'm453'::bpchar) OR (host =
'm452'::bpchar) OR (host = 'm451'::bpchar) OR (host = 'm450'::bpchar) OR
(host = 'm449'::bpchar) OR (host = 'm368'::bpchar) OR (host =
'm367'::bpchar) OR (host = 'm366'::bpchar) OR (host = 'm365'::bpchar) OR
(host = 'm364'::bpchar) OR (host = 'm363'::bpchar) OR (host =
'm362'::bpchar) OR (host = 'm361'::bpchar) OR (host = 'm783'::bpchar) OR
(host = 'm782'::bpchar) OR (host = 'm781'::bpchar) OR (host =
'm779'::bpchar) OR (host = 'm778'::bpchar) OR (host = 'm776'::bpchar) OR
(host = 'm774'::bpchar) OR (host = 'm739'::bpchar) OR (host =
'm738'::bpchar) OR (host = 'm737'::bpchar) OR (host = 'm736'::bpchar) OR
(host = 'm735'::bpchar) OR (host = 'm732'::bpchar) OR (host =
'm731'::bpchar) OR (host = 'm730'::bpchar) OR (host = 'm727'::bpchar))
AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone)
AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone))
Total runtime: 98760.26 msec
(5 rows)

Hackers, what could update reltuples, but not relpages?

Or, Ryan, is it possible that you already had 290M rows when
you ran ANALYSE and you have more than 1G rows today?

Yes, appears to be the case...

BTW, ANALYSE is basically a constant time operation.

Having followed the ongoing discussion about this I can concur that it
is definitely NOT O(1). Unfortunately I didn't track the "time to
vacuum" as the table grew. However I believe that I'm going to follow
the suggestions about reducing the table size, so I'll have a brand new
DB to play with in a couple weeks. Knowing what I know now, I can track
that if anyone's interested in the data besides me :)


#14 Manfred Koizar
mkoi-pg@aon.at
In reply to: Mooney, Ryan (#13)
general

On Sat, 03 Apr 2004 22:39:31 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov>
wrote:

Ok, so I ran a vacuum analyse. It took ~1.7 days to finish.

Just to make it clear: VACUUM and ANALYSE are two different commands.

VACUUM is for cleaning up. It has to visit every tuple in every page,
and if there are dead row versions, it also has to scan all indices
belonging to the table. If there are lots of deleted rows and
vacuum_mem is too small, VACUUM has to scan each index several times to
delete all index entries pointing to dead rows. This might raise the
cost to even more than O(tuplecount).

ANALYSE collects a random sample of the rows in the table; the sample
size depends on default_statistics_target and the maximum value you have
set any column's statistics target to (ALTER TABLE ... ALTER COLUMN ...
SET STATISTICS ...). If you didn't touch either, the sample size is
3000 rows. Then these 3000 rows are sorted and counted in different
ways to generate statistics.
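
For example, to enlarge the sample for a single column and then
re-analyse (a target of 100 means a sample of 30000 rows):

ALTER TABLE tp3 ALTER COLUMN starttime SET STATISTICS 100;
ANALYSE tp3;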

The number of pages that have to be touched to collect the sample
depends on the table size, but it grows far less than proportionally to
the number of pages, nblocks. The cost growth rate is greater than
O(ln(nblocks)) and significantly less than O(nblocks). I have no
simple formula for it, but I estimate that analysing your tp3 table
would need between 28000 and 30000 page reads, which should be doable in
a few minutes.

VACUUM ANALYSE is just VACUUM followed by ANALYSE, with the added
benefit that the number of rows does not have to be estimated by
ANALYSE, because VACUUM knows the exact value.

The invalid page block was caused when I tried the 2.6 kernel
(for other reasons than DB performance); it's been there for a while,
and I can deal w/ the data loss

ERROR: invalid page header in block 10257032 of "tp3_point_starttime"

AFAICS the invalid page is in an index, so there is no data loss. You
could simply drop and re-create that index. That might take some time,
though :-(
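
Using the index definition shown earlier, something like:

DROP INDEX tp3_point_starttime;
CREATE INDEX tp3_point_starttime ON tp3 (point, starttime, cooked);
-- or rebuild it from the heap in one command:
REINDEX INDEX tp3_point_starttime;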

Here is the explain analyse; you can see why I think that an index on
just host might be better - hosts are a small set, starttime is a large
set, so the index should be more efficient,

I think you got that backwards. If there are not many hosts, then an
index on host is not very selective, IOW you get a lot of hits when you
look for a particular host. OTOH if you select a sufficiently small
starttime interval, you get only a few rows, so using an index is most
efficient.

at the very least it should be (starttime, host), not (host, starttime),
unless the indexing engine is smart enough to make that not matter (I
suspect it's not???).

Yes, it should be (starttime, host). And no, PG is generally not smart
enough to use an index if there is no condition on the first index
column.
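
A sketch of the swap (the new index name is arbitrary; building it on a
table this size will take a while, so keep the old index around until
the new one is in place):

CREATE INDEX tp3_starttime_host ON tp3 (starttime, host, cooked);
DROP INDEX tp3_host_starttime;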

-> Index Scan using tp3_host_starttime, tp3_host_starttime,
[...], tp3_host_starttime on tp3
(cost=0.00..195.07 rows=1 width=25) (actual time=59.42..96366.43
rows=206238 loops=1)

Estimated number of rows: 1
Actual number of rows: 206238
The planner is way off here. Fortunately your huge number of rows makes
it rule out every other (most probably slower) plan.

How many hosts are there? Even if there are a few hundred, an index
scan with that condition has to access and skip millions of index
tuples. An index on (starttime, host) would visit fewer index tuples,
and would more likely access the heap tuples in physical order.

Having followed the ongoing discussion about this I can concur that it
is definitely NOT O(1). Unfortunately I didn't track the "time to
vacuum"

The discussion on -hackers and the patch I posted earlier today are
about ANALYSE, not VACUUM.

However I believe that I'm going to follow the suggestions
about reducing the table size, so I'll have a brand new DB to play with
in a couple weeks,

Hopefully we'll see a success story here.

so knowing what I know now, I can track that if anyone's
interested in the data besides me :)

VACUUM and ANALYSE times? Certainly.

Servus
Manfred

#15 Ericson Smith
eric@did-it.com
In reply to: Manfred Koizar (#14)
general

I've been following this thread with interest since it started, and it
really seems that there is just too much data in that single table. When
it comes down to it, making smaller separate tables seems to be the way
to go. Querying will be a little harder, but much faster.
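
On 7.4 one way to do this is table inheritance, e.g. one child table per
time period (the names here are illustrative; note that the planner has
no constraint exclusion, so queries against the parent still visit every
child, and time-bounded queries should go to the right child directly):

CREATE TABLE tp3_2004_03 () INHERITS (tp3);
CREATE TABLE tp3_2004_04 () INHERITS (tp3);
CREATE INDEX tp3_2004_04_starttime_host ON tp3_2004_04 (starttime, host);

-- the loader daemons would insert into the current period's child;
-- a SELECT on the parent tp3 still sees the rows of all children
SELECT count(*) FROM tp3_2004_04;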

Warmest regards, 
Ericson Smith
Tracking Specialist/DBA
+-----------------------+---------------------------------+
| http://www.did-it.com | "When you have to shoot, shoot, |
| eric@did-it.com       | don't talk!             - Tuco  |
| 516-255-0500          |                                 |
+-----------------------+---------------------------------+ 


#16 Noname
Bob.Henkel@hartfordlife.com
In reply to: Ericson Smith (#15)
general

I'm a fairly new PostgreSQL user, and a long time Oracle user, so keep
that in mind.

So you are telling me that for 206238 rows in a table the best route is
to break the table into separate tables? To me that is horrible to have
to do unless you really wanted to do that for some reason. And unless
the count is way higher than I read, how is 206238 a large amount of
rows in a table?

#17 Mooney, Ryan
ryan.mooney@pnl.gov
In reply to: Noname (#16)
general

Actually in the case referenced I have over 1.4 billion rows. The
206238 row example should be no problem. I had no realistic performance
issues (based on disk I/O bandwidth, etc.) whatsoever in the hundreds of
millions of rows (even with sub-optimal indexes). My average query has
to reference 2-30 million rows, so yes that can take up to a few
minutes, but even dd'ing that amount of data to/from the disk array took
almost as long (that's what I used as my base benchmark; if my query
approaches that I don't see any room to complain :>)

I could only really see 206K rows being a major issue IF they were
really large rows AND you were doing really complex queries AND the
indexes were really poorly defined, or you were on REALLY slow hardware.
Unfortunately I didn't follow that thread as closely, so I don't
remember exactly what the issues expressed there were.

Show quoted text

-----Original Message-----
From: Bob.Henkel@hartfordlife.com
[mailto:Bob.Henkel@hartfordlife.com]
Sent: Tuesday, April 06, 2004 2:17 PM
To: Ericson Smith
Cc: Manfred Koizar; pgsql-general@postgresql.org;
pgsql-general-owner@postgresql.org; Mooney, Ryan
Subject: Re: [GENERAL] Large DB

I'm a fairly new Postgresql user. And a long time Oracle
user so keep that in mind.

So you are telling me that for 206238 rows on a table the
best route is to break the table into separate tables? To me
that is horrible to have to do unless you really wanted to do
that for some reason. And unless the count is way higher
then I read how is 206238 a large amout of rows on a table?

|---------+---------------------------------->
| | Ericson Smith |
| | <eric@did-it.com> |
| | Sent by: |
| | pgsql-general-owner@pos|
| | tgresql.org |
| | |
| | |
| | 04/06/2004 08:27 AM |
| | |
|---------+---------------------------------->

-------------------------------------------------------------

-----------------------------------------------------------------|
|
|
| To: Manfred Koizar <mkoi-pg@aon.at>
|
| cc: "Mooney, Ryan" <ryan.mooney@pnl.gov>,
pgsql-general@postgresql.org
|
| Subject: Re: [GENERAL] Large DB
|

-------------------------------------------------------------

-----------------------------------------------------------------|

I've been following this thread with interest since it
started, and it really seems that there is just too much data
in that single table. When it comes down to it, making
smaller separate tables seems to be the way to go. Querying
will be a little harder, but much faster.

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+---------------------------------+
| http://www.did-it.com | "When you have to shoot, shoot, |
| eric@did-it.com       | don't talk!             - Tuco  |
| 516-255-0500          |                                 |
+-----------------------+---------------------------------+

Manfred Koizar wrote:

On Sat, 03 Apr 2004 22:39:31 -0800, "Mooney, Ryan"
<ryan.mooney@pnl.gov>
wrote:

Ok, so I ran a vacuum analyse. It took ~1.7 days to finish.

Just to make it clear: VACUUM and ANALYSE are two different

commands.

VACUUM is for cleaning up. It has to visit every tuple in

every page,

and if there are dead row versions, it also has to scan all indices
belonging to the table. If there are lots of deleted rows and
vacuum_mem is to small, VACUUM has to scan each index

several times to

delete all index entries pointing to dead rows. This might

raise the

cost to even more than O(tuplecount).

ANALYSE collects a random sample of the rows in the table,

the sample

size depends on default_statistics_target and the maximum value you
have set any column's statistics target to (ALTER TABLE ... ALTER
COLUMN ... SET STATISTICS ...). If you didn't touch either,

the sample

size is 3000 rows. Then these 3000 rows are sorted and counted in
different ways to generate statistics.

The number of pages that have to be touched to collect the sample
depends on the table size, but it does by far not grow

proportionally

to the number of pages, nblocks. The cost growth rate is

greater than

O(ln(nblocks)) and significantly lesser than O(nblocks). I have no
simple formula for it, but I estimate that analysing your tp3 table
would need between 28000 and 30000 page reads, which should

be doable

in a few minutes.

VACUUM ANALYSE is just VACUUM followed by ANALYSE with the added
benefit, that the number of rows does not have to be estimated by
ANALYSE, because VACUUM knows the exact value.

The invalid page block was caused when I tried the 2.6 kernel (for
other reasons than DB performance); it's been there for a while, and
I can deal w/ the data loss.

ERROR: invalid page header in block 10257032 of "tp3_point_starttime"

AFAICS the invalid page is in an index, so there is no data loss. You
could simply drop and re-create that index. That might take some
time, though :-(
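
A minimal sketch of that repair, assuming from its name that
tp3_point_starttime is an ordinary btree index on (point, starttime);
the column list is a guess, so check the actual definition (e.g. with
\d tp3 in psql) first:

    DROP INDEX tp3_point_starttime;
    -- assumed column list; adjust to the real definition
    CREATE INDEX tp3_point_starttime ON tp3 (point, starttime);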

Here is the explain analyse; you can see why I think that an index on
just host might be better - hosts are a small set, starttime is a
large set, so the index should be more efficient,

I think you got that backwards. If there are not many hosts, then an
index on host is not very selective, IOW you get a lot of hits when
you look for a particular host. OTOH if you select a sufficiently
small starttime interval, you get only a few rows, so using an index
is most efficient.

at the very least it should be (starttime, host), not (host,
starttime) unless the indexing engine is smart enough to make that
not matter (I suspect it's not???).

Yes, it should be (starttime, host). And no, PG is generally not
smart enough to use an index if there is no condition on the first
index column.
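
A sketch of the recommended index; the index name is illustrative,
and only the column order matters:

    -- range condition on starttime goes on the leading column
    CREATE INDEX tp3_starttime_host ON tp3 (starttime, host);

With starttime leading, a query like the one in this thread scans one
contiguous slice of the index instead of one (host, starttime) range
per host in the IN list.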

-> Index Scan using tp3_host_starttime, tp3_host_starttime,
[...], tp3_host_starttime on tp3
(cost=0.00..195.07 rows=1 width=25) (actual time=59.42..96366.43
rows=206238 loops=1)

Estimated number of rows: 1
Actual number of rows: 206238
The planner is way off here. Fortunately your huge number of rows
makes it rule out every other (most probably slower) plan.

How many hosts are there? Even if there are a few hundred, an index
scan with that condition has to access and skip millions of index
tuples. An index on (starttime, host) would visit fewer index tuples,
and would more likely access the heap tuples in physical order.
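
Two quick ways to answer the how-many-hosts question; pg_stats is the
standard statistics view, and its n_distinct estimate is only as
fresh as the last ANALYSE:

    SELECT count(DISTINCT host) FROM tp3;  -- exact, but scans the table

    SELECT n_distinct FROM pg_stats
     WHERE tablename = 'tp3' AND attname = 'host';  -- instant estimate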

Having followed the ongoing discussion about this I can concur that
it is definitely NOT O(1). Unfortunately I didn't track the "time to
vacuum".

The discussion on -hackers and the patch I posted earlier today are
about ANALYSE, not VACUUM.

However I believe that I'm going to follow the suggestions about
reducing the table size, so I'll have a brand new DB to play with in
a couple weeks,

Hopefully we'll see a success story here.

so knowing what I know now, I can track that if anyone's interested
in the data besides me :)

VACUUM and ANALYSE times? Certainly.

Servus
Manfred



#18Gavin M. Roy
gmr@ehpg.net
In reply to: Noname (#16)
general

I don't think that's what's being said; I run tables with tens of
millions of rows without issue.

Gavin

Bob.Henkel@hartfordlife.com wrote:

I'm a fairly new PostgreSQL user, and a long-time Oracle user, so keep
that in mind.

So you are telling me that for 206238 rows on a table the best route
is to break the table into separate tables? To me that is horrible to
have to do unless you really wanted to do that for some reason. And
unless the count is way higher than I read, how is 206238 a large
amount of rows on a table?



#19Ericson Smith
eric@did-it.com
In reply to: Gavin M. Roy (#18)
general

My point was that if multi-minute query response times are OK, then
you can continue. But if the row sizes are really large, or you have
lots of text data, or there are tons of updates, and you want
sub-second response times -- then I see no option but for you to beef
up hardware dramatically, or indeed split the data into smaller
tables that you can still logically query.

That's what we were forced to do -- and trust me, we looked at our
schemas for over a year before deciding to do that. It paid off
handsomely. It's not such a horrible alternative as you might think.
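
A sketch of one such split for the thread's tp3 data: one table per
month plus a UNION ALL view for whole-history queries. The column
definitions are guesses pieced together from the thread, so treat
them as placeholders:

    -- one table per month; column types are assumed, not confirmed
    CREATE TABLE tp3_2004_03 (host char(5), point varchar(25),
                              starttime timestamp, pvalue numeric);
    CREATE TABLE tp3_2004_04 (host char(5), point varchar(25),
                              starttime timestamp, pvalue numeric);

    -- whole-history queries go through the view;
    -- time-bounded queries can hit the matching month table directly
    CREATE VIEW tp3_all AS
        SELECT * FROM tp3_2004_03
        UNION ALL
        SELECT * FROM tp3_2004_04;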

- Ericson Smith

Gavin M. Roy wrote:

I don't think that's what's being said; I run tables with tens of
millions of rows without issue.

Gavin
