"critical mass" reached?

Started by Alex Howansky, about 25 years ago. 6 messages. List: general.
#1 Alex Howansky
alex@wankwood.com

Using 7.0.3, I've got a database that has about 30 tables. One in particular
seems to be giving us problems. It's a pretty simple table with one index that
logs certain web site activity. It gets about 100K insertions per day, and now
has about 6 million records.

All of a sudden (yesterday), we seem to have reached a "critical mass". No
other tables or code have changed significantly (or more than normal). However,
the database performance has abruptly become abysmal -- the server which
previously hadn't broken a load average of 4 now spikes continuously between 20
and 40, rarely dropping below 10. Web server logs show normal activity. Also,
in the pgsql/data/base/dbname directory, I'm getting hundreds of pg_sorttemp
and pg_noname files lying around.

I thought there might be some data or index corruption, so I've even gone so
far as to initdb and recreate the database from scratch, using a previous
pg_dump output, but it has not helped.

Six million tables doesn't seem like it should be too much of a problem, but we
suspect this table might be the cause of the problem because it's the only one
that changes significantly from day to day. Memory is ok, there is no swapping,
disk space is plentiful, I don't know where else to look. Any ideas?

--
Alex Howansky
Wankwood Associates
http://www.wankwood.com/

#2 Richard Huxton
dev@archonet.com
In reply to: Alex Howansky (#1)
Re: "critical mass" reached?

From: "Alex Howansky" <alex@wankwood.com>

> Using 7.0.3, I've got a database that has about 30 tables. One in particular
> seems to be giving us problems. It's a pretty simple table with one index that
> logs certain web site activity. It gets about 100K insertions per day, and now
> has about 6 million records.

> All of a sudden (yesterday), we seem to have reached a "critical mass". No
> other tables or code have changed significantly (or more than normal). However,
> the database performance has abruptly become abysmal -- the server which
> previously hadn't broken a load average of 4 now spikes continuously between 20
> and 40, rarely dropping below 10. Web server logs show normal activity. Also,
> in the pgsql/data/base/dbname directory, I'm getting hundreds of pg_sorttemp
> and pg_noname files lying around.

Presumably you're running vacuum analyze regularly (at least once a day I'd
guess) so I can only suspect that something has tipped the balance in the
cost estimations. Is there a particular query that's slow and can you post
an EXPLAIN?
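A minimal sketch of what's being asked for, assuming a hypothetical log table named `hits` with an indexed timestamp column `created` (both names are made up for illustration):

```sql
-- Collect the optimizer's plan for one representative slow query:
EXPLAIN SELECT * FROM hits WHERE created > '2001-02-01';

-- And the table definition to post alongside it (from psql):
-- \d hits
```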

> I thought there might be some data or index corruption, so I've even gone so
> far as to initdb and recreate the database from scratch, using a previous
> pg_dump output, but it has not helped.

Looks like you've ruled out damage to the DB. What happens if you delete 3
million of the records in your log-table?
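Again assuming a hypothetical `hits` table with a `created` timestamp column, the experiment suggested here might look like:

```sql
-- Drop the older half of the log rows, then rebuild planner statistics
-- so the cost estimates reflect the smaller table:
DELETE FROM hits WHERE created < '2000-12-01';
VACUUM ANALYZE hits;
```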

> Six million tables doesn't seem like it should be too much of a problem, but we
> suspect this table might be the cause of the problem because it's the only one
> that changes significantly from day to day. Memory is ok, there is no swapping,
> disk space is plentiful, I don't know where else to look. Any ideas?

Six million _tables_ is a lot, but you're right, 6M records is pretty small
compared to what some people are using.

See if you can't post an EXPLAIN of a problem query and the relevant table
defs.

- Richard Huxton

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Howansky (#1)
Re: "critical mass" reached?

Alex Howansky <alex@wankwood.com> writes:

> [ lots of bad stuff ]

Hm. As Richard remarks, 6M records is not an especially big table;
there are people running larger ones. The leftover sorttemp files sound
like you are suffering backend crashes --- but you didn't mention
anything about unexpected disconnects.

The postmaster log would be a good place to look for more info (if
you're not keeping one, turn it on). Also, make sure the postmaster
is not being run with an environment of "ulimit -c 0" ... if the
backends are crashing, we want to get some core files so we can see
what's happening.
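The environment check Tom describes can be done from the shell that launches the postmaster; a sketch (whether raising the limit succeeds depends on the hard limit set for that account):

```shell
# In the shell that starts the postmaster, inspect the core-file size limit.
ulimit -c              # "0" means core dumps are suppressed
ulimit -c unlimited    # lift the soft limit, if the hard limit permits
ulimit -c              # confirm the new value before starting the postmaster
```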

regards, tom lane

#4 Alex Howansky
alex@wankwood.com
In reply to: Richard Huxton (#2)
Re: "critical mass" reached?

> Presumably you're running vacuum analyze regularly (at least once a day I'd
> guess) so I can only suspect that something has tipped the balance in the
> cost estimations. Is there a particular query that's slow and can you post
> an EXPLAIN?

Oops, yes, sorry, forgot to mention that. Vacuum analyze is run nightly. There is
not just one particular query that runs slow -- it's the database as a whole
(while apparently under the same average everyday load).
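For reference, a nightly run like the one described would typically be a crontab entry along these lines (the path and database name are placeholders):

```
# crontab entry: vacuum and analyze the whole database at 3:30 each night
30 3 * * * /usr/local/pgsql/bin/vacuumdb --analyze dbname
```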

> Looks like you've ruled out damage to the DB. What happens if you delete 3
> million of the records in your log-table?

We haven't got that far yet. I was hoping to get some other ideas prior to
doing something so drastic, but we'll try it ...

> Six million _tables_ is a lot, but you're right, 6M records is pretty small
> compared to what some people are using.

Oops again. I gotta stop trying to debug at 3am... :)

--
Alex Howansky
Wankwood Associates
http://www.wankwood.com/

#5 Alex Howansky
alex@wankwood.com
In reply to: Tom Lane (#3)
Re: "critical mass" reached?

> Hm. As Richard remarks, 6M records is not an especially big table;
> there are people running larger ones. The leftover sorttemp files sound
> like you are suffering backend crashes --- but you didn't mention
> anything about unexpected disconnects.

I haven't noticed any myself, but the majority of our connections come from a
public web-based app -- so I can't really tell if the consumer is experiencing
problems or not.

> The postmaster log would be a good place to look for more info (if
> you're not keeping one, turn it on).

I have a debug level 2 log of almost the entire day's activity. I scanned it
briefly but found nothing (it's 180 meg). Is there anything in particular I
should be looking for?

> Also, make sure the postmaster is not being run with an environment of
> "ulimit -c 0" ... if the backends are crashing, we want to get some core
> files so we can see what's happening.

Ok, will verify. Thanks.

--
Alex Howansky
Wankwood Associates
http://www.wankwood.com/

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex Howansky (#5)
Re: "critical mass" reached?

Alex Howansky <alex@wankwood.com> writes:

>> The postmaster log would be a good place to look for more info (if
>> you're not keeping one, turn it on).

> I have a debug level 2 log of almost the entire day's activity.

Oh good. Hang onto that.

> I scanned it briefly but found nothing (it's 180 meg). Is there anything
> in particular I should be looking for?

Unusual errors, reports of subprocess crashes ...
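In practice that scan boils down to a grep over the log. The sample lines below are fabricated stand-ins for the kind of messages to look for; point the grep at the real 180 MB log instead:

```shell
# Build a tiny sample log to demonstrate the pattern (stand-in lines only):
cat > /tmp/postmaster.sample <<'EOF'
DEBUG:  query: SELECT count(*) FROM hits
FATAL 1:  Memory exhausted in AllocSetAlloc()
Server process (pid 1234) exited with status 11
EOF

# Case-insensitive scan for crash indicators; prints the two bad lines above:
grep -Ei 'fatal|exited|signal|abort' /tmp/postmaster.sample
```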

regards, tom lane