Huge Data

Started by Sezai YILMAZ · about 22 years ago · 14 messages · general
#1Sezai YILMAZ
sezai.yilmaz@pro-g.com.tr

Hi,

I use PostgreSQL 7.4 to store a huge amount of data, for example 7
million rows. But when I run the query "select count(*) from table;", it
returns after about 120 seconds. Is this normal for such a huge
table? Are there any methods to speed up the query time? The huge
table has an integer primary key and some other indexes on other columns.

The hardware is: PIII 800 MHz processor, 512 MB RAM, and IDE hard disk
drive.

-sezai

#2Richard Huxton
dev@archonet.com
In reply to: Sezai YILMAZ (#1)
Re: Huge Data

On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:

Hi,

I use PostgreSQL 7.4 to store a huge amount of data, for example 7
million rows. But when I run the query "select count(*) from table;", it
returns after about 120 seconds. Is this normal for such a huge
table? Are there any methods to speed up the query time? The huge
table has an integer primary key and some other indexes on other columns.

PG uses MVCC to manage concurrency. A downside of this is that to verify the
exact number of rows in a table you have to visit them all.

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

--
Richard Huxton
Archonet Ltd

#3Sezai YILMAZ
sezai.yilmaz@pro-g.com.tr
In reply to: Richard Huxton (#2)
Re: Huge Data

Richard Huxton wrote:

On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:

Hi,

I use PostgreSQL 7.4 to store a huge amount of data, for example 7
million rows. But when I run the query "select count(*) from table;", it
returns after about 120 seconds. Is this normal for such a huge
table? Are there any methods to speed up the query time? The huge
table has an integer primary key and some other indexes on other columns.

PG uses MVCC to manage concurrency. A downside of this is that to verify the
exact number of rows in a table you have to visit them all.

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

I use count() for some statistics, just to show how many records have
been collected so far.

-sezai

#4Sezai YILMAZ
sezai.yilmaz@pro-g.com.tr
In reply to: Richard Huxton (#2)
Re: Huge Data

Richard Huxton wrote:

PG uses MVCC to manage concurrency. A downside of this is that to verify the
exact number of rows in a table you have to visit them all.

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

select logid, agentid, logbody from log where logid=3000000;

this query also returns after about 120 seconds. The log table has about
7 million records, and logid is its primary key. Why is it so slow?

-sezai

#5Shridhar Daithankar
shridhar_daithankar@myrealbox.com
In reply to: Sezai YILMAZ (#3)
Re: Huge Data

On Wednesday 14 January 2004 17:57, Sezai YILMAZ wrote:

Richard Huxton wrote:

What are you using the count() for?

I use count() for some statistics, just to show how many records have
been collected so far.

Rather than doing count(*), you should either cache the count in application
memory,

or analyze often and use the following:

select reltuples from pg_class where relname = 'foo';

This would give you an approximate count. I believe it should suffice for your
needs.
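As a sketch (the table name "log" is assumed from this thread, and the
estimate is only refreshed when the table is analyzed or vacuumed):

```sql
-- Refresh planner statistics; this also updates pg_class.reltuples.
ANALYZE log;

-- Read the estimated row count instead of scanning 7 million rows.
SELECT reltuples FROM pg_class WHERE relname = 'log';
```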

HTH

Shridhar

#6Matthew Lunnon
mlunnon@rwa-net.co.uk
In reply to: Sezai YILMAZ (#1)
Re: Huge Data

Have you run 'vacuum analyze log;'? Also, I believe that in Oracle count(1) used to be quicker than count(*).
Matthew
----- Original Message -----
From: Sezai YILMAZ
To: Richard Huxton
Cc: pgsql-general@postgresql.org
Sent: Wednesday, January 14, 2004 12:39 PM
Subject: Re: [GENERAL] Huge Data

Richard Huxton wrote:

PG uses MVCC to manage concurrency. A downside of this is that to verify the
exact number of rows in a table you have to visit them all.

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

select logid, agentid, logbody from log where logid=3000000;

this query also returns after about 120 seconds. The log table has about
7 million records, and logid is its primary key. Why is it so slow?

-sezai



#7Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Sezai YILMAZ (#4)
Re: Huge Data

On Wednesday 14 January 2004 18:09, Sezai YILMAZ wrote:

Richard Huxton wrote:

PG uses MVCC to manage concurrency. A downside of this is that to verify
the exact number of rows in a table you have to visit them all.

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

select logid, agentid, logbody from log where logid=3000000;

How about

select logid, agentid, logbody from log where logid='3000000';

or

select logid, agentid, logbody from log where logid=3000000::int4;

Basically you need to typecast the constant. Then it would use the index.

I am not sure about the first form, though. I recommend you use the latter form.
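To see whether the planner actually switches to an index scan, you can
EXPLAIN both forms (a sketch; plan output varies by version and data, and
the cast should match logid's declared type -- int8 below is an assumption):

```sql
-- Untyped integer constant: on 7.4 this can fall back to a sequential
-- scan when logid is not a plain int4.
EXPLAIN SELECT logid, agentid, logbody FROM log WHERE logid = 3000000;

-- Constant cast to the column's declared type, so the primary key
-- index can be used:
EXPLAIN SELECT logid, agentid, logbody FROM log WHERE logid = 3000000::int8;
```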

Shridhar

#8Shridhar Daithankar
shridhar_daithankar@myrealbox.com
In reply to: Matthew Lunnon (#6)
Re: Huge Data

On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:

select logid, agentid, logbody from log where logid=3000000;

this query also returns after about 120 seconds. The log table has about
7 million records, and logid is its primary key. Why is it so slow?

How about

select logid, agentid, logbody from log where logid='3000000';

or

select logid, agentid, logbody from log where logid=3000000::int4;

Basically you need to typecast the constant. Then it would use the index.

I am not sure about the first form, though. I recommend you use the latter form.

Shridhar

#9Richard Huxton
dev@archonet.com
In reply to: Sezai YILMAZ (#4)
Re: Huge Data

On Wednesday 14 January 2004 12:39, Sezai YILMAZ wrote:

select logid, agentid, logbody from log where logid=3000000;

At a guess, because logid is bigint, whereas 3000000 is taken to be an integer.
Try ... where logid = 3000000::bigint;

This is in the FAQ too I think, and is certainly in the archives.

Other things you might come across:
SELECT max() involves a sequential scan just like count(); you can rewrite it
as SELECT target_column FROM my_table ORDER BY target_column DESC LIMIT 1

The config values are very conservative. You will definitely want to tune them
for performance. See the articles here for a good introduction:
http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php

The VACUUM command is used to reclaim unused space, and the ANALYZE command to
regenerate statistics. It's worth reading up on both.

You can use EXPLAIN ANALYSE <query here> to see the plan that PG uses. I think
there's a discussion of it at http://techdocs.postgresql.org/
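For instance, with the log table from this thread (a sketch):

```sql
-- Index-friendly substitute for SELECT max(logid) FROM log:
SELECT logid FROM log ORDER BY logid DESC LIMIT 1;

-- Show the chosen plan plus actual run times for any query:
EXPLAIN ANALYSE SELECT logid FROM log ORDER BY logid DESC LIMIT 1;
```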

--
Richard Huxton
Archonet Ltd

#10Richard Huxton
dev@archonet.com
In reply to: Sezai YILMAZ (#3)
Re: Huge Data

On Wednesday 14 January 2004 12:27, Sezai YILMAZ wrote:

Richard Huxton wrote:

There's plenty on this in the archives, and probably the FAQ too.

What are you using the count() for?

I use count() for some statistics, just to show how many records have
been collected so far.

If you want an accurate number without scanning the table, you'll need to use
a trigger to keep a count up to date.
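A rough sketch of such a trigger (the counter table and function names are
made up here, and the $$ quoting shown needs PostgreSQL 8.0 or later -- on
7.4 the function body would go in single quotes):

```sql
-- One-row table holding the current count, seeded once.
CREATE TABLE log_rowcount (n bigint NOT NULL);
INSERT INTO log_rowcount SELECT count(*) FROM log;

-- Trigger function keeps the counter in step with inserts/deletes.
CREATE FUNCTION log_count_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE log_rowcount SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE log_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_count AFTER INSERT OR DELETE ON log
    FOR EACH ROW EXECUTE PROCEDURE log_count_trig();

-- The count is now a single-row read:
SELECT n FROM log_rowcount;
```

Note the trade-off: every insert and delete now pays for an extra update,
and concurrent writers serialize on the counter row.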

--
Richard Huxton
Archonet Ltd

#11Sezai YILMAZ
sezai.yilmaz@pro-g.com.tr
In reply to: Shridhar Daithankar (#5)
Re: Huge Data

Shridhar Daithankar wrote:

Rather than doing count(*), you should either cache the count in application
memory,

or analyze often and use the following:

select reltuples from pg_class where relname = 'foo';

Thank you very much Shridhar. This one responds immediately. I think I
will use this method for gathering the row count. But I am unhappy about
breaking the SQL standard: the code will become unportable.

-sezai

#12Franco Bruno Borghesi
franco@akyasociados.com.ar
In reply to: Sezai YILMAZ (#11)
Re: Huge Data

If the mentioned solution fits your needs, you could create a stored
procedure that wraps it. The PostgreSQL implementation could select from
pg_class, while the same function in another database could execute
select count(*) on the table.
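A sketch of that idea (the function name is hypothetical, and the $$
quoting needs PostgreSQL 8.0+ -- on 7.4 the body goes in single quotes):

```sql
-- PostgreSQL version: return the planner's estimate from pg_class.
CREATE FUNCTION rowcount_estimate(tbl text) RETURNS bigint AS $$
    SELECT reltuples::bigint FROM pg_class WHERE relname = $1;
$$ LANGUAGE sql;

-- On another database, a function of the same name would instead
-- run the real count(*), keeping application code unchanged.
SELECT rowcount_estimate('log');
```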

On Wed, 2004-01-14 at 10:25, Sezai YILMAZ wrote:


Shridhar Daithankar wrote:

Rather than doing count(*), you should either cache the count in application
memory,

or analyze often and use the following:

select reltuples from pg_class where relname = 'foo';

Thank you very much Shridhar. This one responds immediately. I think I
will use this method for gathering the row count. But I am unhappy about
breaking the SQL standard: the code will become unportable.

-sezai


#13Sezai YILMAZ
sezai.yilmaz@pro-g.com.tr
In reply to: Shridhar Daithankar (#8)
Re: Huge Data

Shridhar Daithankar wrote:

On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:

select logid, agentid, logbody from log where logid=3000000;

this query also returns after about 120 seconds. The log table has about
7 million records, and logid is its primary key. Why is it so slow?

How about

select logid, agentid, logbody from log where logid='3000000';

Oh my god. It is unbelievable. The result is great. Thanks to everyone
who helped me.

-sezai

#14Shridhar Daithankar
shridhar_daithankar@myrealbox.com
In reply to: Sezai YILMAZ (#11)
Re: Huge Data

On Wednesday 14 January 2004 18:55, Sezai YILMAZ wrote:

Shridhar Daithankar wrote:

Rather than doing count(*), you should either cache the count in
application memory,

or analyze often and use the following:

select reltuples from pg_class where relname = 'foo';

Thank you very much Shridhar. This one responds immediately. I think I
will use this method for gathering the row count. But I am unhappy about
breaking the SQL standard: the code will become unportable.

Well, you could document it somewhere for your reference. It is not that
hard. :-)

And remember, the value you get is just an estimate. You need to analyze the
table often, in proportion to its update/insert/delete activity, to keep the
estimate reasonably accurate. Vacuuming also updates the estimate.

Shridhar