Optimising SELECT on a table with one million rows
Hi,
I'm fairly new to PostgreSQL, so I am not sure if the performance
problems I'm having are due to poorly constructed queries/indices,
or if I have bumped into more fundamental problems requiring a redesign of
my database structure. That's why I'm requesting your help.
Here's the situation: I have three tables: Users, Stories, and Comments.
Stories have an author (a user), and a comment is associated with a
story and with the user who posted it. The structure of the database
is therefore fairly simple: (there are also some sequences, which I've
omitted for clarity)
CREATE TABLE users
(
    user_id int UNIQUE NOT NULL,
    user_name text,
    PRIMARY KEY (user_id)
);
CREATE TABLE stories
(
    story_id int UNIQUE NOT NULL,
    story_title text,
    story_body text,
    story_timestamp timestamptz,
    story_author int REFERENCES users (user_id) NOT NULL,
    PRIMARY KEY (story_id)
);
CREATE TABLE comments
(
    comment_id int UNIQUE NOT NULL,
    comment_title text,
    comment_body text,
    comment_timestamp timestamptz,
    comment_story int REFERENCES stories (story_id) NOT NULL,
    comment_author int REFERENCES users (user_id) NOT NULL,
    PRIMARY KEY (comment_id)
);
I've also populated the database with some test data, comprising 1,000
users, 1,000 stories (one per user), and 1,000,000 comments (one comment
per user per story).
Now, the query I wish to optimise is also simple: get me all comments (the
comment_id suffices) and corresponding user *names* for a given story.
If for example the story_id is 100, the query looks like this:
SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;
The problem is that this query takes a *very* long time. With the said
1,000,000 comments, it needs at least 1100ms on my system. "Explain
analyze" tells me that a sequential scan is being performed on both
users and comments:
Hash Join  (cost=28.50..21889.09 rows=988 width=14) (actual time=3.674..1144.779 rows=1000 loops=1)
  Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
  ->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
        Filter: ((comment_story)::integer = 100)
  ->  Hash  (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425 rows=1000 loops=1)
        ->  Seq Scan on users  (cost=0.00..16.00 rows=1000 width=14) (actual time=0.068..1.845 rows=1000 loops=1)
Total runtime: 1146.424 ms
In the long run, I guess one possible solution to this problem will be
to partition the comments table into a number of sub-tables, most likely
based on the timestamp attribute (current versus historic data).
Nevertheless, I am wondering whether there are more straightforward ways
to optimise this query. Some clever use of indices, perhaps? Or is
the way I am constructing the SELECT non-optimal? Or do I need
some pixie-magic in the PostgreSQL settings? Anyway, any suggestions
are welcome! (and thanks in advance)
Regards,
C.S.
First question... did you create the appropriate indexes on the appropriate
columns for these tables? Foreign keys do not implicitly create indexes in
Postgres.
Bryan
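For example, a minimal sketch of what that would look like (the index names
here are just illustrative, pick your own convention):

```sql
-- Indexes on the FK columns the query filters and joins on
-- (names are illustrative):
CREATE INDEX comments_comment_story_idx  ON comments (comment_story);
CREATE INDEX comments_comment_author_idx ON comments (comment_author);

-- Refresh planner statistics, then re-check the plan:
ANALYZE comments;
EXPLAIN ANALYZE
SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100
  AND comments.comment_author = users.user_id;
```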
Cultural Sublimation wrote:
The problem is that this query takes a *very* long time. With the said
1,000,000 comments, it needs at least 1100ms on my system. "Explain
analyze" tells me that a sequential scan is being performed on both
users and comments:
What else is it supposed to do? You haven't created any indexes. I'm
also guessing that you haven't analysed the tables either.
--
Richard Huxton
Archonet Ltd
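For reference, analysing the tables is a one-liner each (a bare ANALYZE with
no argument covers every table in the database):

```sql
-- Refresh the planner's statistics for the three tables:
ANALYZE users;
ANALYZE stories;
ANALYZE comments;
```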
On Jul 30, 12:01 pm, cultural_sublimat...@yahoo.com (Cultural
Sublimation) wrote:
->  Seq Scan on comments  (cost=0.00..21847.00 rows=988 width=8) (actual time=0.185..1136.067 rows=1000 loops=1)
      Filter: ((comment_story)::integer = 100)
Total runtime: 1146.424 ms
Create an index on the comments.comment_story column.
Cultural Sublimation wrote:
CREATE TABLE comments
(
    comment_id int UNIQUE NOT NULL,
    comment_title text,
    comment_body text,
    comment_timestamp timestamptz,
    comment_story int REFERENCES stories (story_id) NOT NULL,
    comment_author int REFERENCES users (user_id) NOT NULL,
    PRIMARY KEY (comment_id)
);
You need indices on comments.comment_story (and probably later on
comment_author). You should ALWAYS add an index on a FOREIGN KEY column
unless you have a very good reason not to. So:
CREATE INDEX comments_story_idx ON comments (comment_story);
CREATE INDEX comments_author_idx ON comments (comment_author);
CREATE INDEX stories_author_idx ON stories (story_author);
The first of these should remove the need for a seqscan on comments for
your query. The seqscan on users is not a problem: you are returning
data from all the rows, so a seqscan is the smart thing to do.
Nis
Cultural Sublimation wrote:
SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;

The problem is that this query takes a *very* long time. With the said
1,000,000 comments, it needs at least 1100ms on my system. "Explain
analyze" tells me that a sequential scan is being performed on both
users and comments.
If you have no index on comments.comment_author, then a seqscan will be
required for your join between comments and users. Similarly, if you
have no index on comments.comment_story, then any query against comments
that uses that column as part of a predicate will require a seqscan of
the comments table.
Note that an FK constraint does not automatically create an index on the
underlying column. You need to create the actual index yourself if it
will be necessary for your queries.
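You can see which indexes actually exist on a table through the pg_indexes
system view, e.g.:

```sql
-- List the indexes currently defined on comments:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'comments';
```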
On the long run, I guess one possible solution to this problem will be
to partition the comments table into a number of sub-tables, most likely
based on the timestamp attribute (by having current versus historic data).
Partitioning on comments.comment_timestamp won't help you at all for
this particular query, since you don't have a condition in your query
dependent upon that value. It might help you for other queries (such as
gathering up all the comments posted on a particular day, or during some
other time range), but it won't make any positive difference for this query.
-Jon
--
Senior Systems Developer
Media Matters for America
http://mediamatters.org/
Hi,
If you have no index on comments.comment_author, then a seqscan will be
required for your join between comments and users. Similarly, if you
have no index on comments.comment_story, then any query against comments
that uses that column as part of a predicate will require a seqscan of
the comments table.
I see. As I said, I'm still fairly new to this...
Note that an FK constraint does not automatically create an index on the
underlying column. You need to create the actual index yourself if it
will be necessary for your queries.
I see what you mean. The basic idea then is to take a look at the typical
queries and to create indices based on them. Is there a good guide on index
creation for optimisation purposes?
Partitioning on comments.comment_timestamp won't help you at all for
this particular query, since you don't have a condition in your query
dependent upon that value. It might help you for other queries (such as
gathering up all the comments posted on a particular day, or during some
other time range), but it won't make any positive difference for this query.
You are right. Come to think of it, partitioning the comments table based
on comment_story might make more sense, since the overwhelming majority of
queries will be like the one I just mentioned: asking for all comments of
a given story.
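If you do go that route, one possible shape is inheritance-based partitioning
(there is no declarative partitioning to lean on here). Purely a sketch; all
table names and range boundaries below are made up:

```sql
-- Child tables carrying non-overlapping ranges of comment_story:
CREATE TABLE comments_s0 (
    CHECK (comment_story >= 0 AND comment_story < 500)
) INHERITS (comments);

CREATE TABLE comments_s500 (
    CHECK (comment_story >= 500 AND comment_story < 1000)
) INHERITS (comments);

-- Each child needs its own index on the partitioning column:
CREATE INDEX comments_s0_story_idx   ON comments_s0 (comment_story);
CREATE INDEX comments_s500_story_idx ON comments_s500 (comment_story);

-- Let the planner skip children whose CHECK constraint excludes them:
SET constraint_exclusion = on;
```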
Anyway, thanks a lot for your help! (And that goes for all the other people
who have also given their two cents.)
Best regards,
C.S.