Unacceptable postgres performance vs. Microsoft sqlserver

Started by tosbalok@gmail.com · about 18 years ago · 9 messages · general
#1 tosbalok@gmail.com
tosbalok@gmail.com

Let me just start off by saying that I *want* to use postgresql.
That's my goal. I do not want to use SQLServer. I'm posting this
message not to slam postgres, but to ask for someone to help me figure
out what I'm doing wrong.

I've used postgres for several years as the backend to web
applications. I'm not entirely new to it, but I've never needed to
bother with performance tuning. Netflix is running a contest in which
they will give you a set of movie ratings and you try to develop an
algorithm to predict future ratings. I've imported this data into a
postgresql database, and the performance is abysmal. Here's my setup:

Core 2 Quad with 4GB RAM and two SATAII hard drives in RAID 0

I install debian linux and postgresql 8.1 (this is the latest version
that is available through the debian package manager). I import the
Netflix data into a table with the following characteristics:

Create table rating (movieid int, userid int, rating int4, ratingdate
date)

There are 180 million rows. I've done the import 3 times; it takes on
average 815 seconds. I'm not too worried about that. Now for a test,
I run the query: select count(*) from rating;

This takes an average of 172 seconds. Call it three minutes.

Now, on the same machine, I install windows XP and Microsoft SQL
server 2000. The import took 742 seconds. The count(*) query took 22
seconds.

22 seconds. What's going on?

Another test. In postgres I added an index to the userid column and
then counted distinct userids. The average run time over three
queries was 4666 seconds, or 78 minutes. Unbelievable.

On SQL Server, with *no* index, the same query takes on average 414
seconds, or about 7 minutes. Ten times faster!

I'm sure someone will flame me and say that I'm a worthless noob and
if only I was smart I would know what I'm doing wrong. So let me just
repeat: I want to use postgres. That's my goal. Please tell me what
I can do to make the performance of this acceptable. It's not a
complicated setup. One table. A couple of count(*) queries.

#2 Chris Browne
cbbrowne@acm.org
In reply to: tosbalok@gmail.com (#1)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

"tosbalok@gmail.com" <tosbalok@gmail.com> writes:

> I've imported this data into a postgresql database, and the
> performance is abysmal. Here's my setup:
>
> Core 2 Quad with 4GB RAM and two SATAII hard drives in RAID 0
>
> Create table rating (movieid int, userid int, rating int4, ratingdate
> date)
>
> There are 180 million rows. [...] Now for a test,
> I run the query, select count(*) from rating;
>
> This takes an average of 172 seconds. Call it three minutes.
>
> Now, on the same machine, I install windows XP and Microsoft SQL
> server 2000. The import took 742 seconds. The count(*) query took 22
> seconds.

This is a well-understood issue...

Some database systems are able to do some optimization where either:
a) They collect some statistics to answer such queries in O(1) time, or
b) They can use index accesses and only look at an index.

The MVCC strategy in PostgreSQL, which allows it to avoid the need for
readers to block writers, and vice-versa, has the result that running
"count(*)" without any WHERE clause requires a scan across the entire
table.
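
You can see this for yourself with EXPLAIN (I'm taking the table name
from your message; the cost numbers will of course differ on your
machine):

EXPLAIN SELECT count(*) FROM rating;

Whatever indexes exist, the plan will show a sequential scan of the
whole table feeding the Aggregate node.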

If the entire purpose of your application is to run COUNT(*) against
the entirety of a table with 180M rows, then PostgreSQL may not be
the right database for your application.

If, on the other hand, this is a poor benchmark of your actual
requirements, then it would be a bad idea to draw any conclusions
based on the performance of "select count(*) from some_table;"
--
(format nil "~S@~S" "cbbrowne" "linuxfinances.info")
http://www3.sympatico.ca/cbbrowne/oses.html
"If a word in the dictionary were misspelled, how would we know?"
-- Steven Wright

#3 David Wilson
david.t.wilson@gmail.com
In reply to: Chris Browne (#2)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

On Mon, Apr 14, 2008 at 1:34 PM, Chris Browne <cbbrowne@acm.org> wrote:

> "tosbalok@gmail.com" <tosbalok@gmail.com> writes:
>
>> Another test. In postgres I added an index to the userid column and
>> then counted distinct userids. The average run time over three
>> queries was 4666 seconds, or 78 minutes. Unbelievable.
>>
>> On SQL Server, with *no* index, the same query takes on average 414
>> seconds, or about 7 minutes. Ten times faster!

First, in general- use the EXPLAIN and EXPLAIN ANALYZE feature of
postgresql. That will tell you a lot about what your queries are doing
and why they're taking so long.

Second, make sure you've ANALYZE'd your table after creating it and
the index, which gives the planner the statistics necessary to make
intelligent choices.

For instance, your count of distinct userids is probably not using the
index you just created. If it still isn't using it after you ANALYZE
the table, try rewriting the query using group by (select count(*)
from (select userid from mytable group by userid) tmp). I recently had
a similar performance issue on a 75m row table, and the above helped.

VACUUM ANALYZE tables, and then remember that EXPLAIN and EXPLAIN
ANALYZE are your best friends.
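
Against your rating table, that sequence would look roughly like this
(a sketch only; I'm reusing the table and column names from your first
message):

ANALYZE rating;

EXPLAIN ANALYZE SELECT count(DISTINCT userid) FROM rating;

-- if the plan is still a plain sequential scan plus sort, try:
SELECT count(*)
FROM (SELECT userid FROM rating GROUP BY userid) AS tmp;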

--
- David T. Wilson
david.t.wilson@gmail.com

#4 Ivan Sergio Borgonovo
mail@webthatworks.it
In reply to: Chris Browne (#2)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

On Mon, 14 Apr 2008 13:34:36 -0400
Chris Browne <cbbrowne@acm.org> wrote:

>> Another test. In postgres I added an index to the userid column
>> and then counted distinct userids. The average run time over
>> three queries was 4666 seconds, or 78 minutes. Unbelievable.
>>
>> On SQL Server, with *no* index, the same query takes on average
>> 414 seconds, or about 7 minutes. Ten times faster!
>
> Some database systems are able to do some optimization where either:
> a) They collect some statistics to answer such queries in O(1) time, or
> b) They can use index accesses and only look at an index.
>
> The MVCC strategy in PostgreSQL, which allows it to avoid the need
> for readers to block writers, and vice-versa, has the result that
> running "count(*)" without any WHERE clause requires a scan across
> the entire table.
>
> If the entire purpose of your application is to run COUNT(*) against
> the entirety of a table with 180M rows, then PostgreSQL may not be
> the right database for your application.
>
> If, on the other hand, this is a poor benchmark of your actual
> requirements, then it would be a bad idea to draw any conclusions
> based on the performance of "select count(*) from some_table;"

But why, once you add the index and count distinct values, is the
performance still so far apart?
I'd say that the counting is not the hardest thing to do in this case,
but rather the "distinct" part.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

#5 Andreas 'ads' Scherbaum
adsmail@wars-nicht.de
In reply to: Chris Browne (#2)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

On Mon, 14 Apr 2008 13:34:36 -0400 Chris Browne wrote:

> If the entire purpose of your application is to run COUNT(*) against
> the entirety of a table with 180M rows, then PostgreSQL may not be
> the right database for your application.

For running a COUNT(*) without a WHERE clause you could even add a trigger
and a second table that holds the current row count. With this simple
workaround you have the answer to the COUNT(*) question in almost no time.
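
A minimal sketch of that workaround (untested, and the table and trigger
names are made up; also note that every writer updates the same counter
row, so under heavy concurrent inserts that row becomes a contention
point):

CREATE TABLE rating_count (n bigint NOT NULL);
INSERT INTO rating_count VALUES (0);

CREATE FUNCTION rating_count_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE rating_count SET n = n + 1;   -- one more row
    ELSE
        UPDATE rating_count SET n = n - 1;   -- DELETE: one row fewer
    END IF;
    RETURN NULL;  -- return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_count_t AFTER INSERT OR DELETE ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_count_trig();

Then instead of SELECT count(*) FROM rating you read the counter:

SELECT n FROM rating_count;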

Getting results for a COUNT(*) with WHERE is more complicated.

Kind regards

--
Andreas 'ads' Scherbaum
German PostgreSQL User Group

#6 Gregory Stark
stark@enterprisedb.com
In reply to: Ivan Sergio Borgonovo (#4)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

"Ivan Sergio Borgonovo" <mail@webthatworks.it> writes:

> But why, once you add the index and count distinct values, is the
> performance still so far apart?
> I'd say that the counting is not the hardest thing to do in this case,
> but rather the "distinct" part.

Your tests have been a bit unfortunate in finding a few particularly soft bits
in Postgres's underbelly.

a) Postgres can't do "index-only" scans, so the index doesn't really help in
   this case. There's some discussion about improving that soon, but it's
   still early.

b) Postgres doesn't know how to pick just distinct values out of an index;
   it has to find a start point and read all the records from that point
   forward. So even if this is a very low cardinality column it would have
   to read the whole index.

c) DISTINCT is one of the older features in Postgres, and more recent
   planner code is a lot smarter. In particular the DISTINCT path doesn't
   know about using hash tables to find distinct values. You can work
   around this by rewriting the query with GROUP BY, which goes through a
   more recent code path.

d) The default settings in an untuned postgresql.conf are quite
   conservative. That alone often makes Postgres look poor when compared
   against other databases. In particular it is unlikely to pick hash
   tables or do in-memory sorts unless you raise work_mem.

This is what you're doing now:

postgres=# explain select count(distinct aid) from accounts;
                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=2838.00..2838.01 rows=1 width=4)
   ->  Seq Scan on accounts  (cost=0.00..2588.00 rows=100000 width=4)
(2 rows)

This is what you're expecting it to do (on a freshly built clustered index it
might perform ok but on a non-clustered index it will be terrible):

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Aggregate  (cost=5475.26..5475.27 rows=1 width=0)
   ->  Group  (cost=0.00..4225.26 rows=100000 width=4)
         ->  Index Scan using accounts_pkey on accounts  (cost=0.00..3975.26 rows=100000 width=4)
(3 rows)

postgres=# set work_mem = '128M';
ERROR: invalid value for parameter "work_mem": "128M"
HINT: Valid units for this parameter are "kB", "MB", and "GB".

(ARGH! DAMN YOU PETER!!!!)

This is probably the best you can get Postgres to do currently:

postgres=# set work_mem = '128MB';
SET

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=5088.00..5088.01 rows=1 width=0)
   ->  HashAggregate  (cost=2838.00..3838.00 rows=100000 width=4)
         ->  Seq Scan on accounts  (cost=0.00..2588.00 rows=100000 width=4)
(3 rows)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#7 Ivan Sergio Borgonovo
mail@webthatworks.it
In reply to: Gregory Stark (#6)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

On Mon, 14 Apr 2008 21:30:15 +0100
Gregory Stark <stark@enterprisedb.com> wrote:

> "Ivan Sergio Borgonovo" <mail@webthatworks.it> writes:
>
>> But why, once you add the index and count distinct values, is the
>> performance still so far apart?
>> I'd say that the counting is not the hardest thing to do in this case,
>> but rather the "distinct" part.
>
> Your tests have been a bit unfortunate in finding a few
> particularly soft bits in Postgres's underbelly.

While I digest the interesting things you wrote... are there any
comparative benchmarks on count(*) with WHERE clauses?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

#8 tosbalok@gmail.com
tosbalok@gmail.com
In reply to: tosbalok@gmail.com (#1)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

On Apr 14, 12:34 pm, Chris Browne <cbbro...@acm.org> wrote:

> If the entire purpose of your application is to run COUNT(*)

haha. no. I see your point.

I'll be doing statistical functions on group-bys. So I'll go back
and give it another try with queries like those. And I'll use some of
the techniques that other people here have suggested. I hope this
works!
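
For the record, the kind of group-by query I mean is something like
this (just a rough sketch against my rating table):

SELECT movieid,
       count(*) AS num_ratings,
       avg(rating) AS mean_rating,
       stddev(rating) AS rating_stddev
FROM rating
GROUP BY movieid;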

thanks.

#9 tosbalok@gmail.com
tosbalok@gmail.com
In reply to: tosbalok@gmail.com (#1)
Re: Unacceptable postgres performance vs. Microsoft sqlserver

On Apr 14, 2:17 pm, david.t.wil...@gmail.com ("David Wilson") wrote:

> For instance, your count of distinct userids is probably not using the
> index you just created. If it still isn't using it after you ANALYZE
> the table, try rewriting the query using group by (select count(*)
> from (select userid from mytable group by userid) tmp). I recently had
> a similar performance issue on a 75m row table, and the above helped.

thanks. There's a lot of good info in your post. I'll give it a try.