postgresql scalability issue

Started by umut orhan · over 15 years ago · 9 messages · general
#1 umut orhan
umut_angelfire@yahoo.com

Hi all,

I've collected some interesting results during my experiments which I couldn't
figure out the reason behind them and need your assistance.

I'm running PostgreSQL 9.0 on a quad-core machine with a two-level on-chip
cache hierarchy. PostgreSQL has a large, warmed-up buffer cache, so no disk
I/O is observed during the experiments (i.e. the buffer cache hit rate is 100%
for each query). I'm pinning each query/process to an individual core. The
queries are simple read-only SELECTs. Nested loop (without materialize) is
used for the join operator.

When I pin a single query to an individual core, its execution time is 111
seconds. This is my base case. When I fire two instances of the same query
concurrently and pin them to two different cores, each one's execution time
rises to 132 seconds. The trend continues for three instances (164 seconds)
and four instances (201 seconds). What I was expecting was (at least) a linear
improvement in throughput. I tried several different queries and saw the same
trend each time.

I wonder why the execution time of an individual query increases as I add
more concurrent instances of it.

Btw, I don't think on-chip cache hit/miss rates make a difference, since L2
cache misses decrease as expected. I'm not an expert in PostgreSQL internals;
maybe there is lock contention (spinlocks?) occurring even though the queries
are read-only. Anyway, all ideas are welcome.

Thanks in advance,
Regards,
Umut

#2 Scott Marlowe
scott.marlowe@gmail.com
In reply to: umut orhan (#1)
Re: postgresql scalability issue

On Mon, Nov 8, 2010 at 8:33 AM, umut orhan <umut_angelfire@yahoo.com> wrote:

Hi all,
I've collected some interesting results during my experiments which I
couldn't figure out the reason behind them and need your assistance.
I'm running PostgreSQL 9.0 on a quad-core machine having two level on-chip
cache hierarchy. PostgreSQL has a large and warmed-up buffer
cache thus, no disk I/O is observed during experiments (i.e. for each query
buffer cache hit rate is 100%). I'm pinning each query/process to an
individual core. Queries are simple read-only queries (only selects). Nested
loop (without materialize) is used for the join operator.
When I pin a single query to an individual core, its execution time is
observed as 111 seconds. This result is my base case. Then, I fire two
instances of the same query concurrently and pin them to two different cores
separately. However, each execution time becomes 132 seconds in this case.
In a similar trend, execution times are increasing for three instances (164
seconds) and four instances (201 seconds) cases too. What I was expecting is
a linear improvement in throughput (at least). I tried several different
queries and got the same trend at each time.
I wonder why execution times of individual queries are increasing when I
increase the number of their instances.
Btw, I don't think on-chip cache hit/miss rates make a difference since L2
cache misses are decreasing as expected. I'm not an expert in PostgreSQL
internals. Maybe there is a lock-contention (spinlocks?) occurring even if
the queries are read-only. Anyways, all ideas are welcome.

My guess would be that it's memory contention. What architecture is your
quad-core CPU?

#3 John R Pierce
pierce@hogranch.com
In reply to: umut orhan (#1)
Re: postgresql scalability issue

On 11/08/10 7:33 AM, umut orhan wrote:

Hi all,

I've collected some interesting results during my experiments which I
couldn't figure out the reason behind them and need your assistance.

I'm running PostgreSQL 9.0 on a quad-core machine having two level
on-chip cache hierarchy. PostgreSQL has a large and warmed-up buffer
cache thus, no disk I/O is observed during experiments (i.e. for each
query buffer cache hit rate is 100%). I'm pinning each query/process
to an individual core. Queries are simple read-only queries (only
selects). Nested loop (without materialize) is used for the join operator.
....

Did pinning the processes to CPU cores make any measurable difference?

#4 Matthieu Huin
matthieu.huin@wallix.com
In reply to: Scott Marlowe (#2)
temporary table as a subset of an existing table and indexes

Greetings all,

I am trying to optimize SELECT queries on a large table (10M rows and
more) by using temporary tables that are subsets of my main table, thus
narrowing the search space to a more manageable size.
Is it possible to transfer indices (or at least use the information from
existing indices) from the big table to its subset in a reasonable amount of
time?

When I try:

CREATE TEMPORARY TABLE tmp AS
SELECT * FROM big_table WHERE condition;

The table creation is fast (a few seconds), as there are indices on the big
table that are optimized for the condition, but then indexing the new data is
rather costly (the new table would have around 100k rows) and takes a few
minutes to complete. This is not acceptable, as the whole point of the
process is to reduce query time.

I get even worse results with the following transaction :

CREATE TEMPORARY TABLE tmp ( LIKE big_table INCLUDING INDEXES );
INSERT INTO tmp SELECT * FROM big_table WHERE condition;

Also, partitioning my big table from the very beginning is not an option, as
it doesn't guarantee index key uniqueness (according to
http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html ).

Any suggestions on this ?

Kind regards,

Matthieu Huin
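
A minimal sketch of the approach under discussion: rather than copying every
index with INCLUDING INDEXES, build only the one index the follow-up queries
actually need, then ANALYZE the new table. Here big_table, condition, and col
are placeholders from this thread, not the real schema.

```sql
-- Sketch only: big_table, condition, and col are illustrative placeholders.
CREATE TEMPORARY TABLE tmp AS
  SELECT * FROM big_table WHERE condition;

-- One targeted index instead of recreating all of big_table's indexes.
CREATE INDEX tmp_col_idx ON tmp (col);

-- Autovacuum cannot process temporary tables, so give the planner
-- statistics explicitly.
ANALYZE tmp;
```

Creating a single index over roughly 100k rows is usually far cheaper than
recreating every index from the big table, and the explicit ANALYZE matters
because temp tables otherwise have no statistics for the planner to use.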

#5 Merlin Moncure
mmoncure@gmail.com
In reply to: Matthieu Huin (#4)
Re: temporary table as a subset of an existing table and indexes

On Mon, Nov 8, 2010 at 12:15 PM, Matthieu Huin <matthieu.huin@wallix.com> wrote:

Greetings all,

I am trying to optimize SELECT queries on a large table (10M rows and more)
by using temporary tables that are subsets of my main table, thus narrowing
the search space to a more manageable size.
Is it possible to transfer indices (or at least use the information from
existing indices) from the big table to its subset in a reasonable amount of
time ?

Are you sure that the benefit of creating scratch tables is worth the
overhead? Can you give explain/analyze of the query you are trying
to optimize?

merlin

#6 Matthieu Huin
matthieu.huin@wallix.com
In reply to: Merlin Moncure (#5)
Re: temporary table as a subset of an existing table and indexes

Hello Merlin,

So far the improvement in responsiveness has been very noticeable, even
without indexing the temporary tables. Of course, this is just trading
accuracy for speed, since I am simply narrowing the search space arbitrarily...

The schema I am working on is close to the one I am referencing in this
thread : http://archives.postgresql.org/pgsql-general/2010-11/msg00272.php

Since we want to implement full text search and tags querying, it can
lead to rather complex autogenerated queries such as this one (find log
lines with the word 'root' in it, dated from 11/04 to 11/06, where the
'program' tag is sshd and the 'severity_code' tag is less than 3) :

EXPLAIN ANALYZE SELECT r.logid FROM (((SELECT lcond84.logid,
lcond84.date FROM
( SELECT tmp84.logid, tmp84.date FROM logs tmp84
WHERE TRUE AND ( to_tsvector('simple',tmp84.body) @@
plainto_tsquery('simple','root') ) AND tmp84.date > '2010-11-04
10:22:06.26' AND tmp84.date < '2010-11-06 10:22:06.26' ) AS lcond84
NATURAL JOIN
( SELECT tmp85.logid FROM tags tmp85 WHERE
FALSE
OR (tmp85.name='severity_code' AND num_lt(tmp85.value, 3) )
OR (tmp85.name='program' AND tmp85.value = CAST(ROW('sshd') AS tagvalue) )
GROUP BY tmp85.logid HAVING COUNT(tmp85.logid) = 2 ) AS tcond84 )

)) AS r ORDER BY r.date DESC LIMIT 1000;

Giving the following query plan :

Limit (cost=765445.54..765445.56 rows=9 width=16) (actual
time=34744.257..34744.257 rows=0 loops=1)
-> Sort (cost=765445.54..765445.56 rows=9 width=16) (actual
time=34744.255..34744.255 rows=0 loops=1)
Sort Key: tmp84.date
Sort Method: quicksort Memory: 17kB
-> Hash Join (cost=765005.46..765445.40 rows=9 width=16)
(actual time=34744.202..34744.202 rows=0 loops=1)
Hash Cond: (tmp85.logid = tmp84.logid)
-> HashAggregate (cost=758440.29..758669.77 rows=15299
width=8) (actual time=33343.816..33343.816 rows=0 loops=1)
Filter: (count(tmp85.logid) = 2)
-> Bitmap Heap Scan on tags tmp85
(cost=92363.26..757225.45 rows=242968 width=8) (actual
time=20676.354..33294.252 rows=32864 loops=1)
Recheck Cond: ((name =
'severity_code'::text) OR (name = 'program'::text))
Filter: (((name = 'severity_code'::text) AND
num_lt(value, 3::double precision)) OR ((name = 'program'::text) AND
((value).storedvalue = 'sshd'::text)))
-> BitmapOr (cost=92363.26..92363.26
rows=2148625 width=0) (actual time=19688.915..19688.915 rows=0 loops=1)
-> Bitmap Index Scan on nameval_idx
(cost=0.00..30388.35 rows=707841 width=0) (actual
time=19337.358..19337.358 rows=708719 loops=1)
Index Cond: (name =
'severity_code'::text)
-> Bitmap Index Scan on nameval_idx
(cost=0.00..61853.42 rows=1440784 width=0) (actual time=351.551..351.551
rows=1484703 loops=1)
Index Cond: (name = 'program'::text)
-> Hash (cost=6553.06..6553.06 rows=969 width=16)
(actual time=1400.378..1400.378 rows=32516 loops=1)
-> Bitmap Heap Scan on logs tmp84
(cost=215.51..6553.06 rows=969 width=16) (actual time=1104.226..1383.745
rows=32516 loops=1)
Recheck Cond:
(to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
Filter: ((date > '2010-11-04
10:22:06.26+01'::timestamp with time zone) AND (date < '2010-11-06
10:22:06.26+01'::timestamp with time zone))
-> Bitmap Index Scan on fulltext_body_idx
(cost=0.00..215.27 rows=1740 width=0) (actual time=1097.874..1097.874
rows=64340 loops=1)
Index Cond:
(to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
Total runtime: 34756.938 ms

This one isn't too bad, but the runtime seems to increase exponentially with
the table size. Using a temporary table based on the date condition can
therefore cut the query time by a factor of up to ten (table creation
included, provided the resulting table isn't too big; I run a COUNT check
before creation so I can manually cap the table size if needed). But of
course, I'd rather have speed AND accuracy...

To make things worse, the tables tend to grow very quickly since, as you
might have guessed, I am working on the database part of a log collector; the
current implementation doesn't scale well with the data.

I hope this makes things clearer. Feel free to ask if you need more
clarifications, and thanks for your time.

Matthieu
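
The COUNT guard mentioned above might look roughly like this. The date window
and the cap are illustrative placeholders, and the decision itself has to live
in the application, since plain SQL has no conditional DDL:

```sql
-- Hypothetical sketch of the pre-creation size check; values are illustrative.
-- First, count the candidate subset:
SELECT COUNT(*) FROM logs
WHERE date > '2010-11-04 10:22:06.26'
  AND date < '2010-11-06 10:22:06.26';

-- If the application finds the count acceptable (say, under 100000),
-- materialize the subset and rewrite the query against it:
CREATE TEMPORARY TABLE temp_logs AS
  SELECT * FROM logs
  WHERE date > '2010-11-04 10:22:06.26'
    AND date < '2010-11-06 10:22:06.26';
```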

On 08/11/2010 22:26, Merlin Moncure wrote:


On Mon, Nov 8, 2010 at 12:15 PM, Matthieu Huin<matthieu.huin@wallix.com> wrote:

Greetings all,

I am trying to optimize SELECT queries on a large table (10M rows and more)
by using temporary tables that are subsets of my main table, thus narrowing
the search space to a more manageable size.
Is it possible to transfer indices (or at least use the information from
existing indices) from the big table to its subset in a reasonable amount of
time ?

Are you sure that the benefit of creating scratch tables is worth the
overhead? Can you give explain/analyze of the query you are trying
to optimize?

merlin

#7 Ivan Voras
ivoras@freebsd.org
In reply to: umut orhan (#1)
Re: postgresql scalability issue

On 11/08/10 16:33, umut orhan wrote:

Hi all,

I've collected some interesting results during my experiments which I couldn't
figure out the reason behind them and need your assistance.

I'm running PostgreSQL 9.0 on a quad-core machine having two level on-chip cache
hierarchy.

Based on this information, you are most likely running on an Intel Xeon
5000-5400 series CPU, right? It probably doesn't matter much, since you
apparently have only a single socket populated, but it is a somewhat old
architecture known for its FSB bottleneck to memory.

You should post some details about your hardware (at least CPU model/speed
and memory speed) and software (which OS? 32-bit or 64-bit?).

PostgreSQL has a large and warmed-up buffer
cache thus, no disk I/O is observed during experiments (i.e. for each query
buffer cache hit rate is 100%). I'm pinning each query/process to an individual
core. Queries are simple read-only queries (only selects). Nested loop (without
materialize) is used for the join operator.

When I pin a single query to an individual core, its execution time is observed
as 111 seconds. This result is my base case. Then, I fire two instances of the
same query concurrently and pin them to two different cores separately. However,
each execution time becomes 132 seconds in this case. In a similar trend,
execution times are increasing for three instances (164 seconds) and four
instances (201 seconds) cases too. What I was expecting is a linear improvement
in throughput (at least). I tried several different queries and got the same
trend at each time.

Are you measuring wall-clock execution time for the queries in parallel,
i.e. starting the clock when the first query is started (asynchronously?)
and stopping it when the last one finishes?

Did you try the same measurement without pinning?

I wonder why execution times of individual queries are increasing when I
increase the number of their instances.

Btw, I don't think on-chip cache hit/miss rates make a difference since L2
cache misses are decreasing as expected. I'm not an expert in PostgreSQL
internals. Maybe there is a lock-contention (spinlocks?) occurring even if the
queries are read-only. Anyways, all ideas are welcome.

As others said, memory bandwidth is the most likely suspect here. CPUs
are unfortunately so much faster than memory and memory buses that they
frequently have to wait. Unless PostgreSQL uses the exclusive lock model
instead of shared-exclusive, there shouldn't be much contention for the
shared buffers.

#8 Merlin Moncure
mmoncure@gmail.com
In reply to: Matthieu Huin (#6)
Re: temporary table as a subset of an existing table and indexes

On Tue, Nov 9, 2010 at 4:47 AM, Matthieu Huin <matthieu.huin@wallix.com> wrote:

Hello Merlin,

So far the improvement in responsiveness has been very noticeable, even
without indexing the temporary tables. Of course, this is just trading
accuracy for speed as I simply narrow arbitrarily the search space ...

The schema I am working on is close to the one I am referencing in this
thread : http://archives.postgresql.org/pgsql-general/2010-11/msg00272.php

Since we want to implement full text search and tags querying, it can lead
to rather complex autogenerated queries such as this one (find log lines
with the word 'root' in it, dated from 11/04 to 11/06, where the 'program'
tag is sshd and the 'severity_code' tag is less than 3) :

EXPLAIN ANALYZE SELECT r.logid FROM (((SELECT lcond84.logid, lcond84.date
FROM
( SELECT tmp84.logid, tmp84.date FROM logs tmp84
 WHERE TRUE  AND ( to_tsvector('simple',tmp84.body) @@
plainto_tsquery('simple','root') ) AND tmp84.date > '2010-11-04 10:22:06.26'
AND tmp84.date < '2010-11-06 10:22:06.26' ) AS lcond84
NATURAL JOIN
( SELECT tmp85.logid FROM tags tmp85 WHERE
 FALSE
OR (tmp85.name='severity_code' AND num_lt(tmp85.value, 3) )
OR (tmp85.name='program' AND tmp85.value = CAST(ROW('sshd') AS tagvalue) )
 GROUP BY tmp85.logid HAVING COUNT(tmp85.logid) = 2 ) AS tcond84  )

)) AS r ORDER BY r.date DESC LIMIT 1000;

Giving the following query plan :

 Limit  (cost=765445.54..765445.56 rows=9 width=16) (actual
time=34744.257..34744.257 rows=0 loops=1)
  ->  Sort  (cost=765445.54..765445.56 rows=9 width=16) (actual
time=34744.255..34744.255 rows=0 loops=1)
        Sort Key: tmp84.date
        Sort Method:  quicksort  Memory: 17kB
        ->  Hash Join  (cost=765005.46..765445.40 rows=9 width=16) (actual
time=34744.202..34744.202 rows=0 loops=1)
              Hash Cond: (tmp85.logid = tmp84.logid)
              ->  HashAggregate  (cost=758440.29..758669.77 rows=15299
width=8) (actual time=33343.816..33343.816 rows=0 loops=1)
                    Filter: (count(tmp85.logid) = 2)
                    ->  Bitmap Heap Scan on tags tmp85
 (cost=92363.26..757225.45 rows=242968 width=8) (actual
time=20676.354..33294.252 rows=32864 loops=1)
                          Recheck Cond: ((name = 'severity_code'::text) OR
(name = 'program'::text))
                          Filter: (((name = 'severity_code'::text) AND
num_lt(value, 3::double precision)) OR ((name = 'program'::text) AND
((value).storedvalue = 'sshd'::text)))
                          ->  BitmapOr  (cost=92363.26..92363.26
rows=2148625 width=0) (actual time=19688.915..19688.915 rows=0 loops=1)
                                ->  Bitmap Index Scan on nameval_idx
 (cost=0.00..30388.35 rows=707841 width=0) (actual time=19337.358..19337.358
rows=708719 loops=1)
                                      Index Cond: (name =
'severity_code'::text)
                                ->  Bitmap Index Scan on nameval_idx
 (cost=0.00..61853.42 rows=1440784 width=0) (actual time=351.551..351.551
rows=1484703 loops=1)
                                      Index Cond: (name = 'program'::text)
              ->  Hash  (cost=6553.06..6553.06 rows=969 width=16) (actual
time=1400.378..1400.378 rows=32516 loops=1)
                    ->  Bitmap Heap Scan on logs tmp84
 (cost=215.51..6553.06 rows=969 width=16) (actual time=1104.226..1383.745
rows=32516 loops=1)
                          Recheck Cond: (to_tsvector('simple'::regconfig,
body) @@ '''root'''::tsquery)
                          Filter: ((date > '2010-11-04
10:22:06.26+01'::timestamp with time zone) AND (date < '2010-11-06
10:22:06.26+01'::timestamp with time zone))
                          ->  Bitmap Index Scan on fulltext_body_idx
 (cost=0.00..215.27 rows=1740 width=0) (actual time=1097.874..1097.874
rows=64340 loops=1)
                                Index Cond:
(to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
 Total runtime: 34756.938 ms

This one isn't too bad, but the runtime seems to increase exponentially with
the tables size. Therefore, using a temporary table based on the date
condition can cut the query time by a factor of up to ten (table creation
included, and provided the resulting table isn't too big - I make a COUNT
check prior to creation so that I will eventually limit manually the table
size.). But of course, I'd rather have speed AND accuracy ...

To make things worse, the tables tend to grow very quickly since as you
might have guessed, I am working on the database part of a logs collector;
the current implementation doesn't scale well along the data.

I hope this makes things clearer. Feel free to ask if you need more
clarifications, and thanks for your time.

How are you partitioning the tags? Is the partitioned query doing the
same job as the non partitioned query? Is date a forced criteria?
(and if it is, have you considered date partition/brute force?)

merlin

#9 Matthieu Huin
matthieu.huin@wallix.com
In reply to: Merlin Moncure (#8)
Re: temporary table as a subset of an existing table and indexes

Basically, I take the same query as above and replace all occurrences of the
tables logs and tags with temp_logs and temp_tags, created as follows:

CREATE TEMPORARY TABLE temp_logs ON COMMIT DROP AS
SELECT * FROM logs WHERE condition ORDER BY date DESC LIMIT max_size;

CREATE TEMPORARY TABLE temp_tags ON COMMIT DROP AS
SELECT * FROM tags WHERE logid IN (SELECT logid FROM temp_logs);

With condition usually defining a date window. As we are experimenting with
this approach, date has become a forced criterion. I have experimented with
partitioning, but it led to the logid primary key no longer being unique,
which was a problem when joining data with the tags table.

So the queries are pretty much the same, the boost in speed being simply
due to the limitation of the search space.
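
For reference, a rough sketch of what inheritance-based partitioning looks
like in 9.0, and why it loses global uniqueness of logid. The child-table
layout here is illustrative, not the schema actually tried:

```sql
-- Illustrative only: child tables inherit from logs and carry CHECK
-- constraints so constraint_exclusion can skip irrelevant partitions.
CREATE TABLE logs_2010_11 (
    CHECK (date >= '2010-11-01' AND date < '2010-12-01')
) INHERITS (logs);

-- A PRIMARY KEY here is enforced per child table only: nothing stops the
-- same logid from appearing in two different partitions, which is the
-- uniqueness problem mentioned above when joining against tags.
ALTER TABLE logs_2010_11 ADD PRIMARY KEY (logid);
```

If logid is drawn from a single sequence, duplicates across partitions cannot
arise in practice even though nothing enforces them, which may make the join
against tags safe enough despite the missing global constraint.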


How are you partitioning the tags? Is the partitioned query doing the
same job as the non partitioned query? Is date a forced criteria?
(and if it is, have you considered date partition/brute force?)

merlin