Parallel Select query performance and shared buffers

Started by Metin Doslu about 12 years ago · 27 messages
#1 Metin Doslu
metin@citusdata.com

We have several independent tables on a multi-core machine serving Select
queries. These tables fit into memory, and each Select query goes over
one table's pages sequentially. In this experiment, there are no indexes or
table joins.

When we send concurrent Select queries to these tables, query performance
doesn't scale out with the number of CPU cores. We find that complex Select
queries scale out better than simpler ones. We also find that increasing
the block size from 8 KB to 32 KB, or increasing shared_buffers to include
the working set mitigates the problem to some extent.

For our experiments, we chose an 8-core machine with 68 GB of memory from
Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and
set shared_buffers to 4 GB.
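
For reference, these were the only non-default settings; a sketch of how
they were applied is below (the PGDATA path is a placeholder, and work_mem
is the value mentioned later in this thread):

# sketch: apply the only non-default settings and restart
cat >> "$PGDATA/postgresql.conf" <<'EOF'
shared_buffers = 4GB
work_mem = 32MB
EOF
pg_ctl -D "$PGDATA" restart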

We then generated 1, 2, 4, and 8 separate tables using the data generator
from the industry standard TPC-H benchmark. Each table we generated, called
lineitem-1, lineitem-2, etc., had about 750 MB of data. Next, we sent 1, 2,
4, and 8 concurrent Select queries to these tables to observe the scale out
behavior. Our expectation was that, since this machine had 8 cores, our run
times would stay constant throughout. We also expected the machine's CPU
utilization to go up to 100% at 8 concurrent queries. Neither of those
assumptions held true.
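
The concurrent runs can be driven with a small script along these lines
(a sketch; the database name and psql options are illustrative):

# fire N identical Select queries, one per table, in parallel and time the batch
N=8
time (
    for i in $(seq 1 $N); do
        psql -d postgres -c "SELECT count(*) FROM lineitem_$i;" >/dev/null &
    done
    wait
)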

We found that query run times degraded as we increased the number of
concurrent Select queries. Also, CPU utilization flattened out at less than
50% for the simpler queries. Full results with a block size of 8 KB are below:

                      select count(*)   TPC-H Simple (#6) [2]   TPC-H Complex (#1) [1]
1 Table  / 1 query    1.5 s             2.5 s                   8.4 s
2 Tables / 2 queries  1.5 s             2.5 s                   8.4 s
4 Tables / 4 queries  2.0 s             2.9 s                   8.8 s
8 Tables / 8 queries  3.3 s             4.0 s                   9.6 s

We then increased the block size (BLCKSZ) from 8 KB to 32 KB and recompiled
PostgreSQL. This change had a positive impact on query completion times.
Here are the new results with a block size of 32 KB:

                      select count(*)   TPC-H Simple (#6) [2]   TPC-H Complex (#1) [1]
1 Table  / 1 query    1.5 s             2.3 s                   8.0 s
2 Tables / 2 queries  1.5 s             2.3 s                   8.0 s
4 Tables / 4 queries  1.6 s             2.4 s                   8.1 s
8 Tables / 8 queries  1.8 s             2.7 s                   8.3 s
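
(For reference, a rebuild with 32 KB blocks can be done with configure's
block-size switch rather than editing BLCKSZ by hand; the sketch below uses
an illustrative install prefix and data directory, and a fresh initdb is
needed because the block size is baked into the cluster.)

# rebuild PostgreSQL with 32 KB blocks (--with-blocksize takes kilobytes)
./configure --with-blocksize=32 --prefix=/usr/local/pgsql-32k
make -j8 && make install
# the on-disk format changes, so a new cluster must be initialized
/usr/local/pgsql-32k/bin/initdb -D /data/pg32k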

As a quick aside, we also repeated the same experiment on an EC2 instance
with 16 CPU cores, and found that the scale-out behavior became worse
there. (We also tried increasing shared_buffers to 30 GB. This change
completely solved the scale-out problem on that instance type, but hurt
our performance on the hi1.4xlarge instances.)

Unfortunately, increasing the block size from 8 to 32 KB has other
implications for some of our customers. Could you help us out with the
problem here?

What can we do to identify the problem's root cause? Can we work around it?

Thank you,
Metin

[1]: http://examples.citusdata.com/tpch_queries.html#query-1
[2]: http://examples.citusdata.com/tpch_queries.html#query-6

#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Metin Doslu (#1)
Re: Parallel Select query performance and shared buffers

On Tue, Dec 3, 2013 at 7:11 PM, Metin Doslu <metin@citusdata.com> wrote:

We have several independent tables on a multi-core machine serving Select
queries. These tables fit into memory, and each Select query goes over one
table's pages sequentially. In this experiment, there are no indexes or
table joins.

When we send concurrent Select queries to these tables, query performance
doesn't scale out with the number of CPU cores. We find that complex Select
queries scale out better than simpler ones. We also find that increasing the
block size from 8 KB to 32 KB, or increasing shared_buffers to include the
working set mitigates the problem to some extent.

For our experiments, we chose an 8-core machine with 68 GB of memory from
Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and set
shared_buffers to 4 GB.

We then generated 1, 2, 4, and 8 separate tables using the data generator
from the industry standard TPC-H benchmark. Each table we generated, called
lineitem-1, lineitem-2, etc., had about 750 MB of data.

I think all of this data cannot fit in shared_buffers; you might want to
increase shared_buffers to a larger size (not 30 GB, but close to your data
size) to see how it behaves.

Next, we sent 1, 2,
4, and 8 concurrent Select queries to these tables to observe the scale out
behavior. Our expectation was that since this machine had 8 cores, our run
times would stay constant all throughout. Also, we would have expected the
machine's CPU utilization to go up to 100% at 8 concurrent queries. Neither
of those assumptions held true.

Your queries have aggregation and ORDER/GROUP BY, so there is a chance
that I/O can happen for those operations if PG doesn't have sufficient
memory (work_mem) to perform them.

As a quick aside, we also repeated the same experiment on an EC2 instance
with 16 CPU cores, and found that the scale out behavior became worse there.
(We also tried increasing the shared_buffers to 30 GB. This change
completely solved the scaling out problem on this instance type, but hurt
our performance on the hi1.4xlarge instances.)

Instead of 30 GB, you can try a smaller value, but it should be close
to your data size.

Unfortunately, increasing the block size from 8 to 32 KB has other
implications for some of our customers. Could you help us out with the
problem here?

What can we do to identify the problem's root cause? Can we work around it?

I think without finding the real cause, it would be difficult to get a
reasonable workaround. Can you simplify your queries (a simple scan, in
other words no aggregation or other operations) to see how they behave in
your environment? Once you are able to see simple queries scaling as per
your expectation, you can try the complex ones.

Note - post this on pgsql-performance as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#3 Claudio Freire
klaussfreire@gmail.com
In reply to: Amit Kapila (#2)
Re: Parallel Select query performance and shared buffers

On Wed, Dec 4, 2013 at 12:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As a quick aside, we also repeated the same experiment on an EC2 instance
with 16 CPU cores, and found that the scale out behavior became worse there.
(We also tried increasing the shared_buffers to 30 GB. This change
completely solved the scaling out problem on this instance type, but hurt
our performance on the hi1.4xlarge instances.)

Instead of 30GB, you can try with lesser value, but it should be close
to your data size.

The OS cache should have provided a similar function.

In fact, larger shared buffers shouldn't have made a difference if the
main I/O pattern is sequential scans, because they use a ring buffer.

Can we have the explain analyze of those queries, postgres
configuration, perhaps vmstat output during execution?


#4 Metin Doslu
metin@citusdata.com
In reply to: Claudio Freire (#3)
Re: Parallel Select query performance and shared buffers

I think all of this data cannot fit in shared_buffers; you might want to
increase shared_buffers to a larger size (not 30 GB, but close to your data
size) to see how it behaves.

When I use shared_buffers larger than my data size, such as 10 GB, results
scale nearly as expected, at least for this instance type.

Your queries have aggregation and ORDER/GROUP BY, so there is a chance
that I/O can happen for those operations if PG doesn't have sufficient
memory (work_mem) to perform them.

I used a work_mem of 32 MB, which should be enough for these queries. I also
tested with higher values of work_mem, and didn't observe any difference.

Can you simplify your queries (a simple scan, in other words no aggregation
or other operations) to see how they behave in your environment? Once you
are able to see simple queries scaling as per your expectation, you can try
the complex ones.

Actually, we observe the problem when queries get simpler, such as
select count(*). Here is the results table in a more compact format:

                      select count(*)   TPC-H Simple (#6)   TPC-H Complex (#1)
1 Table  / 1 query    1.5 s             2.5 s               8.4 s
2 Tables / 2 queries  1.5 s             2.5 s               8.4 s
4 Tables / 4 queries  2.0 s             2.9 s               8.8 s
8 Tables / 8 queries  3.3 s             4.0 s               9.6 s

Can we have the explain analyze of those queries, postgres
configuration, perhaps vmstat output during execution?

postgres=# explain analyze SELECT count(*) from lineitem_1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=199645.01..199645.02 rows=1 width=0) (actual time=11317.391..11317.393 rows=1 loops=1)
   ->  Seq Scan on lineitem_1  (cost=0.00..184641.81 rows=6001281 width=0) (actual time=0.011..5805.255 rows=6001215 loops=1)
 Total runtime: 11317.440 ms
(3 rows)

postgres=# explain analyze SELECT
postgres-# sum(l_extendedprice * l_discount) as revenue
postgres-# FROM
postgres-# lineitem_1
postgres-# WHERE
postgres-# l_shipdate >= date '1994-01-01'
postgres-# AND l_shipdate < date '1994-01-01' + interval '1' year
postgres-# AND l_discount between 0.06 - 0.01 AND 0.06 + 0.01
postgres-# AND l_quantity < 24;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=260215.36..260215.37 rows=1 width=16) (actual time=1751.775..1751.776 rows=1 loops=1)
   ->  Seq Scan on lineitem_1  (cost=0.00..259657.82 rows=111508 width=16) (actual time=0.031..1630.449 rows=114160 loops=1)
         Filter: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.05::double precision) AND (l_discount <= 0.07::double precision) AND (l_quantity < 24::double precision))
         Rows Removed by Filter: 5887055
 Total runtime: 1751.830 ms
(5 rows)

postgres=# explain analyze SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
FROM
lineitem_1
WHERE
l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=436342.68..436342.69 rows=6 width=36) (actual time=18720.932..18720.936 rows=4 loops=1)
   Sort Key: l_returnflag, l_linestatus
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=436342.49..436342.60 rows=6 width=36) (actual time=18720.887..18720.892 rows=4 loops=1)
         ->  Seq Scan on lineitem_1  (cost=0.00..199645.01 rows=5917437 width=36) (actual time=0.011..6754.619 rows=5916591 loops=1)
               Filter: (l_shipdate <= '1998-09-02 00:00:00'::timestamp without time zone)
               Rows Removed by Filter: 84624
 Total runtime: 18721.021 ms
(8 rows)

Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple
(#6) queries: Although there is no need for I/O, "wa" fluctuates between 0
and 1.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff    cache   si   so    bi    bo    in     cs us sy  id wa st
 0  0      0 30093568  84892 38723896    0    0     0     0    22     14  0  0 100  0  0
 8  1      0 30043056  84892 38723896    0    0     0     0 27080  52708 16 14  70  0  0
 8  1      0 30006600  84892 38723896    0    0     0     0 44952 118286 43 44  12  1  0
 8  0      0 29986264  84900 38723896    0    0     0    20 28043  95934 49 42   8  1  0
 7  0      0 29991976  84900 38723896    0    0     0     0  8308  73641 52 42   6  0  0
 0  0      0 30091828  84900 38723896    0    0     0     0  3996  30978 23 24  53  0  0
 0  0      0 30091968  84900 38723896    0    0     0     0    17     23  0  0 100  0  0

I installed PostgreSQL 9.3.1 from source, and in the postgres configuration
file I only changed shared_buffers (4 GB) and work_mem (32 MB).

#5 Claudio Freire
klaussfreire@gmail.com
In reply to: Metin Doslu (#4)
Re: [HACKERS] Parallel Select query performance and shared buffers

On Wed, Dec 4, 2013 at 9:19 AM, Metin Doslu <metin@citusdata.com> wrote:

Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple
(#6) queries: Although there is no need for I/O, "wa" fluctuates between 0
and 1.

procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 30093568 84892 38723896 0 0 0 0 22 14 0 0 100 0 0
8 1 0 30043056 84892 38723896 0 0 0 0 27080 52708 16 14 70 0 0
8 1 0 30006600 84892 38723896 0 0 0 0 44952 118286 43 44 12 1 0
8 0 0 29986264 84900 38723896 0 0 0 20 28043 95934 49 42 8 1 0
7 0 0 29991976 84900 38723896 0 0 0 0 8308 73641 52 42 6 0 0
0 0 0 30091828 84900 38723896 0 0 0 0 3996 30978 23 24 53 0 0
0 0 0 30091968 84900 38723896 0 0 0 0 17 23 0 0 100 0 0

Notice the huge %sy

What kind of VM are you using? HVM or paravirtual?


#6 Andres Freund
andres@2ndquadrant.com
In reply to: Claudio Freire (#5)
Re: [HACKERS] Parallel Select query performance and shared buffers

On 2013-12-04 14:27:10 -0200, Claudio Freire wrote:

On Wed, Dec 4, 2013 at 9:19 AM, Metin Doslu <metin@citusdata.com> wrote:

Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple
(#6) queries: Although there is no need for I/O, "wa" fluctuates between 0
and 1.

procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 30093568 84892 38723896 0 0 0 0 22 14 0 0 100 0 0
8 1 0 30043056 84892 38723896 0 0 0 0 27080 52708 16 14 70 0 0
8 1 0 30006600 84892 38723896 0 0 0 0 44952 118286 43 44 12 1 0
8 0 0 29986264 84900 38723896 0 0 0 20 28043 95934 49 42 8 1 0
7 0 0 29991976 84900 38723896 0 0 0 0 8308 73641 52 42 6 0 0
0 0 0 30091828 84900 38723896 0 0 0 0 3996 30978 23 24 53 0 0
0 0 0 30091968 84900 38723896 0 0 0 0 17 23 0 0 100 0 0

Notice the huge %sy

My bet is on transparent hugepage defragmentation. Alternatively it's
scheduler overhead, due to superfluous context switches around the buffer
mapping locks.

I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
perf report" run to check what's eating up the time.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#7 Metin Doslu
metin@citusdata.com
In reply to: Claudio Freire (#5)
Re: Parallel Select query performance and shared buffers

Notice the huge %sy
What kind of VM are you using? HVM or paravirtual?

This instance is paravirtual.

#8 Metin Doslu
metin@citusdata.com
In reply to: Andres Freund (#6)
Re: Parallel Select query performance and shared buffers

I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
perf report" run to check what's eating up the time.

Here is one example:

+  38.87%   swapper  [kernel.kallsyms]   [k] hypercall_page
+   9.32%  postgres  [kernel.kallsyms]   [k] hypercall_page
+   6.80%  postgres  [kernel.kallsyms]   [k] xen_set_pte_at
+   5.83%  postgres  [kernel.kallsyms]   [k] copy_user_generic_string
+   2.06%  postgres  [kernel.kallsyms]   [k] file_read_actor
+   1.89%  postgres  postgres            [.] heapgettup_pagemode
+   1.83%  postgres  postgres            [.] hash_search_with_hash_value
+   1.33%  postgres  [kernel.kallsyms]   [k] get_phys_to_machine
+   1.25%  postgres  [kernel.kallsyms]   [k] find_get_page
+   1.00%  postgres  postgres            [.] heapgetpage
+   0.99%  postgres  [kernel.kallsyms]   [k] radix_tree_lookup_element
+   0.98%  postgres  postgres            [.] advance_aggregates
+   0.96%  postgres  postgres            [.] ExecProject
+   0.94%  postgres  postgres            [.] advance_transition_function
+   0.88%  postgres  postgres            [.] ExecScan
+   0.87%  postgres  postgres            [.] HeapTupleSatisfiesMVCC
+   0.86%  postgres  postgres            [.] LWLockAcquire
+   0.82%  postgres  [kernel.kallsyms]   [k] put_page
+   0.82%  postgres  postgres            [.] MemoryContextReset
+   0.80%  postgres  postgres            [.] SeqNext
+   0.78%  postgres  [kernel.kallsyms]   [k] pte_mfn_to_pfn
+   0.69%  postgres  postgres            [.] ExecClearTuple
+   0.57%  postgres  postgres            [.] ExecProcNode
+   0.54%  postgres  postgres            [.] heap_getnext
+   0.53%  postgres  postgres            [.] LWLockRelease
+   0.53%  postgres  postgres            [.] ExecStoreTuple
+   0.51%  postgres  libc-2.12.so        [.] __GI___libc_read
+   0.42%  postgres  [kernel.kallsyms]   [k] xen_spin_lock
+   0.40%  postgres  postgres            [.] ReadBuffer_common
+   0.38%  postgres  [kernel.kallsyms]   [k] __do_fault
+   0.37%  postgres  [kernel.kallsyms]   [k] shmem_fault
+   0.37%  postgres  [kernel.kallsyms]   [k] unmap_single_vma
+   0.35%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+   0.33%  postgres  postgres            [.] StrategyGetBuffer
+   0.33%  postgres  [kernel.kallsyms]   [k] set_page_dirty
+   0.33%  postgres  [kernel.kallsyms]   [k] handle_pte_fault
+   0.33%  postgres  postgres            [.] ExecAgg
+   0.31%  postgres  postgres            [.] XidInMVCCSnapshot
+   0.31%  postgres  [kernel.kallsyms]   [k] __audit_syscall_entry
+   0.31%  postgres  postgres            [.] CheckForSerializableConflictOut
+   0.29%  postgres  [kernel.kallsyms]   [k] handle_mm_fault
+   0.25%  postgres  [kernel.kallsyms]   [k] shmem_getpage_gfp


#9 Andres Freund
andres@2ndquadrant.com
In reply to: Metin Doslu (#8)
Re: Parallel Select query performance and shared buffers

On 2013-12-04 18:43:35 +0200, Metin Doslu wrote:

I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
perf report" run to check what's eating up the time.

Here is one example:

+  38.87%   swapper  [kernel.kallsyms]   [k] hypercall_page
+   9.32%  postgres  [kernel.kallsyms]   [k] hypercall_page
+   6.80%  postgres  [kernel.kallsyms]   [k] xen_set_pte_at

All that time is spent in your virtualization solution. One thing to try
is to look on the host system, sometimes profiles there can be more
meaningful.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#10 Claudio Freire
klaussfreire@gmail.com
In reply to: Andres Freund (#9)
Re: Parallel Select query performance and shared buffers

On Wed, Dec 4, 2013 at 1:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-12-04 18:43:35 +0200, Metin Doslu wrote:

I'd strongly suggest doing a "perf record -g -a <wait a bit, ctrl-c>;
perf report" run to check what's eating up the time.

Here is one example:

+  38.87%   swapper  [kernel.kallsyms]   [k] hypercall_page
+   9.32%  postgres  [kernel.kallsyms]   [k] hypercall_page
+   6.80%  postgres  [kernel.kallsyms]   [k] xen_set_pte_at

All that time is spent in your virtualization solution. One thing to try
is to look on the host system, sometimes profiles there can be more
meaningful.

You cannot profile the host on EC2.

You could try HVM. I've noticed it fares better under heavy CPU load,
and it's not fully HVM (it still uses paravirtualized network and
I/O).


#11 Metin Doslu
metin@citusdata.com
In reply to: Claudio Freire (#10)
Re: [HACKERS] Parallel Select query performance and shared buffers

You could try HVM. I've noticed it fare better under heavy CPU load,
and it's not fully-HVM (it still uses paravirtualized network and
I/O).

I already tried with HVM (a cc2.8xlarge instance on Amazon EC2) and observed
the same problem.

#12 Andres Freund
andres@2ndquadrant.com
In reply to: Claudio Freire (#10)
Re: Parallel Select query performance and shared buffers

On 2013-12-04 16:00:40 -0200, Claudio Freire wrote:

On Wed, Dec 4, 2013 at 1:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

All that time is spent in your virtualization solution. One thing to try
is to look on the host system, sometimes profiles there can be more
meaningful.

You cannot profile the host on EC2.

Didn't follow the thread from the start. So, this is EC2? Have you
checked, with a recent enough version of top or whatever, how much time
is reported as "stolen"?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Metin Doslu
metin@citusdata.com
In reply to: Andres Freund (#12)
Re: Parallel Select query performance and shared buffers

Didn't follow the thread from the start. So, this is EC2? Have you
checked, with a recent enough version of top or whatever, how much time
is reported as "stolen"?

Yes, this is EC2. "stolen" is randomly reported as 1, but mostly as 0.

#14 Metin Doslu
metin@citusdata.com
In reply to: Metin Doslu (#1)
Re: Parallel Select query performance and shared buffers

Here is some extra information:

- When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared
for 8-core machines but came back with 16-core machines on Amazon EC2.
Could it be related to the PostgreSQL locking mechanism?

- I tried this test with 4-core machines, including my personal computer and
some other instances on Amazon EC2, and I didn't see this problem with 4-core
machines. I started to see this problem in PostgreSQL when the core count is 8
or more.

- Here are the results of "vmstat 1" while running 8 parallel select
count(*). Normally I would expect zero idle time.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff    cache   si   so    bi    bo    in     cs us sy  id wa st
 0  0      0 29838640  94000 38954740    0    0     0     0    22     21  0  0 100  0  0
 7  2      0 29788416  94000 38954740    0    0     0     0 53922 108490 14 24  60  1  1
 5  0      0 29747248  94000 38954740    0    0     0     0 68008 164571 22 48  27  2  1
 8  0      0 29725796  94000 38954740    0    0     0     0 43587 150574 28 54  16  1  1
 0  0      0 29838328  94000 38954740    0    0     0     0 15584 100459 26 55  18  1  0
 0  0      0 29838328  94000 38954740    0    0     0     0    42     15  0  0 100  0  0

- When I run 8 parallel wc commands or other scripts, they scale out as
expected and utilize all CPUs. This leads me to think that the problem is
related to PostgreSQL rather than the OS.
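
The comparison run is a sketch along these lines (file names are
placeholders for the generated TPC-H data files):

# same driver shape as the SQL runs, but plain userspace file scanning
time (
    for i in $(seq 1 8); do
        wc -l lineitem_$i.tbl >/dev/null &
    done
    wait
)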

#15 Andres Freund
andres@2ndquadrant.com
In reply to: Metin Doslu (#14)
Re: [HACKERS] Parallel Select query performance and shared buffers

On 2013-12-04 20:19:55 +0200, Metin Doslu wrote:

- When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared
for 8-core machines but came back with 16-core machines on Amazon EC2.
Could it be related to the PostgreSQL locking mechanism?

You could try my lwlock-scalability improvement patches - for some
workloads here, the improvements have been rather noticeable. Which
version are you testing?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16 Metin Doslu
metin@citusdata.com
In reply to: Andres Freund (#15)
Re: Parallel Select query performance and shared buffers

You could try my lwlock-scalability improvement patches - for some
workloads here, the improvements have been rather noticeable. Which
version are you testing?

I'm testing with PostgreSQL 9.3.1.

#17 Amit Kapila
amit.kapila16@gmail.com
In reply to: Claudio Freire (#3)
Re: Parallel Select query performance and shared buffers

On Wed, Dec 4, 2013 at 10:40 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Wed, Dec 4, 2013 at 12:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As a quick aside, we also repeated the same experiment on an EC2 instance
with 16 CPU cores, and found that the scale out behavior became worse there.
(We also tried increasing the shared_buffers to 30 GB. This change
completely solved the scaling out problem on this instance type, but hurt
our performance on the hi1.4xlarge instances.)

Instead of 30GB, you can try with lesser value, but it should be close
to your data size.

The OS cache should have provided a similar function.

The performance cannot be the same as when those pages are in shared buffers, because
a. the OS can flush those pages
b. loading them into shared buffers again will have some overhead.

In fact, larger shared buffers shouldn't have made a difference if the
main I/O pattern is sequential scans, because they use a ring buffer.

Yeah, this is right, but then why is he able to see scaling when he
increases shared_buffers to a larger value?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#18 Amit Kapila
amit.kapila16@gmail.com
In reply to: Metin Doslu (#14)
Re: Parallel Select query performance and shared buffers

On Wed, Dec 4, 2013 at 11:49 PM, Metin Doslu <metin@citusdata.com> wrote:

Here is some extra information:

- When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared
for 8-core machines but came back with 16-core machines on Amazon EC2.
Could it be related to the PostgreSQL locking mechanism?

I think there is a good chance of improvement here with the patch suggested
by Andres in this thread, but I still think it might not completely resolve
the current problem, as there will be overhead in associating data with
shared buffers.

Currently NUM_BUFFER_PARTITIONS is fixed, so maybe auto-tuning it based on
some parameters could help in such situations.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#19 Metin Doslu
metin@citusdata.com
In reply to: Metin Doslu (#14)
Re: Parallel Select query performance and shared buffers

- When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared
for 8-core machines but came back with 16-core machines on Amazon EC2.
Could it be related to the PostgreSQL locking mechanism?

If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we
see tons of contention with the default value of NUM_BUFFER_PARTITIONS,
which is 16:

$ tail -f /tmp/logfile | grep lwlock | egrep -v "blk 0"
...
PID 15965 lwlock 0: shacq 0 exacq 33 blk 2
PID 15965 lwlock 34: shacq 14010 exacq 27134 blk 6192
PID 15965 lwlock 35: shacq 14159 exacq 27397 blk 5426
PID 15965 lwlock 36: shacq 14111 exacq 27322 blk 4959
PID 15965 lwlock 37: shacq 14211 exacq 27507 blk 4370
PID 15965 lwlock 38: shacq 14110 exacq 27294 blk 3980
PID 15965 lwlock 39: shacq 13962 exacq 27027 blk 3719
PID 15965 lwlock 40: shacq 14023 exacq 27156 blk 3273
PID 15965 lwlock 41: shacq 14107 exacq 27309 blk 3201
PID 15965 lwlock 42: shacq 14120 exacq 27304 blk 2904
PID 15965 lwlock 43: shacq 14007 exacq 27129 blk 2740
PID 15965 lwlock 44: shacq 13948 exacq 27027 blk 2616
PID 15965 lwlock 45: shacq 14041 exacq 27198 blk 2431
PID 15965 lwlock 46: shacq 14067 exacq 27277 blk 2345
PID 15965 lwlock 47: shacq 14050 exacq 27203 blk 2106
PID 15965 lwlock 48: shacq 13910 exacq 26910 blk 2155
PID 15965 lwlock 49: shacq 14170 exacq 27360 blk 1989

After we increased NUM_BUFFER_PARTITIONS to 1024, lock contention
decreased:
...
PID 25220 lwlock 1000: shacq 247 exacq 494 blk 1
PID 25220 lwlock 1001: shacq 198 exacq 394 blk 1
PID 25220 lwlock 1002: shacq 203 exacq 404 blk 1
PID 25220 lwlock 1003: shacq 226 exacq 452 blk 1
PID 25220 lwlock 1004: shacq 235 exacq 470 blk 1
PID 25220 lwlock 1006: shacq 226 exacq 452 blk 2
PID 25220 lwlock 1007: shacq 214 exacq 428 blk 1
PID 25220 lwlock 1008: shacq 225 exacq 448 blk 1
PID 25220 lwlock 1010: shacq 209 exacq 418 blk 1
PID 25220 lwlock 1015: shacq 199 exacq 398 blk 1
PID 25220 lwlock 1016: shacq 214 exacq 426 blk 1
PID 25220 lwlock 1018: shacq 230 exacq 456 blk 1
PID 25220 lwlock 1019: shacq 222 exacq 444 blk 3
PID 25220 lwlock 1023: shacq 262 exacq 524 blk 1
PID 25220 lwlock 1027: shacq 213 exacq 426 blk 1
PID 25220 lwlock 1028: shacq 246 exacq 491 blk 1
PID 25220 lwlock 1029: shacq 226 exacq 452 blk 1
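
(For reference, the two listings above come from builds along these lines;
enabling LWLOCK_STATS via CFLAGS and the header that defines
NUM_BUFFER_PARTITIONS are assumptions to check against your source tree.)

# build with per-backend lock statistics (printed to the server log at backend exit)
./configure CFLAGS="-O2 -DLWLOCK_STATS"
make -j8 && make install

# for the second run: raise the buffer-mapping partition count (a compile-time
# constant) by editing src/include/storage/lwlock.h:
#     #define NUM_BUFFER_PARTITIONS  16   ->   1024
# then rebuild, reinstall, and restart the server
make -j8 && make install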

#20 Andres Freund
andres@2ndquadrant.com
In reply to: Metin Doslu (#19)
Re: Parallel Select query performance and shared buffers

On 2013-12-05 11:15:20 +0200, Metin Doslu wrote:

- When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared
for 8-core machines but came back with 16-core machines on Amazon EC2.
Could it be related to the PostgreSQL locking mechanism?

If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we
see tons of contention with default value of NUM_BUFFER_PARTITIONS which is
16:

Is your workload bigger than RAM? I think a good bit of the contention
you're seeing in that listing is from populating shared_buffers - and might
actually vanish once you're halfway cached.
From what I've seen so far, the bigger problem than contention on the
lwlocks themselves is the spinlock protecting the lwlocks...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21 Metin Doslu
metin@citusdata.com
In reply to: Andres Freund (#20)
Re: Parallel Select query performance and shared buffers

Is your workload bigger than RAM?

RAM is bigger than workload (more than a couple of times).

I think a good bit of the contention
you're seeing in that listing is populating shared_buffers - and might
actually vanish once you're halfway cached.
From what I've seen so far the bigger problem than contention in the
lwlocks itself, is the spinlock protecting the lwlocks...

Could you clarify a bit what you mean by "halfway cached" and "spinlock
protecting the lwlocks"?

#22 Andres Freund
andres@2ndquadrant.com
In reply to: Metin Doslu (#21)
Re: Parallel Select query performance and shared buffers

On 2013-12-05 11:33:29 +0200, Metin Doslu wrote:

Is your workload bigger than RAM?

RAM is bigger than workload (more than a couple of times).

I think a good bit of the contention
you're seeing in that listing is populating shared_buffers - and might
actually vanish once you're halfway cached.
From what I've seen so far the bigger problem than contention in the
lwlocks itself, is the spinlock protecting the lwlocks...

Could you clarify a bit what you mean by "halfway cached"

Well, your stats showed a) fairly low lock counts overall and b) a high
percentage of exclusive locks.
a) indicates the system wasn't running long.
b) tells me there were lots of changes to the buffer mapping - which
basically only happens if a buffer is placed in or removed from
shared_buffers.

If your shared_buffers is big enough to contain most of the data, you
shouldn't see many exclusive locks in comparison to the number of shared
locks.

and "spinlock protecting the lwlocks".

Every LWLock has an internal spinlock to protect its state. So whenever
somebody does an LWLockAcquire()/Release(), even if only in shared mode,
we currently acquire that spinlock, manipulate the LWLock's state, and
release the spinlock again. In lots of workloads that internal spinlock
is the contention point, not the length of time the lwlock is held.
Especially when they are mostly held in shared mode.

Makes sense?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#23 Metin Doslu
metin@citusdata.com
In reply to: Andres Freund (#15)
Re: [HACKERS] Parallel Select query performance and shared buffers

You could try my lwlock-scalability improvement patches - for some
workloads here, the improvements have been rather noticeable. Which
version are you testing?

I tried your patches from the link below. As you suspected, I didn't see any
improvements. I tested them on PostgreSQL 9.2 Stable.

http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/REL9_2_STABLE-rwlock-contention


#24 Andres Freund
andres@2ndquadrant.com
In reply to: Metin Doslu (#23)
Re: [HACKERS] Parallel Select query performance and shared buffers

On 2013-12-05 17:46:44 +0200, Metin Doslu wrote:

I tried your patches from the link below. As you suspected, I didn't see any
improvements. I tested them on PostgreSQL 9.2 Stable.

You tested the correct branch, right? Which commit does "git rev-parse
HEAD" show?

But generally, as long as your profile hides all the important
information behind the hypervisor's cost, you're going to have a hard
time analyzing the problems. You really should try to reproduce the
problems on native hardware (as similar to the host hardware as
possible), to get accurate data. On CPU bound workloads that information
is often transportable to the virtual world.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#25 Metin Doslu
metin@citusdata.com
In reply to: Andres Freund (#24)
Re: Parallel Select query performance and shared buffers

You tested the correct branch, right? Which commit does "git rev-parse
HEAD" show?

I applied the last two patches manually on PostgreSQL 9.2 Stable.

#26 Metin Doslu
metin@citusdata.com
In reply to: Andres Freund (#20)
Re: Parallel Select query performance and shared buffers

From what I've seen so far the bigger problem than contention in the
lwlocks itself, is the spinlock protecting the lwlocks...

Postgres 9.3.1 also reports spindelay; it seems that there is no contention
on spinlocks.

PID 21121 lwlock 0: shacq 0 exacq 33 blk 1 spindelay 0
PID 21121 lwlock 33: shacq 7602 exacq 14688 blk 4381 spindelay 0
PID 21121 lwlock 34: shacq 7826 exacq 15113 blk 3786 spindelay 0
PID 21121 lwlock 35: shacq 7792 exacq 15110 blk 3356 spindelay 0
PID 21121 lwlock 36: shacq 7803 exacq 15125 blk 3075 spindelay 0
PID 21121 lwlock 37: shacq 7822 exacq 15177 blk 2756 spindelay 0
PID 21121 lwlock 38: shacq 7694 exacq 14863 blk 2513 spindelay 0
PID 21121 lwlock 39: shacq 7914 exacq 15320 blk 2400 spindelay 0
PID 21121 lwlock 40: shacq 7855 exacq 15203 blk 2220 spindelay 0
PID 21121 lwlock 41: shacq 7942 exacq 15363 blk 1996 spindelay 0
PID 21121 lwlock 42: shacq 7828 exacq 15115 blk 1872 spindelay 0
PID 21121 lwlock 43: shacq 7820 exacq 15159 blk 1833 spindelay 0
PID 21121 lwlock 44: shacq 7709 exacq 14916 blk 1590 spindelay 0
PID 21121 lwlock 45: shacq 7831 exacq 15134 blk 1619 spindelay 0
PID 21121 lwlock 46: shacq 7744 exacq 14989 blk 1559 spindelay 0
PID 21121 lwlock 47: shacq 7808 exacq 15111 blk 1473 spindelay 0
PID 21121 lwlock 48: shacq 7729 exacq 14929 blk 1381 spindelay 0

#27 Claudio Freire
klaussfreire@gmail.com
In reply to: Metin Doslu (#26)
Re: Parallel Select query performance and shared buffers

On Thu, Dec 5, 2013 at 1:03 PM, Metin Doslu <metin@citusdata.com> wrote:

From what I've seen so far the bigger problem than contention in the
lwlocks itself, is the spinlock protecting the lwlocks...

Postgres 9.3.1 also reports spindelay, it seems that there is no contention
on spinlocks.

Did you check hugepages?
