Intel S3500 -- hot stuff
I recently sourced a 300GB Intel S3500 SSD to do some performance
testing. I didn't see a lot of results on the web, so I thought I'd
post some numbers. The testing machine is my workstation crapbox with 4
cores and 8GB RAM (of which about 4GB is usable by the ~50GB database).
The drive cost $260 at Newegg (under $1/GB) and is write durable.
Single-thread 'select only' results are pretty stable; 2200 tps isn't
bad. Of particular note is the sub-millisecond latency of the read.
Per iostat I'm getting ~55 MB/sec read off the device and around 4100
device tps:
transaction type: SELECT only
scaling factor: 3000
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 22061
tps = 2206.019701 (including connections establishing)
tps = 2206.534467 (excluding connections establishing)
statement latencies in milliseconds:
0.003143 \set naccounts 100000 * :scale
0.000776 \setrandom aid 1 :naccounts
0.447513 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
Multi-thread 'select only' results are also pretty stable: I get
around 16-17k tps, but of note:
*) iowait in the mid 40s
*) cpu bound
*) consistent 430 MB/sec off the device per iostat!! That's
incredible!! (some of the latency may in fact be from SATA)
transaction type: SELECT only
scaling factor: 3000
query mode: simple
number of clients: 32
number of threads: 32
duration: 20 s
number of transactions actually processed: 321823
tps = 16052.052818 (including connections establishing)
tps = 16062.973737 (excluding connections establishing)
statement latencies in milliseconds:
0.002469 \set naccounts 100000 * :scale
0.000528 \setrandom aid 1 :naccounts
1.984443 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
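A quick back-of-the-envelope check on those numbers (the tps and iostat figures are the ones quoted above; the interpretation is mine): 430 MB/sec spread over ~16k transactions works out to roughly three 8kB pages read per primary-key lookup, which is plausible if the upper B-tree levels stay cached while leaf and heap pages mostly miss.

```python
# Bytes and 8kB pages read per transaction, from the figures above.
read_bytes_per_sec = 430 * 10**6   # ~430 MB/sec per iostat
tps = 16052                        # per pgbench

bytes_per_xact = read_bytes_per_sec / tps   # ~26.8 kB per transaction
pages_per_xact = bytes_per_xact / 8192      # ~3.3 pages per transaction
print(f"{pages_per_xact:.1f} pages of 8kB per transaction")
```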
For random write tests, I see around 1000 tps single-threaded and
~4700 with 32 threads. These results are more volatile and,
importantly, I disabled the synchronous commit feature. For the price,
this drive is hard to beat unless you are doing tons and tons of
writing (in which case I'd opt for a more expensive drive like the
S3700). It is perfectly suited for OLAP work IMO, since SSDs like big
sequential loads and handle random access of the data with no problem.
merlin
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
On Wed, Nov 5, 2014 at 11:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
I recently sourced a 300GB Intel S3500 SSD to do some performance
testing. I didn't see a lot of results on the web, so I thought I'd
post some numbers. The testing machine is my workstation crapbox with 4
cores and 8GB RAM (of which about 4GB is usable by the ~50GB database).
The drive cost $260 at Newegg (under $1/GB) and is write durable.
Here's another fascinating data point. I was playing around with
effective_io_concurrency for the device with bitmap heap scans on the
scale 3000 database (again, the performance numbers are very stable
across runs):
bench=# explain (analyze, buffers) select * from pgbench_accounts
where aid between 1000 and 50000000 and abalance != 0;
QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Bitmap Heap Scan on pgbench_accounts (cost=1059541.66..6929604.57
rows=1 width=97) (actual time=5040.128..23089.651 rows=1420738
loops=1)
Recheck Cond: ((aid >= 1000) AND (aid <= 50000000))
Rows Removed by Index Recheck: 3394823
Filter: (abalance <> 0)
Rows Removed by Filter: 48578263
Buffers: shared hit=3 read=1023980
-> Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..1059541.66 rows=50532109 width=0) (actual
time=5038.707..5038.707 rows=49999001 loops=1)
Index Cond: ((aid >= 1000) AND (aid <= 50000000))
Buffers: shared hit=3 read=136611
Total runtime: 46251.375 ms
effective_io_concurrency 1: 46.3 sec, ~ 170 mb/sec peak via iostat
effective_io_concurrency 2: 49.3 sec, ~ 158 mb/sec peak via iostat
effective_io_concurrency 4: 29.1 sec, ~ 291 mb/sec peak via iostat
effective_io_concurrency 8: 23.2 sec, ~ 385 mb/sec peak via iostat
effective_io_concurrency 16: 22.1 sec, ~ 409 mb/sec peak via iostat
effective_io_concurrency 32: 20.7 sec, ~ 447 mb/sec peak via iostat
effective_io_concurrency 64: 20.0 sec, ~ 468 mb/sec peak via iostat
effective_io_concurrency 128: 19.3 sec, ~ 488 mb/sec peak via iostat
effective_io_concurrency 256: 19.2 sec, ~ 494 mb/sec peak via iostat
Did not see consistent measurable gains above effective_io_concurrency
256. Interesting that the setting of '2' (the lowest possible setting
with the feature actually working) is pessimal.
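For scale, the buffer counts in the EXPLAIN output pin down how much data this query pulls in, so the timings above translate directly into effective scan rates (a worked check; treating every `read=` buffer as a physical read is my assumption, since some of those may be OS cache hits):

```python
# Data volume implied by the EXPLAIN (analyze, buffers) output above.
heap_reads = 1023980    # Bitmap Heap Scan: Buffers shared read=
index_reads = 136611    # Bitmap Index Scan: Buffers shared read=
block_size = 8192

total_gb = (heap_reads + index_reads) * block_size / 1e9   # ~9.5 GB
rate_eic_1 = total_gb * 1000 / 46.3     # ~205 MB/s at effective_io_concurrency 1
rate_eic_256 = total_gb * 1000 / 19.2   # ~495 MB/s at effective_io_concurrency 256
print(f"{total_gb:.1f} GB read; {rate_eic_1:.0f} -> {rate_eic_256:.0f} MB/s")
```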
merlin
On Wed, Nov 5, 2014 at 12:09:16PM -0600, Merlin Moncure wrote:
effective_io_concurrency 1: 46.3 sec, ~ 170 mb/sec peak via iostat
effective_io_concurrency 2: 49.3 sec, ~ 158 mb/sec peak via iostat
effective_io_concurrency 4: 29.1 sec, ~ 291 mb/sec peak via iostat
effective_io_concurrency 8: 23.2 sec, ~ 385 mb/sec peak via iostat
effective_io_concurrency 16: 22.1 sec, ~ 409 mb/sec peak via iostat
effective_io_concurrency 32: 20.7 sec, ~ 447 mb/sec peak via iostat
effective_io_concurrency 64: 20.0 sec, ~ 468 mb/sec peak via iostat
effective_io_concurrency 128: 19.3 sec, ~ 488 mb/sec peak via iostat
effective_io_concurrency 256: 19.2 sec, ~ 494 mb/sec peak via iostat
Did not see consistent measurable gains > 256
effective_io_concurrency. Interesting that at setting of '2' (the
lowest possible setting with the feature actually working) is
pessimal.
Very interesting. When we added a per-tablespace random_page_cost,
there was a suggestion that we might want to add per-tablespace
effective_io_concurrency someday:
commit d86d51a95810caebcea587498068ff32fe28293e
Author: Robert Haas <rhaas@postgresql.org>
Date: Tue Jan 5 21:54:00 2010 +0000
Support ALTER TABLESPACE name SET/RESET ( tablespace_options ).
This patch only supports seq_page_cost and random_page_cost as parameters,
but it provides the infrastructure to scalably support many more.
In particular, we may want to add support for effective_io_concurrency,
but I'm leaving that as future work for now.
Thanks to Tom Lane for design help and Alvaro Herrera for the review.
It seems that time has come.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Sat, Dec 6, 2014 at 7:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Nov 5, 2014 at 12:09:16PM -0600, Merlin Moncure wrote:
effective_io_concurrency 1: 46.3 sec, ~ 170 mb/sec peak via iostat
effective_io_concurrency 2: 49.3 sec, ~ 158 mb/sec peak via iostat
effective_io_concurrency 4: 29.1 sec, ~ 291 mb/sec peak via iostat
effective_io_concurrency 8: 23.2 sec, ~ 385 mb/sec peak via iostat
effective_io_concurrency 16: 22.1 sec, ~ 409 mb/sec peak via iostat
effective_io_concurrency 32: 20.7 sec, ~ 447 mb/sec peak via iostat
effective_io_concurrency 64: 20.0 sec, ~ 468 mb/sec peak via iostat
effective_io_concurrency 128: 19.3 sec, ~ 488 mb/sec peak via iostat
effective_io_concurrency 256: 19.2 sec, ~ 494 mb/sec peak via iostat
Did not see consistent measurable gains > 256
effective_io_concurrency. Interesting that at setting of '2' (the
lowest possible setting with the feature actually working) is
pessimal.
Very interesting. When we added a per-tablespace random_page_cost,
there was a suggestion that we might want to add per-tablespace
effective_io_concurrency someday:
What I'd really like to see is to have effective_io_concurrency work
on other types of scans. It's clearly a barn burner on fast storage,
and perhaps the default should be something other than '1'. Spinning
storage is clearly dead, and SSDs seem to really benefit from the
posix readahead API.
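For reference, the "posix readahead API" here is posix_fadvise(2) with POSIX_FADV_WILLNEED, which is what PostgreSQL's PrefetchBuffer issues when built with USE_PREFETCH. A minimal sketch of the pattern (my own illustrative helper, not PostgreSQL code; requires an OS that exposes posix_fadvise, e.g. Linux):

```python
import os

def prefetch_blocks(fd, block_numbers, block_size=8192):
    """Hint the kernel to start reading the given blocks in the
    background, so later read()s of those ranges can hit the page cache."""
    for blkno in block_numbers:
        os.posix_fadvise(fd, blkno * block_size, block_size,
                         os.POSIX_FADV_WILLNEED)
```

The call is purely advisory and returns immediately; a kernel that ignores the hint just leaves you with the plain read cost.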
merlin
On Mon, Dec 8, 2014 at 03:40:43PM -0600, Merlin Moncure wrote:
Did not see consistent measurable gains > 256
effective_io_concurrency. Interesting that at setting of '2' (the
lowest possible setting with the feature actually working) is
pessimal.
Very interesting. When we added a per-tablespace random_page_cost,
there was a suggestion that we might want to add per-tablespace
effective_io_concurrency someday:
What I'd really like to see is to have effective_io_concurrency work
on other types of scans. It's clearly a barn burner on fast storage
and perhaps the default should be something other than '1'. Spinning
storage is clearly dead and ssd seem to really benefit from the posix
readhead api.
Well, the real question is knowing which blocks to request before
actually needing them. With a bitmap scan, that is easy --- I am
unclear how to do it for other scans. We already have kernel read-ahead
for sequential scans, and any index scan that hits multiple rows will
probably already be using a bitmap heap scan.
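The bitmap case is easy for exactly the reason Bruce gives: the sorted list of heap blocks is known before any of them is read, so the scan can simply stay a fixed number of requests ahead. A toy model of that loop (hypothetical `prefetch`/`read` callbacks standing in for PrefetchBuffer/ReadBuffer; not the actual executor code):

```python
def bitmap_heap_fetch(blocks, prefetch, read, target_prefetch_pages):
    """Read heap blocks in order, keeping up to target_prefetch_pages
    prefetch requests issued ahead of the block currently being read."""
    ahead = 0          # index of the next block to consider prefetching
    results = []
    for i, _ in enumerate(blocks):
        # top up the prefetch window before reading block i
        while ahead < len(blocks) and ahead <= i + target_prefetch_pages:
            if ahead > i:              # only blocks not yet read
                prefetch(blocks[ahead])
            ahead += 1
        results.append(read(blocks[i]))
    return results
```

With target_prefetch_pages = 0 the inner loop never issues a prefetch, matching the "zero means never prefetch" behavior of effective_io_concurrency.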
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Dec 9, 2014 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Dec 8, 2014 at 03:40:43PM -0600, Merlin Moncure wrote:
Did not see consistent measurable gains > 256
effective_io_concurrency. Interesting that at setting of '2' (the
lowest possible setting with the feature actually working) is
pessimal.
Very interesting. When we added a per-tablespace random_page_cost,
there was a suggestion that we might want to add per-tablespace
effective_io_concurrency someday:
What I'd really like to see is to have effective_io_concurrency work
on other types of scans. It's clearly a barn burner on fast storage
and perhaps the default should be something other than '1'. Spinning
storage is clearly dead and ssd seem to really benefit from the posix
readhead api.
I haven't played much with SSD, but effective_io_concurrency can be a big
win even on spinning disk.
Well, the real question is knowing which blocks to request before
actually needing them. With a bitmap scan, that is easy --- I am
unclear how to do it for other scans. We already have kernel read-ahead
for sequential scans, and any index scan that hits multiple rows will
probably already be using a bitmap heap scan.
If the index scan is used to provide ordering as well as selectivity, then
it will resist being converted to a bitmap scan. Also, it won't convert to
a bitmap scan solely to get credit for the use of effective_io_concurrency,
as that setting doesn't enter into planning decisions.
For a regular index scan, it should be easy to prefetch table blocks for
all the tuples that will need to be retrieved based on the current index
leaf page, for example. Looking ahead across leaf page boundaries would be
harder.
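Jeff's per-leaf-page idea can be sketched the same way: gather the heap blocks referenced by the current leaf page, issue prefetches for all of them, then fetch the tuples in index order (so an ORDER BY-driven scan is unaffected). Illustrative only, with hypothetical `prefetch`/`fetch` callbacks:

```python
def index_scan_with_leaf_prefetch(leaf_pages, prefetch, fetch):
    """leaf_pages: one list of (heap_block, offset) TIDs per index leaf page.
    Prefetch every heap block a leaf references before fetching its tuples."""
    rows = []
    for tids in leaf_pages:
        for blkno in sorted({blk for blk, _ in tids}):
            prefetch(blkno)            # hint all heap blocks for this leaf
        for tid in tids:
            rows.append(fetch(tid))    # fetch in index order
    return rows
```

Looking ahead across leaf boundaries would mean issuing hints for the next leaf's blocks before finishing the current one, which is the harder part mentioned above.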
Cheers,
Jeff
On 10/12/2014 17:52, Jeff Janes wrote:
On Tue, Dec 9, 2014 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Dec 8, 2014 at 03:40:43PM -0600, Merlin Moncure wrote:
Did not see consistent measurable gains > 256
effective_io_concurrency. Interesting that at setting of '2' (the
lowest possible setting with the feature actually working) is
pessimal.
Very interesting. When we added a per-tablespace random_page_cost,
there was a suggestion that we might want to add per-tablespace
effective_io_concurrency someday:
What I'd really like to see is to have effective_io_concurrency work
on other types of scans. It's clearly a barn burner on fast storage
and perhaps the default should be something other than '1'. Spinning
storage is clearly dead and SSDs seem to really benefit from the posix
readahead API.
I haven't played much with SSD, but effective_io_concurrency can be a
big win even on spinning disk.
Well, the real question is knowing which blocks to request before
actually needing them. With a bitmap scan, that is easy --- I am
unclear how to do it for other scans. We already have kernel read-ahead
for sequential scans, and any index scan that hits multiple rows will
probably already be using a bitmap heap scan.
If the index scan is used to provide ordering as well as selectivity,
then it will resist being converted to a bitmap scan. Also, it won't
convert to a bitmap scan solely to get credit for the use of
effective_io_concurrency, as that setting doesn't enter into planning
decisions.
For a regular index scan, it should be easy to prefetch table blocks for
all the tuples that will need to be retrieved based on the current index
leaf page, for example. Looking ahead across leaf page boundaries would
be harder.
I also think that having effective_io_concurrency for nodes other than
bitmap heap scan would be really great, but for now a per-tablespace
effective_io_concurrency is simpler to implement and will already help,
so here's a patch to implement it. I'm also adding it to the next
commitfest.
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Attachment: io_concurrency_per_tablespace-v1.patch (text/x-patch, +150/-66)
On 18/07/2015 12:03, Julien Rouhaud wrote:
I also think that having effective_io_concurrency for nodes other than
bitmap heap scan would be really great, but for now a per-tablespace
effective_io_concurrency is simpler to implement and will already help,
so here's a patch to implement it. I'm also adding it to the next
commitfest.
I didn't know that the thread must exist on -hackers to be able to add
a commitfest entry, so I'm transferring the thread here.
Sorry for the double post.
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Attachment: io_concurrency_per_tablespace-v1.patch (text/x-patch, +150/-66)
Hi,
On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
I didn't know that the thread must exist on -hackers to be able to add
a commitfest entry, so I transfer the thread here.
Please, in the future, also update the title of the thread to something
fitting.
@@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+#ifdef USE_PREFETCH
+	int			new_io_concurrency;
+#endif

 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));

@@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 */
 	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);

+#ifdef USE_PREFETCH
+	/* check if the effective_io_concurrency has been overloaded for the
+	 * tablespace storing the relation and compute the target_prefetch_pages,
+	 * or just get the current target_prefetch_pages
+	 */
+	new_io_concurrency = get_tablespace_io_concurrency(
+			currentRelation->rd_rel->reltablespace);
+
+	scanstate->target_prefetch_pages = target_prefetch_pages;
+
+	if (new_io_concurrency != effective_io_concurrency)
+	{
+		double		prefetch_pages;
+
+		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
+			scanstate->target_prefetch_pages = rint(prefetch_pages);
+	}
+#endif
Maybe it's just me - but imo there should be as few USE_PREFETCH
dependent places in the code as possible. It'll just be 0 when not
supported, that's fine? Especially changing the size of externally
visible structs depending on a configure-detected ifdef seems wrong to
me.
+bool
+compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
+{
+	double		new_prefetch_pages = 0.0;
+	int			i;
+
+	/* make sure the io_concurrency value is correct, it may have been forced
+	 * with a pg_tablespace UPDATE
+	 */
Nitpick: Wrong comment style (/* stands on its own line).
+	if (io_concurrency > MAX_IO_CONCURRENCY)
+		io_concurrency = MAX_IO_CONCURRENCY;
+
+	/*----------
+	 * The user-visible GUC parameter is the number of drives (spindles),
+	 * which we need to translate to a number-of-pages-to-prefetch target.
+	 * The target value is stashed in *extra and then assigned to the actual
+	 * variable by assign_effective_io_concurrency.
+	 *
+	 * The expected number of prefetch pages needed to keep N drives busy is:
+	 *
+	 * drives |   I/O requests
+	 * -------+----------------
+	 *		1 |   1
+	 *		2 |   2/1 + 2/2 = 3
+	 *		3 |   3/1 + 3/2 + 3/3 = 5 1/2
+	 *		4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+	 *		n |   n * H(n)
I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?
Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for
this. bufmgr.c maybe?
You also didn't touch
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
* ReadBuffer calls by. This is maintained by the assign hook for
* effective_io_concurrency. Zero means "never prefetch".
*/
int target_prefetch_pages = 0;
which surely doesn't make sense anymore after these changes.
But do we even need that variable now?
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index dc167f9..57008fc 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -26,6 +26,9 @@
 #define MAX_KILOBYTES	(INT_MAX / 1024)
 #endif

+/* upper limit for effective_io_concurrency */
+#define MAX_IO_CONCURRENCY 1000
+
 /*
  * Automatic configuration file name for ALTER SYSTEM.
  * This file will be used to store values of configuration parameters
@@ -256,6 +259,8 @@ extern int	temp_file_limit;

 extern int	num_temp_buffers;

+extern int	effective_io_concurrency;
+
target_prefetch_pages is declared in bufmgr.h - that seems like a better
place for these.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi

On 09/02/2015 03:53 PM, Andres Freund wrote:
On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
I didn't know that the thread must exist on -hackers to be able to add
a commitfest entry, so I transfer the thread here.
Please, in the future, also update the title of the thread to something
fitting.
Maybe it's just me - but imo there should be as few USE_PREFETCH
dependent places in the code as possible. It'll just be 0 when not
supported, that's fine?
It's not just you. Dealing with code with plenty of ifdefs is annoying -
it compiles just fine most of the time, until you compile it with some
rare configuration. Then it either starts producing strange warnings or
the compilation fails entirely.
It might make a tiny difference on builds without prefetching support
because of code size, but realistically I think it's noise.
Especially changing the size of externally visible structs depending
on a configure-detected ifdef seems wrong to me.
+100 to that
I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?
Well, even the comment right after the formula says that:

 * Experimental results show that both of these formulas aren't
 * aggressive enough, but we don't really have any better proposals.
That's the reason why users generally either use 0 or some rather high
value (16 or 32 are the most common values seen). The problem is that we
don't really care about the number of spindles (and not just because
SSDs don't have them at all), but about the target queue length per
device. Spinning rust uses TCQ/NCQ to optimize the head movement; SSDs
are parallel by nature (stacking multiple chips with separate channels).
I doubt we can really improve the formula, except maybe for saying "we
want 16 requests per device" and multiplying the number by that. We
don't really have the necessary introspection to determine better values
(and it's not really possible anyway, because the devices may be hidden
behind a RAID controller or a SAN), so we can't really do much.
Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch".
Possibly with an explanation of how to estimate it (devices * queue length).
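The gap described here is easy to see numerically: the current spindle formula n·H(n) grows only slightly faster than n, while a "queue length per device" target scales linearly with whatever depth you pick (the 16 below is taken from the example above, not a measured optimum):

```python
def spindle_prefetch(n_drives):
    # current formula: expected requests to keep n drives busy, n * H(n)
    return sum(n_drives / k for k in range(1, n_drives + 1))

def queue_prefetch(n_devices, depth=16):
    # alternative: a fixed target queue length per device
    return n_devices * depth

for n in (1, 2, 4, 8):
    print(f"{n} devices: formula {spindle_prefetch(n):5.2f}, "
          f"queue-based {queue_prefetch(n)}")
```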
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch". Possibly
with an explanation of how to estimate it (devices * queue length).
I think that'd be a lot better.
On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?
I can take the blame for this formula.
It's called the "Coupon Collector Problem". If you get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each?
This computation model assumes we have no information about which
spindle each block will hit. That's basically true for the case of
bitmapheapscan for most cases because the idea of bitmapheapscan is to
be picking a sparse set of blocks and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.
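The coupon-collector expectation is exactly the n·H(n) column in the quoted comment (1, 3, 5 1/2, 8 1/3 for one to four drives), and a quick simulation agrees; a sketch (my own check, not code from the thread):

```python
import random

def expected_requests(n_drives):
    # analytic coupon-collector expectation: n * H(n)
    return sum(n_drives / k for k in range(1, n_drives + 1))

def simulated_requests(n_drives, trials=20000, seed=42):
    # empirically: issue uniformly random requests until every drive got one
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        while len(seen) < n_drives:
            seen.add(rng.randrange(n_drives))
            total += 1
    return total / trials

for n in (2, 3, 4):
    print(n, expected_requests(n), round(simulated_requests(n), 2))
```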
We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I
think the above formula works for essentially random I/O but for more
predictable I/O it might be possible to use a different formula. But
if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access pattern.
I did speak to a dm developer and he suggested that the kernel could
help out with an API. He suggested something of the form "how many
blocks do I have to read before the end of the current device". I
wasn't sure exactly what we would do with something like that but it
would be better than just guessing how many I/O operations we need to
issue to keep all the spindles busy.
Hi,
On 02/09/2015 18:06, Tomas Vondra wrote:
Hi
On 09/02/2015 03:53 PM, Andres Freund wrote:
Hi,
On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
I didn't know that the thread must exist on -hackers to be
able to add a commitfest entry, so I transfer the thread here.
Please, in the future, also update the title of the thread to
something fitting.
Sorry for that.
Maybe it's just me - but imo there should be as few USE_PREFETCH
dependent places in the code as possible. It'll just be 0 when
not supported, that's fine?
Especially changing the size of externally visible structs
depending on a configure-detected ifdef seems wrong to me.
+100 to that
I totally agree. I'll remove the ifdefs.
Nitpick: Wrong comment style (/* stands on its own line).
I did run pgindent before submitting the patch, but apparently I picked
the wrong one. Already fixed in my local branch.
Maybe the best thing we can do is just completely abandon the
"number of spindles" idea, and just say "number of I/O requests to
prefetch". Possibly with an explanation of how to estimate it
(devices * queue length).
I think that'd be a lot better.
+1 for that too.
If everyone's ok with this change, I can submit a patch for that too.
Should I split that into two patches, and/or start a new thread?
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/09/2015 15:53, Andres Freund wrote:
On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
You also didn't touch

/*
 * How many buffers PrefetchBuffer callers should try to stay ahead of
 * their ReadBuffer calls by. This is maintained by the assign hook for
 * effective_io_concurrency. Zero means "never prefetch".
 */
int			target_prefetch_pages = 0;

which surely doesn't make sense anymore after these changes.

But do we even need that variable now?
I thought this was related to the effective_io_concurrency GUC
(possibly overloaded by the per-tablespace setting), so I didn't make
any change on that.
I also just found an issue with my previous patch: the global
effective_io_concurrency GUC was ignored if the tablespace had a
specific seq_page_cost or random_page_cost setting. I just fixed that
in my local branch.
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index dc167f9..57008fc 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -26,6 +26,9 @@
 #define MAX_KILOBYTES	(INT_MAX / 1024)
 #endif

+/* upper limit for effective_io_concurrency */
+#define MAX_IO_CONCURRENCY 1000
+
 /*
  * Automatic configuration file name for ALTER SYSTEM.
  * This file will be used to store values of configuration parameters
@@ -256,6 +259,8 @@
 extern int	temp_file_limit;

 extern int	num_temp_buffers;

+extern int	effective_io_concurrency;
+
target_prefetch_pages is declared in bufmgr.h - that seems like a
better place for these.
I was rather sceptical about that too. I'll move these in bufmgr.h.
Regards.
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Hi,
On 09/02/2015 08:49 PM, Greg Stark wrote:
On 2 Sep 2015 14:54, "Andres Freund" <andres@anarazel.de> wrote:
+ /*----------
+  * The user-visible GUC parameter is the number of drives (spindles),
+  * which we need to translate to a number-of-pages-to-prefetch target.
+  * The target value is stashed in *extra and then assigned to the actual
+  * variable by assign_effective_io_concurrency.
+  *
+  * The expected number of prefetch pages needed to keep N drives busy is:
+  *
+  * drives |   I/O requests
+  * -------+----------------
+  *      1 |   1
+  *      2 |   2/1 + 2/2 = 3
+  *      3 |   3/1 + 3/2 + 3/3 = 5 1/2
+  *      4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+  *      n |   n * H(n)

I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.
It's called the "Coupon Collector Problem". If you get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each?

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for the case of
bitmapheapscan for most cases because the idea of bitmapheapscan is to
be picking a sparse set of blocks and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.
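The expectation Greg describes is easy to check empirically. A small Python simulation (purely illustrative; the "spindles" are just labels 0..n-1, and `coupon_draws`/`mean_draws` are my names, not anything in PostgreSQL):

```python
import random

def coupon_draws(n, rng):
    """Draw uniformly from n coupons until all n have been seen;
    return the number of draws that took."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

def mean_draws(n, trials=20000, seed=42):
    """Sample mean over many trials; converges to n * H(n)."""
    rng = random.Random(seed)
    return sum(coupon_draws(n, rng) for _ in range(trials)) / trials
```

For n = 4 the sample mean lands near 4 * (1 + 1/2 + 1/3 + 1/4) = 25/3 ≈ 8.33, matching the table in the quoted comment.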
There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the
queue. Which sucks in practice, because on spinning rust it makes
queuing (TCQ/NCQ) totally inefficient, and on SSDs it only saturates one
of the multiple channels.
On spinning drives, it's usually good to keep the iodepth >= 4. For
example this 10k Seagate drive [1] can do ~450 random IOPS with
iodepth=16, while a 10k drive should only be able to do ~150 IOPS
(with iodepth=1). The other SAS drives behave quite similarly.

[1] http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review
On SSDs the good values usually start at 16, depending on the model (and
controller), and size (large SSDs are basically multiple small ones
glued together, thus have more channels).
This is why the numbers from coupon collector are way too low in many
cases. (OTOH this is done per backend, so if there are multiple backends
doing prefetching ...)
We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I
As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:
A good starting point for this setting is the number of separate
drives comprising a RAID 0 stripe or RAID 1 mirror being used for
the database. (For RAID 5 the parity drive should not be counted.)
However, if the database is often busy with multiple queries
issued in concurrent sessions, lower values may be sufficient to
keep the disk array busy. A value higher than needed to keep the
disks busy will only result in extra CPU overhead.
So we recommend number of drives as a good starting value, and then warn
against increasing the value further.
Moreover, ISTM it's very unclear what value to use even if you know the
number of devices and optimal iodepth. Setting (devices * iodepth)
doesn't really make much sense, because that effectively computes
(devices*iodepth) * H(devices*iodepth)
which says "there are (devices*iodepth) devices, make sure there's at
least one request for each of them", right? I guess we actually want
(devices*iodepth) * H(devices)
Sadly that means we'd have to introduce another GUC, because we need
track both ndevices and iodepth.
There probably is a value X so that
X * H(X) ~= (devices*iodepth) * H(devices)
but it's far from clear that's what we need (it surely is not in the docs).
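To put a number on that mismatch: under the current GUC semantics, the only way to get a target of (devices*iodepth) * H(devices) pages is to find the X whose implied target matches. A hypothetical Python helper (my own sketch, not something proposed for the backend):

```python
def harmonic(n):
    """H(n) = sum of 1/i for i = 1..n."""
    return sum(1.0 / i for i in range(1, n + 1))

def implied_target(x):
    """Prefetch target the current formula derives from GUC value x."""
    return x * harmonic(x)

def equivalent_guc(devices, iodepth):
    """Smallest GUC value X with X*H(X) >= (devices*iodepth)*H(devices)."""
    goal = devices * iodepth * harmonic(devices)
    x = 1
    while implied_target(x) < goal:
        x += 1
    return x
```

For example, 4 devices with iodepth 16 want 64 * H(4) ≈ 133 prefetched pages, which the current formula only reaches at effective_io_concurrency = 33 - hardly a value a DBA would guess from "number of spindles".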
think the above formula works for essentially random I/O but for
more predictable I/O it might be possible to use a different formula.
But if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access pattern.
Maybe. We only use this in Bitmap Index Scan at this point, and I don't
see any proposals to introduce this to other places. So no opinion.
I did speak to a dm developer and he suggested that the kernel could
help out with an API. He suggested something of the form "how many
blocks do I have to read before the end of the current device". I
wasn't sure exactly what we would do with something like that but it
would be better than just guessing how many I/O operations we need
to issue to keep all the spindles busy.
I don't really see how that would help us?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:
My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Sep 2, 2015 at 4:31 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:

My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.
That's the exact opposite of my findings on intel S3500 (see:
/messages/by-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com).
merlin
On Wed, Sep 2, 2015 at 2:31 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:

My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.
Perhaps one reason is that the planner assumes it will get no benefit from
this setting, meaning it is somewhat unlikely to choose the types of plans
which would actually show a benefit from higher values.
Cheers,
Jeff
On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:

My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.
I saw pretty much the opposite - the benefits seldom were significant
below 30 or so. Even on single disks. Which actually isn't that
surprising - to actually be beneficial (that is, turn an I/O-bound
workload into a CPU-bound one) the prefetched buffer needs to actually
have been read in by the time it's needed. In many queries, processing
a single heap page takes far less time than prefetching the data from
storage, even if it's on good SSDs.
Therefore what you actually need is a queue of prefetches for the next
XX buffers, so that between starting a prefetch and actually needing the
buffer enough time has passed that the data is completely read in. And
the point is that that's the case even for a single rotating disk!
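That "queue of prefetches" can be modeled without any actual I/O. A toy Python scheduler (my own sketch; `schedule` is not a PostgreSQL function, and the real code ramps the prefetch distance up gradually rather than priming it all at once) that keeps prefetch issues a fixed distance ahead of the consuming reads:

```python
def schedule(blocks, distance):
    """Yield ('prefetch', blkno) and ('read', blkno) events so that each
    block's prefetch is issued before the block is read, with up to
    `distance` prefetches outstanding. Assumes distance >= 1."""
    for i, blk in enumerate(blocks):
        if i == 0:
            # prime the queue: prefetch the first `distance` blocks up front
            for b in blocks[:distance]:
                yield ('prefetch', b)
        elif i + distance - 1 < len(blocks):
            # keep the queue topped up: one new prefetch per consumed block
            yield ('prefetch', blocks[i + distance - 1])
        yield ('read', blk)
```

With distance=3 over six blocks, every block's prefetch event precedes its read event, and at most three prefetches are outstanding at any time - the shape Andres describes, where the reader ideally never blocks on I/O.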
Greetings,
Andres Freund
On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
I can take the blame for this formula.
It's called the "Coupon Collector Problem". If you get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each.
My point is that that's just the entirely wrong way to model
prefetching. Prefetching can be massively beneficial even if you only
have a single platter! Even if there were no queues on the hardware or
OS level! Concurrency isn't the right way to look at prefetching.
You need to prefetch so far ahead that you'll never block on reading
heap pages - and that's only the case if processing the next N heap
blocks takes longer than the prefetch of the (N+1)th page. That doesn't
mean there continuously have to be N+1 prefetches in progress - in fact
that actually often will only be the case for the first few; after that
you hopefully are bottlenecked on CPU.
If you additionally take into account hardware realities where you have
multiple platters, multiple spindles, command queueing etc, that's even
more true. A single rotation of a single platter with command queuing
can often read several non-consecutive blocks if they're on a similar
track.