Does larger i/o size make sense?

Started by Kohei KaiGai, over 12 years ago, 11 messages
#1Kohei KaiGai
kaigai@kaigai.gr.jp

Hello,

A few days ago, I got the question in the subject line during a
discussion with a colleague.

In general, a larger i/o size per system call gives wider bandwidth on
sequential reads than multiple system calls with a smaller i/o size.
This heuristic is probably well known.

On the other hand, PostgreSQL always reads database files in BLCKSZ units
(usually 8KB) when the referenced block is not in shared buffers, and it
doesn't seem to me that this can pull the maximum performance out of a
modern storage system.

I'm not certain whether we have discussed this kind of idea before.
So, if similar ideas were rejected in the past, I'd like to know why we
stick to a fixed-length i/o size.

An idea I'd like to investigate is to have PostgreSQL allocate a set of
contiguous buffers to fit a larger i/o size when a block is referenced by
a sequential scan, and then issue a single consolidated i/o request for
them. This probably makes sense when we can expect upcoming block
references to fall on neighboring blocks, which is the typical sequential
read workload.

Of course, we would need to solve some complicated issues, such as
preventing fragmentation of shared buffers, or enhancing the storage
manager's internal APIs to accept a larger i/o size.
Still, it seems to me this idea is worth investigating.
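
Just to illustrate the direction, here is a simplified standalone sketch
of what I mean by a consolidated request: a single pread() that covers
several neighboring blocks and fills one contiguous buffer. The block
count and the plain file handling are made up for the example; a real
implementation would have to go through the smgr/bufmgr layers.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ   8192
#define NBLOCKS  16            /* read 16 blocks (128KB) per system call */

int
main(int argc, char **argv)
{
    char       *buf;
    int         fd;
    off_t       start_block = 0;
    ssize_t     nread;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    buf = malloc(BLCKSZ * NBLOCKS);

    /* one large request covering NBLOCKS neighboring blocks */
    nread = pread(fd, buf, BLCKSZ * NBLOCKS, start_block * BLCKSZ);
    if (nread < 0)
        perror("pread");
    else
        printf("read %zd bytes (%zd blocks) in one call\n",
               nread, nread / BLCKSZ);

    free(buf);
    close(fd);
    return 0;
}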

Any comments are welcome. Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>


#2Merlin Moncure
mmoncure@gmail.com
In reply to: Kohei KaiGai (#1)
Re: Does larger i/o size make sense?

On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

Hello,

A few days ago, I got the question in the subject line during a
discussion with a colleague.

In general, a larger i/o size per system call gives wider bandwidth on
sequential reads than multiple system calls with a smaller i/o size.
This heuristic is probably well known.

On the other hand, PostgreSQL always reads database files in BLCKSZ units
(usually 8KB) when the referenced block is not in shared buffers, and it
doesn't seem to me that this can pull the maximum performance out of a
modern storage system.

I'm not certain whether we have discussed this kind of idea before.
So, if similar ideas were rejected in the past, I'd like to know why we
stick to a fixed-length i/o size.

An idea I'd like to investigate is to have PostgreSQL allocate a set of
contiguous buffers to fit a larger i/o size when a block is referenced by
a sequential scan, and then issue a single consolidated i/o request for
them. This probably makes sense when we can expect upcoming block
references to fall on neighboring blocks, which is the typical sequential
read workload.

Of course, we would need to solve some complicated issues, such as
preventing fragmentation of shared buffers, or enhancing the storage
manager's internal APIs to accept a larger i/o size.
Still, it seems to me this idea is worth investigating.

Any comments are welcome. Thanks,

Isn't this dealt with at least in part by effective i/o concurrency
and o/s readahead?
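
(For reference: on platforms that have posix_fadvise(), effective_io_concurrency
boils down to hinting the kernel about blocks we will want soon while the
ordinary 8KB reads keep going. A rough standalone sketch of that pattern --
not the actual PrefetchBuffer code path, and the prefetch distance here is
made up:)

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ            8192
#define PREFETCH_DISTANCE 8     /* blocks to hint ahead of the current one */

int
main(int argc, char **argv)
{
    char    page[BLCKSZ];
    off_t   blkno;
    int     fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    for (blkno = 0;; blkno++)
    {
        /* hint the kernel about a block we will want a little later */
        posix_fadvise(fd, (blkno + PREFETCH_DISTANCE) * BLCKSZ,
                      BLCKSZ, POSIX_FADV_WILLNEED);

        /* the usual one-block read; ideally it now hits the page cache */
        if (pread(fd, page, BLCKSZ, blkno * BLCKSZ) <= 0)
            break;
    }

    close(fd);
    return 0;
}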

merlin


#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#2)
Re: Does larger i/o size make sense?

Merlin Moncure <mmoncure@gmail.com> writes:

On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

An idea I'd like to investigate is to have PostgreSQL allocate a set of
contiguous buffers to fit a larger i/o size when a block is referenced by
a sequential scan, and then issue a single consolidated i/o request for
them.

Isn't this dealt with at least in part by effective i/o concurrency
and o/s readahead?

I should think so. It's very difficult to predict future block-access
requirements for anything except a seqscan, and for that, we expect the
OS will detect the access pattern and start reading ahead on its own.

Another point here is that you could get some of the hoped-for benefit
just by increasing BLCKSZ ... but nobody's ever demonstrated any
compelling benefit from larger BLCKSZ (except on specialized workloads,
if memory serves).

The big-picture problem with work in this area is that no matter how you
do it, any benefit is likely to be both platform- and workload-specific.
So the prospects for getting a patch accepted aren't all that bright.

regards, tom lane


#4Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Tom Lane (#3)
Re: Does larger i/o size make sense?

2013/8/23 Tom Lane <tgl@sss.pgh.pa.us>:

Merlin Moncure <mmoncure@gmail.com> writes:

On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

An idea I'd like to investigate is to have PostgreSQL allocate a set of
contiguous buffers to fit a larger i/o size when a block is referenced by
a sequential scan, and then issue a single consolidated i/o request for
them.

Isn't this dealt with at least in part by effective i/o concurrency
and o/s readahead?

I should think so. It's very difficult to predict future block-access
requirements for anything except a seqscan, and for that, we expect the
OS will detect the access pattern and start reading ahead on its own.

Another point here is that you could get some of the hoped-for benefit
just by increasing BLCKSZ ... but nobody's ever demonstrated any
compelling benefit from larger BLCKSZ (except on specialized workloads,
if memory serves).

The big-picture problem with work in this area is that no matter how you
do it, any benefit is likely to be both platform- and workload-specific.
So the prospects for getting a patch accepted aren't all that bright.

Hmm. I may have overlooked the effect of readahead at the operating
system level. Indeed, a sequential scan is exactly the kind of workload
that triggers it, so the smaller i/o size at the application level will
be hidden.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>


#5Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#3)
Re: Does larger i/o size make sense?

The big-picture problem with work in this area is that no matter how you
do it, any benefit is likely to be both platform- and workload-specific.
So the prospects for getting a patch accepted aren't all that bright.

Indeed.

Would it make sense to have something easier to configure than recompiling
postgresql and managing a custom executable, say a block size that could
be configured from initdb and/or postgresql.conf, or maybe per-object
settings specified at creation time?

Note that the block size may also affect cache behavior: for instance,
with pure random accesses, more "recently accessed" tuples can be kept in
memory if the pages are smaller. So there are reasons other than I/O
access times to play with the block size, and an option to do that more
easily would help.

--
Fabien.


#6Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Fabien COELHO (#5)
Re: Does larger i/o size make sense?

2013/8/23 Fabien COELHO <coelho@cri.ensmp.fr>:

The big-picture problem with work in this area is that no matter how you
do it, any benefit is likely to be both platform- and workload-specific.
So the prospects for getting a patch accepted aren't all that bright.

Indeed.

Would it make sense to have something easier to configure than recompiling
postgresql and managing a custom executable, say a block size that could
be configured from initdb and/or postgresql.conf, or maybe per-object
settings specified at creation time?

I love the idea of a per-object block size setting according to the
expected workload, perhaps configured by the DBA. When we have to run a
sequential scan on large tables, a larger block size may be less painful
than being interrupted at every 8KB boundary to switch to the next block,
even though random access via index scans prefers a smaller block size.

Note that the block size may also affect cache behavior: for instance,
with pure random accesses, more "recently accessed" tuples can be kept in
memory if the pages are smaller. So there are reasons other than I/O
access times to play with the block size, and an option to do that more
easily would help.

I see. A uniform block size would simplify the implementation, since
there would be no need to worry about a scenario where contiguous buffer
allocation pushes out pages that should be kept in memory.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>


#7Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Kohei KaiGai (#6)
Re: Does larger i/o size make sense?

Would it make sense to have something easier to configure than recompiling
postgresql and managing a custom executable, say a block size that could
be configured from initdb and/or postgresql.conf, or maybe per-object
settings specified at creation time?

I love the idea of a per-object block size setting according to the
expected workload, perhaps configured by the DBA.

My 0.02€: wait and see whether the idea gets some positive feedback from
core people before investing any time in it...

The per-object setting would be a lot of work. A per-initdb (so
per-cluster) setting (block size, WAL size...) would be much easier to
implement, but it impacts the on-disk storage format.

When we have to run a sequential scan on large tables, a larger block
size may be less painful than being interrupted at every 8KB boundary to
switch to the next block, even though random access via index scans
prefers a smaller block size.

Yep, as Tom noted, this is really workload specific.

--
Fabien.

#8Greg Stark
stark@mit.edu
In reply to: Kohei KaiGai (#1)
Re: Does larger i/o size make sense?

On Thu, Aug 22, 2013 at 8:53 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

An idea I'd like to investigate is to have PostgreSQL allocate a set of
contiguous buffers to fit a larger i/o size when a block is referenced by
a sequential scan, and then issue a single consolidated i/o request for
them. This probably makes sense when we can expect upcoming block
references to fall on neighboring blocks, which is the typical sequential
read workload.

I think it makes more sense to use scatter/gather i/o or async i/o to
read into regular-sized buffers scattered around memory than to require
the buffers to be contiguous.

As others said, Postgres depends on the OS buffer cache to do readahead.
The scenario where the above becomes interesting is if it's paired with a
move to directio or other ways of skipping the buffer cache. Double caching
is a huge waste and leads to lots of inefficiencies.
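
(For what it's worth, skipping the OS cache is easy enough at the syscall
level -- the hard part is everything Postgres would then have to do itself.
A minimal Linux-specific sketch with O_DIRECT, which is not something
Postgres does for data files today:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ    8192
#define ALIGNMENT 4096          /* typical logical-block alignment requirement */

int
main(int argc, char **argv)
{
    void   *buf;
    int     fd;
    ssize_t nread;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }

    /* O_DIRECT: reads go to the device without being cached by the kernel */
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* direct i/o requires an aligned buffer */
    if (posix_memalign(&buf, ALIGNMENT, BLCKSZ) != 0)
        return 1;

    nread = pread(fd, buf, BLCKSZ, 0);
    if (nread < 0)
        perror("pread");
    else
        printf("read %zd bytes, bypassing the OS cache\n", nread);

    free(buf);
    close(fd);
    return 0;
}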

The blocking issue there is that Postgres doesn't understand much about
the underlying hardware storage. If there were APIs to find out more about
it from the kernel -- how far it is to the end of the RAID chunk, how much
parallelism it has, how congested the i/o channel is, etc. -- then Postgres
might be on par with the kernel and able to eliminate the double-buffering
inefficiency, and it might even do better, since it understands its own
workload better.

If Postgres did that, it would need to be able to initiate i/o on multiple
buffers in parallel. That can be done with scatter/gather i/o such as
readv() and writev(), but it would mean blocking on reads of blocks that
might not be needed until later. Or it could be done with libaio,
initiating i/o and returning control as soon as the needed data is
available while other i/o is still pending.
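
A rough sketch of the synchronous scatter-read variant: one preadv() call
(Linux/BSD) filling several separate 8KB buffers from consecutive file
blocks, so nothing forces the buffers to be contiguous in memory. A libaio
version would submit the same i/o asynchronously instead of blocking:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ   8192
#define NBLOCKS  8

int
main(int argc, char **argv)
{
    struct iovec iov[NBLOCKS];
    char        *buffers[NBLOCKS];
    ssize_t      nread;
    int          fd, i;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* NBLOCKS separate buffers, scattered around memory */
    for (i = 0; i < NBLOCKS; i++)
    {
        buffers[i] = malloc(BLCKSZ);
        iov[i].iov_base = buffers[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* one system call fills all of them from consecutive file blocks */
    nread = preadv(fd, iov, NBLOCKS, 0);
    if (nread < 0)
        perror("preadv");
    else
        printf("read %zd bytes into %d scattered buffers\n", nread, NBLOCKS);

    for (i = 0; i < NBLOCKS; i++)
        free(buffers[i]);
    close(fd);
    return 0;
}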

--
greg

#9Kevin Grittner
kgrittn@ymail.com
In reply to: Tom Lane (#3)
Re: Does larger i/o size make sense?

Tom Lane <tgl@sss.pgh.pa.us>

Another point here is that you could get some of the hoped-for
benefit just by increasing BLCKSZ ... but nobody's ever
demonstrated any compelling benefit from larger BLCKSZ (except on
specialized workloads, if memory serves).

I think I've seen a handful of reports of performance differences
with different BLCKSZ builds (perhaps not all on community lists).
My recollection is that some people sifting through data in data
warehouse environments see a performance benefit up to 32KB, but
that tests of GiST index performance with different sizes showed
better performance with smaller sizes down to around 2KB.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10Josh Berkus
josh@agliodbs.com
In reply to: Kohei KaiGai (#1)
Re: Does larger i/o size make sense?

Kevin,

I think I've seen a handful of reports of performance differences
with different BLCKSZ builds (perhaps not all on community lists).
My recollection is that some people sifting through data in data
warehouse environments see a performance benefit up to 32KB, but
that tests of GiST index performance with different sizes showed
better performance with smaller sizes down to around 2KB.

I believe that Greenplum currently uses 128K. There's a definite
benefit for the DW use-case.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#11Greg Smith
greg@2ndQuadrant.com
In reply to: Josh Berkus (#10)
Re: Does larger i/o size make sense?

On 8/27/13 3:54 PM, Josh Berkus wrote:

I believe that Greenplum currently uses 128K. There's a definite
benefit for the DW use-case.

Since Linux read-ahead can easily give big gains on fast storage, I
normally set that to at least 4096 sectors = 2048KB. That's a lot
bigger than even this, and definitely necessary for reaching maximum
storage speed.

I don't think that the block size change alone will necessarily
duplicate the gains on seq scans that Greenplum gets though. They've
done a lot more performance optimization on that part of the read path
than just the larger block size.

As far as quantifying whether this is worth chasing, the most useful
thing to do here is find some fast storage and profile the code with
different block sizes at a large read-ahead. I wouldn't spend a minute
on trying to come up with a more complicated management scheme until the
potential gain is measured.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
