Raw device on PostgreSQL
Hey,
for a university project I'm currently doing some research on
PostgreSQL. I was wondering whether it would hypothetically be possible
to implement a raw device system in PostgreSQL. I know that the
disadvantages would probably outweigh the advantages compared to
working with the file system. Just hypothetically: would it be possible
to change the source code of PostgreSQL so that a raw device system
could be implemented, or would that cause a chain reaction such that one
would basically have to rewrite almost the entire code base, because too
many parts of PostgreSQL rely on the file system?
Best regards
* Benjamin Schaller (benjamin.schaller@s2018.tu-chemnitz.de) wrote:
for a university project I'm currently doing some research on PostgreSQL. I
was wondering if hypothetically it would be possible to implement a raw
device system in PostgreSQL. [snip]
Yes, it'd be possible; no, you wouldn't have to rewrite all of PG.
Instead, if you want it to be performant at all, you'd have to write
lots of new code to do all the things the filesystem and kernel do for
us today.
Thanks,
Stephen
On 4/28/20 10:43 AM, Benjamin Schaller wrote:
for a university project I'm currently doing some research on
PostgreSQL. I was wondering if hypothetically it would be possible to
implement a raw device system in PostgreSQL. [snip]
It would require quite a bit of work since 1) PostgreSQL stores its data
in multiple files and 2) PostgreSQL currently supports only synchronous
buffered IO.
To get the performance benefits from using raw devices I think you would
want to add support for asynchronous IO to PostgreSQL rather than
implementing your own layer to emulate the kernel's buffered IO.
Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
was not recorded but the slides are available.
https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
Andreas
On Tue, Apr 28, 2020 at 02:10:51PM +0200, Andreas Karlsson wrote:
On 4/28/20 10:43 AM, Benjamin Schaller wrote:
for a university project I'm currently doing some research on
PostgreSQL. [snip]
It would require quite a bit of work since 1) PostgreSQL stores its
data in multiple files and 2) PostgreSQL currently supports only
synchronous buffered IO.
Not sure how that's related to raw devices, which is what Benjamin was
asking about. AFAICS most of the changes would be in smgr.c and md.c,
but I might be wrong.
I'd imagine supporting raw devices would require implementing some sort
of custom file system on the device, and I'd expect it to work with
relation segments just fine. So why would that be a problem?
The synchronous buffered I/O is a bigger challenge, I guess, but then
again - you could continue using synchronous I/O even with raw devices.
To get the performance benefits from using raw devices I think you
would want to add support for asynchronous IO to PostgreSQL rather
than implementing your own layer to emulate the kernel's buffered IO.
Andres Freund did a talk on async IO in PostgreSQL earlier this year.
It was not recorded but the slides are available.
https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
Yeah, I think the question is what are the expected benefits of using
raw devices. It might be an interesting exercise / experiment, but my
understanding is that most of the benefits can be achieved by using file
systems but with direct I/O and async I/O, which would allow us to
continue reusing the existing filesystem code with much less disruption
to our code base.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 28, 2020 at 8:10 AM Andreas Karlsson <andreas@proxel.se> wrote:
It would require quite a bit of work since 1) PostgreSQL stores its data
in multiple files and 2) PostgreSQL currently supports only synchronous
buffered IO.
To get the performance benefits from using raw devices I think you would
want to add support for asynchronous IO to PostgreSQL rather than
implementing your own layer to emulate the kernel's buffered IO.
Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
was not recorded but the slides are available.
https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
FWIW, in 2007/2008, when I was at EnterpriseDB, Inaam Rana and I
implemented a benchmarkable proof-of-concept patch for direct I/O and
asynchronous I/O (for libaio and POSIX). We made that patch public, so it
should be on the list somewhere. But, we began to run into performance
issues related to buffer manager scaling in terms of locking and,
specifically, replacement. We began prototyping alternate buffer managers
(going back to the old MRU/LRU model with midpoint insertion and testing a
2Q variant) but that wasn't public. I had also prototyped raw device
support, which is a good amount of work and required implementing a custom
filesystem (similar to Oracle's ASM) within the storage manager. It's
probably a bit harder now than it was then, given the number of different
types of file access.
--
Jonah H. Harris
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
Yeah, I think the question is what are the expected benefits of using
raw devices. [snip]
There's another very large problem with using raw devices: on pretty
much every platform, you don't get to do that without running as root.
It is not easy to express how hard a sell it would be to even consider
allowing Postgres to run as root. Between the security issues, and
the generally poor return-on-investment we'd get from reinventing
our own filesystem and I/O scheduler, I just don't see this sort of
thing ever going forward. Direct and/or async I/O seems a lot more
plausible.
regards, tom lane
On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Yeah, I think the question is what are the expected benefits of using
raw devices. [snip]
Agreed.
I've often wondered if the RDBMSs that supported raw devices did so
*because* there was no other way to get unbuffered I/O on some systems
at the time (for example, it looks like Solaris didn't have direct I/O
until 2.6 in 1997?). Last I heard, raw devices weren't recommended
anymore on the system I'm thinking of, because they're more painful to
manage than regular filesystems and there's little to no gain.

Back in ancient times, before BSD 4.2 introduced it in 1983, there was
apparently no fsync() system call on any strain of Unix, so I guess
database reliability must have been an uphill battle on early Unix
buffered I/O (I wonder if the Ingres/Postgres people asked them to add
that?!). It must have been very appealing to sidestep the whole thing,
for multiple reasons.

One key thing to note is that the well known RDBMSs that can use raw
devices also deal with regular filesystems by creating one or more
large data files, and then manage the space inside those to hold all
their tables and indexes. That is, they already have their own system
to manage separate database objects, allocate space, etc., and don't
have to do any regular filesystem meta-data manipulation during
transactions (which has all kinds of problems). That means they
already have the complicated code that you need to do that, but we
don't: we have one (or more) file per table or index, so our database
relies on the filesystem as a kind of lower-level database of
relfilenode->blocks. That's probably the main work required to make
this work, and it might be a valuable thing to have independently of
whether you stick it on a raw device, a big data file, NV RAM or some
other kind of storage system -- but it's a really difficult project.
On Wed, Apr 29, 2020 at 8:34 PM Jonah H. Harris <jonah.harris@gmail.com>
wrote:
On Tue, Apr 28, 2020 at 8:10 AM Andreas Karlsson <andreas@proxel.se>
wrote:
To get the performance benefits from using raw devices I think you would
want to add support for asynchronous IO to PostgreSQL rather than
implementing your own layer to emulate the kernel's buffered IO.
Andres Freund did a talk on async IO in PostgreSQL earlier this year. It
was not recorded but the slides are available.
https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
FWIW, in 2007/2008, when I was at EnterpriseDB, Inaam Rana and I
implemented a benchmarkable proof-of-concept patch for direct I/O and
asynchronous I/O (for libaio and POSIX). [snip]
Here's a hack job merge of that preliminary PoC AIO/DIO patch against
13devel. This was designed to keep the buffer manager clean using AIO and
is write-only. I'll have to dig through some of my other old Postgres 8.x
patches to find the AIO-based prefetching version with aio_req_t modified
to handle read vs. write in FileAIO. Also, this will likely have an issue
with O_DIRECT as additional buffer manager alignment is needed and I
haven't tracked it down in 13 yet. As my default development is on a Mac, I
have POSIX AIO only. As such, I can't natively play with the O_DIRECT or
libaio paths to see if they work without going into Docker or VirtualBox -
and I don't care that much right now :)
The code is nasty, but maybe it will give someone ideas. If I get some time
to work on it, I'll rewrite it properly.
--
Jonah H. Harris
Attachments:
13dev_aiodio_latest.patch (+635 −29)
On 30/4/20 6:22, Thomas Munro wrote:
On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Yeah, I think the question is what are the expected benefits of using
raw devices. [snip]
Agreed.
[snip] That's probably the main work
required to make this work, and might be a valuable thing to have
independently of whether you stick it on a raw device, a big data
file, NV RAM
^^^^^^ THIS, with NV DIMMs / PMEM (persistent memory) possibly
becoming a hot topic in the not-too-distant future
or some other kind of storage system -- but it's a really
difficult project.
Indeed.... But you might have already pointed out the *only* required
feature for this to work: a "database" mapping relfilenode ---which is
actually an int, or rather a tuple (relfilenode, segment) where both
components are currently 32-bit: that is, a 64-bit "objectID" of sorts---
to a "set of extents" ---yes, extents, not blocks: sequential I/O is still
faster than random I/O in all known storage/persistent (vs RAM) systems---
where the current I/O primitives would be able to write.
Some conversion from "absolute" (within the "file") to "relative"
(within the "tablespace") offsets would need to happen before delegating
to the kernel... or even before dereferencing a pointer into an mmap'd
region!, but not much more, ISTM (but I'm far from an expert in this area).
Off the top of my head:
CREATE TABLESPACE tblspcname [other_options] LOCATION '/dev/nvme1n2'
WITH (kind=raw, extent_min=4MB);
or something similar to that approach might do it.
Please note that I have purposefully specified "namespace 2" in an
"enterprise" NVME device, to show the possibility.
OR
use some filesystem (e.g. XFS) with DAX [1] (mount -o dax) where
available, along with something equivalent to WITH (kind=mmaped)
... though the locking we currently get "for free" from the kernel would
need to be replaced by something else.
Indeed it seems like an enormous amount of work... but it may well pay
off. I can't fully assess the effort, though.
Just my .02€
[1]: https://www.kernel.org/doc/Documentation/filesystems/dax.txt
Thanks,
/ J.L.
On Fri, May 1, 2020 at 12:28 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
Also, this will likely have an issue with O_DIRECT as additional buffer manager alignment is needed and I haven't tracked it down in 13 yet. As my default development is on a Mac, I have POSIX AIO only. As such, I can't natively play with the O_DIRECT or libaio paths to see if they work without going into Docker or VirtualBox - and I don't care that much right now :)
Andres is prototyping with io_uring, which supersedes libaio and can
do much more stuff, notably buffered and unbuffered I/O; there's no
point in looking at libaio. I agree that we should definitely support
POSIX AIO, because that gets you macOS, FreeBSD, NetBSD, AIX, HPUX
with one effort (those are the systems that use either kernel threads
or true async I/O down to the driver; Solaris and Linux also provide
POSIX AIO, but it's emulated with user space threads, which probably
wouldn't work well for our multi process design). The third API that
we'd want to support is Windows overlapped I/O with completion ports.
With those three APIs you can hit all systems in our build farm except
Solaris and OpenBSD, so they'd still use synchronous I/O (though we
could do our own emulation with worker processes pretty easily).
On Fri, May 1, 2020 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, May 1, 2020 at 12:28 PM Jonah H. Harris <jonah.harris@gmail.com>
wrote:
Also, this will likely have an issue with O_DIRECT as additional buffer
manager alignment is needed and I haven't tracked it down in 13 yet. [snip]
Andres is prototyping with io_uring, which supersedes libaio and can
do much more stuff, notably buffered and unbuffered I/O; there's no
point in looking at libaio. [snip]
Is it public? I saw the presentations, but couldn't find that patch
anywhere.
--
Jonah H. Harris