Raw device I/O for large objects

Started by Georgi Chulkov over 18 years ago, 8 messages
#1Georgi Chulkov
godji@metapenguin.org

Hello,

I am a graduate student of computer science and I have been looking at
PostgreSQL for my master's thesis work.

I am looking into implementing raw device I/O for large objects into
PostgreSQL (maybe for all storage, I'm not sure which would be
easier/better). I am extremely new to the codebase, however.

Could someone please point me to the right places to look at, and how/where to
get started? Would such a development be useful at all? Is anyone working on
anything related?

Any feedback / information would be highly appreciated!

Thanks,
Georgi

#2Sibte Abbas
sibtay@gmail.com
In reply to: Georgi Chulkov (#1)
Re: Raw device I/O for large objects

On 9/17/07, Georgi Chulkov <godji@metapenguin.org> wrote:

> Could someone please point me to the right places to look at, and how/where to
> get started? Would such a development be useful at all? Is anyone working on
> anything related?
>
> Any feedback / information would be highly appreciated!

http://www.postgresql.org/docs/techdocs
http://www.postgresql.org/docs/faq/

The PostgreSQL documentation:
http://www.postgresql.org/docs/8.2/interactive/index.html

Also, if you have the source, the src/tools/backend directory has some
useful material for starters.

regards,
--
Sibte Abbas

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Georgi Chulkov (#1)
Re: Raw device I/O for large objects

Georgi Chulkov <godji@metapenguin.org> writes:

> I am looking into implementing raw device I/O for large objects into
> PostgreSQL (maybe for all storage, I'm not sure which would be
> easier/better).

We've heard this idea proposed before, and it's been shot down as a poor
use of development effort every time. Check the archives for previous
threads, but the basic argument goes like this: when Oracle et al did
that twenty years ago, it was a good idea because (1) operating systems
tended to have sucky filesystems, (2) performance and reliability
properties of same were not very consistent across platforms, and (3)
being large commercial software vendors they could afford to throw lots
of warm bodies at anything that seemed like a bottleneck. None of those
arguments holds up well for us today however. If you think you want to
reimplement a filesystem you need to have some pretty concrete reasons
why you can outsmart all the smart folks who have worked on
your-favorite-OS's filesystems for lo these many years. There's also
the fact that on any reasonably modern disk hardware, "raw I/O" is
anything but.

My opinion is that there is lots of lower-hanging fruit elsewhere.
You can find some ideas on our TODO list, or troll the pghackers
list archives for other discussions.

regards, tom lane

#4Georgi Chulkov
godji@metapenguin.org
In reply to: Tom Lane (#3)
Re: Raw device I/O for large objects

Hi,

> We've heard this idea proposed before, and it's been shot down as a poor
> use of development effort every time. Check the archives for previous
> threads, but the basic argument goes like this: when Oracle et al did
> that twenty years ago, it was a good idea because (1) operating systems
> tended to have sucky filesystems, (2) performance and reliability
> properties of same were not very consistent across platforms, and (3)
> being large commercial software vendors they could afford to throw lots
> of warm bodies at anything that seemed like a bottleneck. None of those
> arguments holds up well for us today however. If you think you want to
> reimplement a filesystem you need to have some pretty concrete reasons
> why you can outsmart all the smart folks who have worked on
> your-favorite-OS's filesystems for lo these many years. There's also
> the fact that on any reasonably modern disk hardware, "raw I/O" is
> anything but.

Thanks, I agree with all your arguments.

Here's the reason why I'm looking at raw device storage for large objects only
(as opposed to all tables): with raw device I/O I can control, to an extent,
spatial locality. So, if I have an application that wants to store N large
objects (totaling several gigabytes) and read them back in some order that is
well-known in advance, I could store my large objects in that order on the
raw device.* Sequentially reading them back would then be very efficient.
With a file system underneath, I don't have that freedom. (Such a scenario
occurs with raster databases, for example.)

* assuming I have a way to communicate these requirements; that's a whole new
problem
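
As a rough illustration of the layout described above, the following sketch assigns each object a block-aligned offset on a device in the order it will later be read. The function name `plan_layout`, the object names, and the 8192-byte block size are assumptions made for the example; nothing like this exists in PostgreSQL.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 8192  # assumed device block size in bytes (illustrative only)


def plan_layout(sizes: Dict[str, int], read_order: List[str],
                block_size: int = BLOCK_SIZE) -> Dict[str, Tuple[int, int]]:
    """Assign each object a block-aligned (offset, length) in read order."""
    layout: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for name in read_order:
        length = sizes[name]
        layout[name] = (offset, length)
        # Advance to the next block boundary after this object.
        blocks = (length + block_size - 1) // block_size
        offset += blocks * block_size
    return layout


# Hypothetical raster tiles, stored in the order they will be read back.
sizes = {"tile_a": 10_000, "tile_b": 8_192, "tile_c": 1}
layout = plan_layout(sizes, ["tile_c", "tile_a", "tile_b"])
```

Reading the objects back in `read_order` then touches strictly increasing offsets, which is the sequential access pattern the scenario relies on.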

Please allow me to ask then:
1. In your opinion, would the above scenario indeed benefit from a raw-device
interface for large objects?
2. How feasible is it to decouple general table storage from large object
storage?

Thank you for your time,

Georgi

#5Luke Lonergan
LLonergan@greenplum.com
In reply to: Georgi Chulkov (#4)
Re: Raw device I/O for large objects

Index organized tables would do this and it would be a generic capability.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Georgi Chulkov [mailto:godji@metapenguin.org]
Sent: Monday, September 17, 2007 11:50 PM Eastern Standard Time
To: Tom Lane
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Raw device I/O for large objects

#6Markus Schiltknecht
markus@bluegap.ch
In reply to: Georgi Chulkov (#4)
Re: Raw device I/O for large objects

Hi,

Georgi Chulkov wrote:

> Please allow me to ask then:
> 1. In your opinion, would the above scenario indeed benefit from a raw-device
> interface for large objects?

No, because file systems also try to do what you outline above. They
certainly don't split sequential data up into blocks and distribute them
randomly over the device, at least not without having a pretty good
reason to do so (with which you'd also have to fight).

The possible gain achievable is pretty minimal, especially in
conjunction with a (hopefully battery backed) write cache.

> 2. How feasible is it to decouple general table storage from large object
> storage?

I think that would be the easiest part. I would go for a pluggable
storage implementation, selectable per tablespace. But then again, I
wouldn't do it at all. After all, this is what MySQL is doing. And we
certainly don't want to repeat their mistakes! Or do you know anybody
who goes like: "Yippee, multiple storage engines to choose from for my
(un)valuable data, let's put some here and others there...".

Let's optimize the *one* storage engine we have and try to make that
work well together with the various filesystems it uses. Because
filesystems are already very good at what they are used for. (And we are
glad we can use a filesystem and don't need to implement one ourselves).

Regards

Markus

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Georgi Chulkov (#4)
Re: Raw device I/O for large objects

Georgi Chulkov <godji@metapenguin.org> writes:

> Here's the reason why I'm looking at raw device storage for large objects only
> (as opposed to all tables): with raw device I/O I can control, to an extent,
> spatial locality. So, if I have an application that wants to store N large
> objects (totaling several gigabytes) and read them back in some order that is
> well-known in advance, I could store my large objects in that order on the
> raw device.* Sequentially reading them back would then be very efficient.
> With a file system underneath, I don't have that freedom. (Such a scenario
> occurs with raster databases, for example.)

Not sure I buy that argument. If you have loaded these large objects in
the desired order, then the data will be consecutively located in
pg_largeobject, and if the underlying filesystem is at all sane about
where it extends a growing file, the data will be pretty much
consecutive on disk too. You could probably get marginal improvements
by cutting out the middleman but I'm not sure there's reason to think
there'd be spectacular improvements.
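
A toy model of the point above, assuming the simplest case: large objects are split into fixed-size chunks and each chunk is appended to a growing table in load order. The 2048-byte chunk size matches PostgreSQL's default LOBLKSIZE (BLCKSZ / 4), and the `(loid, pageno, data)` row shape mirrors pg_largeobject, but the list-based heap and the helper names are assumptions for illustration. The model ignores free-space reuse, updates, and MVCC.

```python
from typing import List, Tuple

LOBLKSIZE = 2048  # PostgreSQL's default large-object chunk size (BLCKSZ / 4)

Row = Tuple[int, int, bytes]  # (loid, pageno, data), as in pg_largeobject


def load_objects(objects: List[Tuple[int, bytes]]) -> List[Row]:
    """Append chunk rows in load order; the list stands in for the heap file."""
    heap: List[Row] = []
    for loid, payload in objects:
        for pageno, start in enumerate(range(0, len(payload), LOBLKSIZE)):
            heap.append((loid, pageno, payload[start:start + LOBLKSIZE]))
    return heap


def heap_positions(heap: List[Row], loid: int) -> List[int]:
    """Positions in the toy heap that hold chunks of the given object."""
    return [i for i, row in enumerate(heap) if row[0] == loid]


# Load three hypothetical objects in the order they will later be read.
heap = load_objects([(101, b"a" * 5000), (102, b"b" * 2048), (103, b"c" * 3000)])
read_trace = [pos for loid in (101, 102, 103) for pos in heap_positions(heap, loid)]
```

In this model `read_trace` walks the heap front to back, so as long as the filesystem extends the file contiguously, reading in load order is already a sequential scan.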

> Please allow me to ask then:
> 1. In your opinion, would the above scenario indeed benefit from a raw-device
> interface for large objects?

I don't say it wouldn't benefit. What I'm questioning is the size of
the benefit compared to the amount of work required to get it.
"Supporting raw I/O" is not some trivial bit of work --- you essentially
have to reimplement your own filesystem, because like it or not you
*do* have to think about space management. If we went in this direction
we'd be buying into a lot of work, not to mention a lot of ongoing
portability headaches. So far no one's been able to make a case that
it's worth that level of effort.
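
To make the space-management point concrete, here is a minimal sketch of the kind of bookkeeping a raw-device layer would need at the very least: a first-fit extent allocator with a free list and merging of adjacent free extents. The class `ExtentAllocator` and its methods are hypothetical. A real implementation would also need crash-safe persistence of this state, which is a large part of the work being described.

```python
from typing import List, Tuple


class ExtentAllocator:
    """First-fit allocator over a device of `total_blocks` blocks."""

    def __init__(self, total_blocks: int) -> None:
        # Free extents as (start, length), kept sorted by start block.
        self.free: List[Tuple[int, int]] = [(0, total_blocks)]

    def allocate(self, nblocks: int) -> int:
        """Return the start block of a contiguous run of `nblocks` blocks."""
        for i, (start, length) in enumerate(self.free):
            if length >= nblocks:
                if length == nblocks:
                    del self.free[i]
                else:
                    self.free[i] = (start + nblocks, length - nblocks)
                return start
        raise RuntimeError("no contiguous free extent large enough")

    def release(self, start: int, nblocks: int) -> None:
        """Return an extent to the free list and merge adjacent extents."""
        self.free.append((start, nblocks))
        self.free.sort()
        merged = [self.free[0]]
        for s, n in self.free[1:]:
            last_start, last_len = merged[-1]
            if last_start + last_len == s:
                merged[-1] = (last_start, last_len + n)
            else:
                merged.append((s, n))
        self.free = merged


alloc = ExtentAllocator(100)
a = alloc.allocate(10)
b = alloc.allocate(20)
c = alloc.allocate(70)
alloc.release(a, 10)
alloc.release(c, 70)
print(alloc.free)  # [(0, 10), (30, 70)]
```

At this point 80 blocks are free but split in two, so a request for 80 contiguous blocks fails. Fragmentation of this kind is exactly what a filesystem already manages.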

> 2. How feasible is it to decouple general table storage from large object
> storage?

You might try digging into the original POSTGRES sources --- at one time
there were several different large-object APIs. I'm not sure if they
exposed them just as different sets of access functions or if there was
something more elegant. My own feeling though is that you probably
don't want to go that way, because with outside-the-database storage you
lose transactional behavior (unless you're up for reinventing that
wheel too). I'd try replacing md.c, or maybe resurrecting smgr.c as
something that can really switch between more than one underlying
storage manager.
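
The idea of a storage manager switch can be sketched as a small block-level interface with interchangeable backends selected through a table. This is only an illustration in Python, not PostgreSQL's actual smgr API: the names `StorageManager`, `FileStorage`, `MemoryStorage`, `SMGR_TABLE`, and `smgr_open` are all assumptions. `FileStorage` plays the role of a file-backed manager like md.c, and `MemoryStorage` stands in for whatever second backend (for example a raw device) one would plug in.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict

BLCKSZ = 8192  # PostgreSQL's default block size


class StorageManager(ABC):
    """Block-level interface that every backend must implement."""

    @abstractmethod
    def write_block(self, blkno: int, data: bytes) -> None: ...

    @abstractmethod
    def read_block(self, blkno: int) -> bytes: ...


class FileStorage(StorageManager):
    """Stores block `blkno` at byte offset blkno * BLCKSZ in an ordinary file."""

    def __init__(self, path: str) -> None:
        flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
        self.fd = os.open(path, flags, 0o600)

    def write_block(self, blkno: int, data: bytes) -> None:
        assert len(data) == BLCKSZ
        os.lseek(self.fd, blkno * BLCKSZ, os.SEEK_SET)
        os.write(self.fd, data)

    def read_block(self, blkno: int) -> bytes:
        os.lseek(self.fd, blkno * BLCKSZ, os.SEEK_SET)
        # Blocks beyond end of file read back as zeros.
        return os.read(self.fd, BLCKSZ).ljust(BLCKSZ, b"\0")


class MemoryStorage(StorageManager):
    """Stand-in for a second backend; keeps blocks in a dict."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}

    def write_block(self, blkno: int, data: bytes) -> None:
        assert len(data) == BLCKSZ
        self.blocks[blkno] = data

    def read_block(self, blkno: int) -> bytes:
        return self.blocks.get(blkno, b"\0" * BLCKSZ)


# The "switch": callers pick a backend by name and use only the interface.
SMGR_TABLE = {"file": FileStorage, "memory": MemoryStorage}


def smgr_open(kind: str, *args) -> StorageManager:
    return SMGR_TABLE[kind](*args)


page = b"x" * BLCKSZ
mem = smgr_open("memory")
mem.write_block(3, page)
```

Because everything above the interface sees only `read_block` and `write_block`, transactional behavior implemented at higher layers would be unaffected by which backend is chosen, which is the advantage of swapping at this level over storing data outside the database.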

regards, tom lane

#8Georgi Chulkov
godji@metapenguin.org
In reply to: Tom Lane (#7)
Re: Raw device I/O for large objects

Thank you everyone for your valuable input! I will have a look at some other
part of PostgreSQL, and maybe find something else to do instead.

Best,
Georgi