pg_largeobject high overhead
Hi,
Some scope-setting: I use postgres to manage metadata about field tests
and simulations; part of that involves HDF5 files. These HDF5 files are
generated both by field testing and by simulations, so there will
eventually be a modest number of them - in the 10k region; an older
database I've not yet migrated holds 3k after a year. The DB is stored on
a 14-drive ZFS raidz2 pool, split over 2 groups of 7 disks of 3TB each,
along with lots of field data (mainly images) kept out of the database.
The rest of the inputs live in the database due to the complexity of
managing them otherwise (versions, variants, garbage collection).
After encountering some 1GB limits while trying to insert 300MB binary
compressed HDF5 files into postgres 9.2 (under transactions) on a 16GB
64-bit linux machine, and coming upon the following thread:
/messages/by-id/CAFj8pRAcfKoiNp2uXeiZOd5kRX29n2ofsoLDh0w6ej7RxKoZyA@mail.gmail.com
I spent some time trying to get things to work as is, raising what limits
I could, to no avail. So I decided to upgrade to 9.3 and use large
objects rather than building another file store, due to the large
convenience of keeping everything in the database. I noticed that my 35GB
of files has become 46GB in the database, and there are 18522822 entries
in pg_largeobject, whereas I only have 257 files ranging from 30MB to
400MB. To reiterate: the data is already compressed via several HDF5
filters, so postgres isn't going to do any better. It looks like there's
30% overhead for using pg_largeobject, which is pretty expensive!
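For what it's worth, both numbers are consistent with pg_largeobject's
2048-byte chunking; a quick sanity check (a sketch - it assumes GB here
means GiB, and takes LOBLKSIZE = BLCKSZ/4 = 2048 from a default build):

```python
# Each large object is split into LOBLKSIZE-byte chunks, one row per
# chunk (LOBLKSIZE = BLCKSZ/4 = 2048 bytes with the default 8K pages).
LOBLKSIZE = 2048

data_bytes = 35 * 1024**3          # ~35 GiB of HDF5 files
expected_rows = data_bytes // LOBLKSIZE
print(expected_rows)               # ~18.35M, close to the observed 18522822

stored_bytes = 46 * 1024**3        # on-disk size after import
overhead = (stored_bytes - data_bytes) / data_bytes
print(f"{overhead:.1%}")           # ~31%, the overhead in question
```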
I also came across this message, which mentions disabling compression and
suggests that could improve the efficiency:
/messages/by-id/CAHyXU0w_tNxBFHuWGHKCZF7GkE_jUQaT0f+dNcgTX8yx0+z3ew@mail.gmail.com
So this leads to the following questions:
1. How can I disable TOAST compression, or whatever is going on in
pg_largeobject?
2. How in the world is a 2KB block size for _large_ binary objects a
reasonable size? As far as I can tell, it is introducing a very large
overhead.
3. Should I be changing LOBLKSIZE if this is the main factor in the 30%
overhead? Is this straightforward? Long term, is that going to come back
to bite me? I could maintain a package on openSUSE's OBS, incorporating a
patch and rebuilding against upstream, but I don't really have a lot of
ongoing time to deal with fallout, if any.
I'd appreciate any insights,
-Jason
Jason Newton <nevion@gmail.com> writes:
> I spent some time trying to get things to work as is, raising what
> limits I could, to no avail. So I decided to upgrade to 9.3 and use
> large objects rather than building another file store, due to the large
> convenience of keeping everything in the database. I noticed that my
> 35GB of files has become 46GB in the database, and there are 18522822
> entries in pg_largeobject, whereas I only have 257 files ranging from
> 30MB to 400MB. To reiterate: the data is already compressed via several
> HDF5 filters, so postgres isn't going to do any better. It looks like
> there's 30% overhead for using pg_largeobject, which is pretty expensive!
If your input data is uniformly incompressible, that's not too surprising.
pg_largeobject tuples hold BLCKSZ/4 bytes of data, plus some overhead, so
the only way that 4 of them will fit on a page is if compression saves
more than the overhead. You only need a couple percent compression
savings to make that work, but precompressed data might resist being
compressed even that much. So you end up with only 3 tuples per page
which makes it right about 33.33% overhead.
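That arithmetic can be sketched as follows (the ~28 bytes of per-tuple
header plus line pointer is a rough figure, not an exact one, but it is
why a fourth uncompressed 2K chunk no longer fits on a default 8K page):

```python
BLCKSZ = 8192
LOBLKSIZE = BLCKSZ // 4            # 2048-byte chunks
TUPLE_OVERHEAD = 28                # rough tuple header + line pointer (assumption)
PAGE_HEADER = 24

# Four uncompressed chunks plus their headers overflow the page...
need4 = PAGE_HEADER + 4 * (LOBLKSIZE + TUPLE_OVERHEAD)
print(need4 > BLCKSZ)              # True: only 3 tuples fit

# ...so each page carries only 3 * 2048 bytes of payload:
payload = 3 * LOBLKSIZE
overhead = (BLCKSZ - payload) / payload
print(f"{overhead:.2%}")           # 33.33%
```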
> 1. How can I disable TOAST compression or whatever is going on in
> pg_largeobject?
The problem is that you're not getting any compression, so "disabling"
it won't help.
> 2. How in the world is a 2KB block size for _large_ binary objects a
> reasonable size? As far as I can tell, it is introducing a very large
> overhead.
According to the commentary in the source code (which you've evidently
read to some extent), a smaller block size was chosen to reduce the
overhead of partial updates of large-object data. You're probably not
going to do that, but PG doesn't know it. In any case, you can't choose
a LOBLKSIZE exceeding, or even equaling, the page size; so there's not
room for a huge change here. According to the above analysis, if you
want to pack more LO data per page, you'd actually be better off with
a *smaller* LOBLKSIZE to limit the per-page wastage.
> 3. Should I be changing LOBLKSIZE if this is the main factor in the 30%
> overhead? Is this straightforward? Long term, is that going to come back
> to bite me?
Yeah, it could, mainly because accidentally running a postmaster with
the wrong LOBLKSIZE compiled in would be disastrous. We should have a
cross-check on that value, and I'll see to it that one gets into future
releases, but it's not there today.
What would likely be the best bet for you is to run a modified build with
LOBLKSIZE held at 2K and BLCKSZ bumped up to 32K. That would cut the
wastage from 2K per 8K to 2K per 32K. Operationally, it'd be safer since
there *is* a pg_control cross-check on BLCKSZ, so you could not
accidentally start a standard build against the database.
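The wastage numbers for the two builds work out like this (a sketch,
assuming - as in the analysis above - that headers cost roughly one 2K
chunk slot per page):

```python
LOBLKSIZE = 2048

def lo_overhead(blcksz, loblksize=LOBLKSIZE):
    # Per the analysis above, tuple/page headers cost roughly one chunk
    # slot, so a page holds one fewer chunk than blcksz // loblksize.
    chunks = blcksz // loblksize - 1
    payload = chunks * loblksize
    return (blcksz - payload) / payload

print(f"{lo_overhead(8192):.1%}")   # 33.3% with the standard 8K pages
print(f"{lo_overhead(32768):.1%}")  # 6.7% with BLCKSZ bumped to 32K
```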
Or you could just live with it. 10GB of disk is cheap insurance against
human error, and you're definitely taking some risk of human error if
you run a locally-patched build.
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Wed, Jun 4, 2014 at 9:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If your input data is uniformly incompressible, that's not too surprising.
> pg_largeobject tuples hold BLCKSZ/4 bytes of data, plus some overhead, so
> the only way that 4 of them will fit on a page is if compression saves
> more than the overhead. You only need a couple percent compression
> savings to make that work, but precompressed data might resist being
> compressed even that much. So you end up with only 3 tuples per page
> which makes it right about 33.33% overhead.

I have several filters compressing the data (shuffle, nbit, gzip). I
could just disable gzip. Would this result in more efficient storage
(i.e. less total GB used) beyond the per-block utilization bump? If that
works, it would be counterintuitive and would use more bandwidth (which
I'm OK with), so it's a strange tradeoff.
> According to the commentary in the source code (which you've evidently
> read to some extent), a smaller block size was chosen to reduce the
> overhead of partial updates of large-object data. You're probably not
> going to do that, but PG doesn't know it. In any case, you can't choose
> a LOBLKSIZE exceeding, or even equaling, the page size; so there's not
> room for a huge change here. According to the above analysis, if you
> want to pack more LO data per page, you'd actually be better off with
> a *smaller* LOBLKSIZE to limit the per-page wastage.
I only looked at a header briefly :-). The rest was from the docs:
http://www.postgresql.org/docs/9.3/static/catalog-pg-largeobject.html
> What would likely be the best bet for you is to run a modified build with
> LOBLKSIZE held at 2K and BLCKSZ bumped up to 32K. That would cut the
> wastage from 2K per 8K to 2K per 32K. Operationally, it'd be safer since
> there *is* a pg_control cross-check on BLCKSZ, so you could not
> accidentally start a standard build against the database.
>
> Or you could just live with it. 10GB of disk is cheap insurance against
> human error, and you're definitely taking some risk of human error if
> you run a locally-patched build.
The overhead is noticeable since storage for me is a little hard to come
by (budget-limited R&D), but I respect that point. I think something
should be added in the future, though, to match OS file storage
efficiency, given the new lo limits are 4TB per entry - 30% is hefty
overhead. Perhaps multiple large object tables, so the user can change
the respective table's properties, or whatever parameters are needed
being runtime-specifiable, to make this more efficient and get higher
utilization.
I may experiment with some custom builds in the future just to see how it
goes.
-Jason