extensible external toast tuple support

Started by Andres Freund · almost 13 years ago · 37 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

In
http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de
I presented the need for 'indirect' toast tuples which point into memory
instead of a toast table. In the comments on that proposal, and in
off-list and in-person talks, the wish has been voiced to make this a
more general concept.

The previous patch used varattrib_1b_e.va_len_1be to discern between
different types of external tuples. That obviously only works if the
data sizes of all possibly stored datum types are distinct which isn't
nice. So what the newer patch now does is to rename that field into
'va_tag' and decide based on that what kind of Datum we have. To get the
actual length of that datum there now is a VARTAG_SIZE() macro which
maps the tags back to size.
To keep on-disk compatibility the size of an external toast tuple
containing a varatt_external is used as its tag value.

This should allow for fairly easy development of a new compression
scheme for out-of-line toast tuples. It will *not* work for compressed
inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
problem or that if it is, that it cannot be solved separately.

FWIW, in some quick microbenchmarks I couldn't find any performance
difference due to the slightly more complex size computation which I do
*not* find surprising.

Opinions?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-support-for-multiple-kinds-of-external-toast-dat.patch (text/x-patch; charset=us-ascii, +183-33)
#2 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: extensible external toast tuple support

On Thu, May 30, 2013 at 7:42 AM, Andres Freund <andres@2ndquadrant.com> wrote:

In
http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de
I presented the need for 'indirect' toast tuples which point into memory
instead of a toast table. In the comments on that proposal, and in
off-list and in-person talks, the wish has been voiced to make this a
more general concept.

The previous patch used varattrib_1b_e.va_len_1be to discern between
different types of external tuples. That obviously only works if the
data sizes of all possibly stored datum types are distinct which isn't
nice. So what the newer patch now does is to rename that field into
'va_tag' and decide based on that what kind of Datum we have. To get the
actual length of that datum there now is a VARTAG_SIZE() macro which
maps the tags back to size.
To keep on-disk compatibility the size of an external toast tuple
containing a varatt_external is used as its tag value.

This should allow for fairly easy development of a new compression
scheme for out-of-line toast tuples. It will *not* work for compressed
inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
problem or that if it is, that it cannot be solved separately.

FWIW, in some quick microbenchmarks I couldn't find any performance
difference due to the slightly more complex size computation which I do
*not* find surprising.

Opinions?

Seems pretty sensible to me. The patch is obviously WIP but the
direction seems fine to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: extensible external toast tuple support & snappy prototype

On 2013-05-31 23:42:51 -0400, Robert Haas wrote:

This should allow for fairly easy development of a new compression
scheme for out-of-line toast tuples. It will *not* work for compressed
inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
problem or that if it is, that it cannot be solved separately.

Seems pretty sensible to me. The patch is obviously WIP but the
direction seems fine to me.

So, I played a bit more with this, with an eye towards getting this into
a non WIP state, but: While I still think the method for providing
indirect external Datum support is fine, I don't think my sketch for
providing extensible compression is.

As mentioned upthread we also have compressed datums inline as
VARATT_4B_C datums. The way toast_insert_or_update() works is that when
it finds it needs to shrink a Datum it tries to compress it *inline* and
only if that still is too big does it get stored out of line. Changing
that doesn't sound like a good idea since it a) would make an already
complicated function even more complicated and b) would likely make the
whole thing slower since we would frequently compress with two different
methods.
So I think for compressed datums we need an independent trick that also
works for inline compressed tuples:

The current way 4B_C datums work is that they are basically a normal 4B
Datum (but discernible by a different bit in the non-length part of the
length). Such compressed Datums store the uncompressed length of a Datum
in their first 4 bytes. Since we cannot have uncompressed Datums longer
than 1GB due to varlena limitations, 2 bits in that length are free to
discern different compression algorithms.
So what my (absolutely prototype) patch does is to use those two bits
for exactly that. Currently it simply assumes that '00' is pglz while
'01' is snappy-c. That would leave us with two other possible algorithms
('10' and '11'), but we could easily enough extend that to more
algorithms if we want by not regarding the first 4 bytes as a length
word but as the compression algorithm indicator if the two high bits are
set.

So, before we go even more into details here are some benchmark results
based on playing with a partial dump (1.1GB) of the public pg mailing
list archives (Thanks Alvaro!):

BEGIN;
SET toast_compression_algo = 0; -- pglz
CREATE TABLE messages ( ... );
\i ~/tmp/messages.sane.dump
Time: 43053.786 ms
ALTER TABLE messages RENAME TO messages_pglz;
COMMIT;

BEGIN;
SET toast_compression_algo = 1; -- snappy
CREATE TABLE messages ( ... );
\i ~/tmp/messages.sane.dump
Time: 30340.210 ms
ALTER TABLE messages RENAME TO messages_snappy;
COMMIT;

postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-----------------+-------+--------+--------+-------------
public | messages_pglz | table | andres | 526 MB |
public | messages_snappy | table | andres | 523 MB |

Ok, so while the data size didn't change all that much the compression
was quite noticeably faster. With snappy the most visible bottleneck is
COPY not compression although it's still in the top 3 functions...

So what about data reading?

postgres=# COPY messages_pglz TO '/dev/null' WITH BINARY;
COPY 86953
Time: 3825.241 ms
postgres=# COPY messages_snappy TO '/dev/null' WITH BINARY;
COPY 86953
Time: 3674.844 ms

Ok, so here the performance difference is relatively small. Turns out
that's because most of the time is spent in the output routines, even
though we are using BINARY mode. tsvector_send is expensive.

postgres=# COPY (SELECT rawtxt FROM messages_pglz) TO '/dev/null' WITH BINARY;
COPY 86953
Time: 2180.512 ms
postgres=# COPY (SELECT rawtxt FROM messages_snappy) TO '/dev/null' WITH BINARY;
COPY 86953
Time: 1782.810 ms

Ok, so here the benefits are already nicer.

Imo this shows that using a different compression algorithm is quite a
good idea.

Important questions are:
1) Which algorithms do we want? I think snappy is a good candidate but I
mostly chose it because it's BSD licenced, widely used, and has a C
implementation with a useable API. LZ4 might be another interesting
choice. Another slower algorithm with higher compression ratio
would also be a good idea for many applications.
2) Do we want to build infrastructure for more than 3 compression
algorithms? We could delay that decision till we add the 3rd.
3) Surely choosing the compression algorithm via GUC ala SET
toast_compression_algo = ... isn't the way to go. I'd say a storage
attribute is more appropriate?
4) The prototype removed knowledge about the internals of compression
from postgres.h which imo is a good idea, but that is debatable.
5) E.g. snappy stores the uncompressed length internally as a varint,
but I don't think there's a way to benefit from that on little endian
machines since the high bits we use to discern from pglz are actually
stored 4 bytes in...

Two patches attached:
1) add snappy to src/common. The integration needs some more work.
2) Combined patch that adds indirect tuples and snappy compression. Those
could be separated, but this is an experiment so far...

Comments?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-snappy-compression-algorithm-to-contrib.patch (text/x-patch; charset=us-ascii, +1701-2)
0002-Add-support-for-multiple-kinds-of-external-toast-dat.patch (text/x-patch; charset=us-ascii, +381-102)
#4 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: extensible external toast tuple support & snappy prototype

On Wed, Jun 5, 2013 at 11:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-31 23:42:51 -0400, Robert Haas wrote:

This should allow for fairly easy development of a new compression
scheme for out-of-line toast tuples. It will *not* work for compressed
inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
problem or that if it is, that it cannot be solved separately.

Seems pretty sensible to me. The patch is obviously WIP but the
direction seems fine to me.

So, I played a bit more with this, with an eye towards getting this into
a non WIP state, but: While I still think the method for providing
indirect external Datum support is fine, I don't think my sketch for
providing extensible compression is.

I didn't really care about doing (and don't really want to do) both
things in the same patch. I just didn't want the patch to shut the
door to extensible compression in the future.

Important questions are:
1) Which algorithms do we want? I think snappy is a good candidate but I
mostly chose it because it's BSD licenced, widely used, and has a C
implementation with a useable API. LZ4 might be another interesting
choice. Another slower algorithm with higher compression ratio
would also be a good idea for many applications.

I have no opinion on this.

2) Do we want to build infrastructure for more than 3 compression
algorithms? We could delay that decision till we add the 3rd.

I think we should leave the door open, but I don't have a compelling
desire to get too baroque for v1. Still, maybe if the first byte has
a 1 in the high-bit, the next 7 bits should be defined as specifying a
compression algorithm. 3 compression algorithms would probably last
us a while; but 127 should last us, in effect, forever.

3) Surely choosing the compression algorithm via GUC ala SET
toast_compression_algo = ... isn't the way to go. I'd say a storage
attribute is more appropriate?

The way we do caching right now supposes that attoptions will be
needed only occasionally. It might need to be revised if we're going
to need it all the time. Or else we might need to use a dedicated
pg_class column.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)
Re: extensible external toast tuple support & snappy prototype

On 2013-06-07 10:04:15 -0400, Robert Haas wrote:

On Wed, Jun 5, 2013 at 11:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-31 23:42:51 -0400, Robert Haas wrote:

This should allow for fairly easy development of a new compression
scheme for out-of-line toast tuples. It will *not* work for compressed
inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
problem or that if it is, that it cannot be solved separately.

Seems pretty sensible to me. The patch is obviously WIP but the
direction seems fine to me.

So, I played a bit more with this, with an eye towards getting this into
a non WIP state, but: While I still think the method for providing
indirect external Datum support is fine, I don't think my sketch for
providing extensible compression is.

I didn't really care about doing (and don't really want to do) both
things in the same patch. I just didn't want the patch to shut the
door to extensible compression in the future.

Oh. I don't want to actually commit it in the same patch either. But to
keep the road for extensible compression open we kinda need to know what
the way to do that is. Turns out it's an independent thing that doesn't
reuse any of the respective infrastructures.

I only went so far to actually implement the compression because a) my
previous thoughts about how it could work were bogus b) it was fun.

Turns out the benefits are imo big enough to make it worth pursuing
further.

2) Do we want to build infrastructure for more than 3 compression
algorithms? We could delay that decision till we add the 3rd.

I think we should leave the door open, but I don't have a compelling
desire to get too baroque for v1. Still, maybe if the first byte has
a 1 in the high-bit, the next 7 bits should be defined as specifying a
compression algorithm. 3 compression algorithms would probably last
us a while; but 127 should last us, in effect, forever.

The problem is that, to discern from pglz, the byte with the two high
bits (which pglz leaves unset) is actually the fourth byte in a toast
datum on little endian. So we would need to store it in the 5th byte or
invent some more complicated encoding scheme.

So I think we should just define '00' as pglz, '01' as xxx, '10' as yyy
and '11' as storing the scheme in the next byte.

3) Surely choosing the compression algorithm via GUC ala SET
toast_compression_algo = ... isn't the way to go. I'd say a storage
attribute is more appropriate?

The way we do caching right now supposes that attoptions will be
needed only occasionally. It might need to be revised if we're going
to need it all the time. Or else we might need to use a dedicated
pg_class column.

Good point. It probably belongs right besides attstorage, seems to be
the most consistent choice anyway.

Alternatively, if we only add one form of compression, we can just
always store in snappy/lz4/....

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: extensible external toast tuple support & snappy prototype

On Fri, Jun 7, 2013 at 10:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Turns out the benefits are imo big enough to make it worth pursuing
further.

Yeah, those were nifty numbers.

The problem is that, to discern from pglz, the byte with the two high
bits (which pglz leaves unset) is actually the fourth byte in a toast
datum on little endian. So we would need to store it in the 5th byte or
invent some more complicated encoding scheme.

So I think we should just define '00' as pglz, '01' as xxx, '10' as yyy
and '11' as storing the scheme in the next byte.

Not totally following, but I'm fine with that.

3) Surely choosing the compression algorithm via GUC ala SET
toast_compression_algo = ... isn't the way to go. I'd say a storage
attribute is more appropriate?

The way we do caching right now supposes that attoptions will be
needed only occasionally. It might need to be revised if we're going
to need it all the time. Or else we might need to use a dedicated
pg_class column.

Good point. It probably belongs right besides attstorage, seems to be
the most consistent choice anyway.

Possibly, we could even store it in attstorage. We're really only
using two bits of that byte right now, so just invent some more
letters.

Alternatively, if we only add one form of compression, we can just
always store in snappy/lz4/....

Not following.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#7 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#6)
Re: extensible external toast tuple support & snappy prototype

On 2013-06-07 10:44:24 -0400, Robert Haas wrote:

On Fri, Jun 7, 2013 at 10:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Turns out the benefits are imo big enough to make it worth pursuing
further.

Yeah, those were nifty numbers.

The problem is that, to discern from pglz, the byte with the two high
bits (which pglz leaves unset) is actually the fourth byte in a toast
datum on little endian. So we would need to store it in the 5th byte or
invent some more complicated encoding scheme.

So I think we should just define '00' as pglz, '01' as xxx, '10' as yyy
and '11' as storing the scheme in the next byte.

Not totally following, but I'm fine with that.

Currently on a little endian system the pglz header contains the length
in the first four bytes as:
[dddddddd][dddddddd][dddddddd][xxdddddd]
Where dd are valid length bits for pglz and xx are the two bits which
are always zero since we only ever store up to 1GB. We can redefine 'xx'
to mean whatever we want but we cannot change its place.

3) Surely choosing the compression algorithm via GUC ala SET
toast_compression_algo = ... isn't the way to go. I'd say a storage
attribute is more appropriate?

The way we do caching right now supposes that attoptions will be
needed only occasionally. It might need to be revised if we're going
to need it all the time. Or else we might need to use a dedicated
pg_class column.

Good point. It probably belongs right besides attstorage, seems to be
the most consistent choice anyway.

Possibly, we could even store it in attstorage. We're really only
using two bits of that byte right now, so just invent some more
letters.

Hm. Possible, but I don't think that's worth it. There's a padding byte
before attinhcount anyway.
Storing the preferred location in attstorage (plain, preferred-internal,
external, preferred-external) separately from the compression seems to
make sense to me.

Alternatively, if we only add one form of compression, we can just
always store in snappy/lz4/....

Not following.

I mean, we don't necessarily need to make it configurable if we just add
one canonical new "better" compression format. I am not sure that's
sufficient since I can see usecases for 'very fast but not too well
compressed' and 'very well compressed but slow', but I am personally not
really interested in the second case, so ...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#8 Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#7)
Re: extensible external toast tuple support & snappy prototype

On 06/07/2013 04:54 PM, Andres Freund wrote:

I mean, we don't necessarily need to make it configurable if we just add
one canonical new "better" compression format. I am not sure that's
sufficient since I can see usecases for 'very fast but not too well
compressed' and 'very well compressed but slow', but I am personally not
really interested in the second case, so ...

As decompression is often still fast for slow-but-good compression,
the obvious use case for the second is read-mostly data

Greetings,

Andres Freund

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#9 Andres Freund
andres@anarazel.de
In reply to: Hannu Krosing (#8)
Re: extensible external toast tuple support & snappy prototype

On 2013-06-07 17:27:28 +0200, Hannu Krosing wrote:

On 06/07/2013 04:54 PM, Andres Freund wrote:

I mean, we don't necessarily need to make it configurable if we just add
one canonical new "better" compression format. I am not sure that's
sufficient since I can see usecases for 'very fast but not too well
compressed' and 'very well compressed but slow', but I am personally not
really interested in the second case, so ...

As decompression is often still fast for slow-but-good compression,
the obvious use case for the second is read-mostly data

Well. Those algorithms still are up to a magnitude or so slower at
decompressing than something like snappy, lz4 or even pglz, while the
compression ratio usually is only improved by something like 50-80%...
So you really need a good bit of compressible data (so the amount of
storage actually hurts) that you don't read all that often (since you
would then bottleneck on decompression too often).
That's just not something I run across too regularly.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: extensible external toast tuple support & snappy prototype

Andres Freund <andres@2ndquadrant.com> writes:

I mean, we don't necessarily need to make it configurable if we just add
one canonical new "better" compression format. I am not sure that's
sufficient since I can see usecases for 'very fast but not too well
compressed' and 'very well compressed but slow', but I am personally not
really interested in the second case, so ...

IME, once we've changed it once, the odds that we'll want to change it
again go up drastically. I think if we're going to make a change here
we should leave room for future revisions.

regards, tom lane


#11 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: extensible external toast tuple support & snappy prototype

On 2013-06-07 11:46:45 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

I mean, we don't necessarily need to make it configurable if we just add
one canonical new "better" compression format. I am not sure that's
sufficient since I can see usecases for 'very fast but not too well
compressed' and 'very well compressed but slow', but I am personally not
really interested in the second case, so ...

IME, once we've changed it once, the odds that we'll want to change it
again go up drastically. I think if we're going to make a change here
we should leave room for future revisions.

The above bit was just about how much control we give the user over the
compression algorithm used for compressing new data. If we just add one
method atm which we think is just about always better than pglz, there's
not much need to provide the tunables yet.

I don't think there's any question over the fact that we should leave
room on the storage level to reasonably easy add new compression
algorithms without requiring on-disk changes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#3)
Re: extensible external toast tuple support & snappy prototype

Andres Freund wrote:

2) Combined patch that adds indirect tuples and snappy compression. Those
could be separated, but this is an experiment so far...

Can we have a separate header for toast definitions? (i.e. split them
out of postgres.h)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: extensible external toast tuple support & snappy prototype

Andres Freund <andres@2ndquadrant.com> writes:

On 2013-06-07 11:46:45 -0400, Tom Lane wrote:

IME, once we've changed it once, the odds that we'll want to change it
again go up drastically. I think if we're going to make a change here
we should leave room for future revisions.

The above bit was just about how much control we give the user over the
compression algorithm used for compressing new data. If we just add one
method atm which we think is just about always better than pglz, there's
not much need to provide the tunables yet.

Ah, ok, I thought you were talking about storage-format decisions not
about whether to expose a tunable setting.

regards, tom lane


#14 Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#9)
Re: extensible external toast tuple support & snappy prototype

On 06/07/2013 05:38 PM, Andres Freund wrote:

On 2013-06-07 17:27:28 +0200, Hannu Krosing wrote:

On 06/07/2013 04:54 PM, Andres Freund wrote:

I mean, we don't necessarily need to make it configurable if we just add
one canonical new "better" compression format. I am not sure that's
sufficient since I can see usecases for 'very fast but not too well
compressed' and 'very well compressed but slow', but I am personally not
really interested in the second case, so ...

As decompression is often still fast for slow-but-good compression,
the obvious use case for the second is read-mostly data

Well. Those algorithms still are up to a magnitude or so slower at
decompressing than something like snappy, lz4 or even pglz, while the
compression ratio usually is only improved by something like 50-80%...
So you really need a good bit of compressible data (so the amount of
storage actually hurts) that you don't read all that often (since you
would then bottleneck on decompression too often).
That's just not something I run across too regularly.

While compression speeds differ between algorithms, that may be more
than offset in favour of better compression if there is real I/O
involved, as exemplified here:
http://www.citusdata.com/blog/64-zfs-compression

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#15 Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#7)
Re: extensible external toast tuple support & snappy prototype

* Andres Freund (andres@2ndquadrant.com) wrote:

Currently on a little endian system the pglz header contains the length
in the first four bytes as:
[dddddddd][dddddddd][dddddddd][xxdddddd]
Where dd are valid length bits for pglz and xx are the two bits which
are always zero since we only ever store up to 1GB. We can redefine 'xx'
to mean whatever we want but we cannot change its place.

I'm not thrilled with the idea of using those 2 bits from the length
integer. I understand the point of it and that we'd be able to have
binary compatibility from it but is it necessary to track at the
per-tuple level..? What about possibly supporting >1GB objects at some
point (yes, I know there's a lot of other issues there, but still).
We've also got complexity around the size of the length integer already.

Anyway, just not 100% sure that we really want to use these bits for
this.

Thanks,

Stephen

#16 Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#15)
Re: extensible external toast tuple support & snappy prototype

On 2013-06-07 12:16:48 -0400, Stephen Frost wrote:

* Andres Freund (andres@2ndquadrant.com) wrote:

Currently on a little endian system the pglz header contains the length
in the first four bytes as:
[dddddddd][dddddddd][dddddddd][xxdddddd]
Where dd are valid length bits for pglz and xx are the two bits which
are always zero since we only ever store up to 1GB. We can redefine 'xx'
to mean whatever we want but we cannot change its place.

I'm not thrilled with the idea of using those 2 bits from the length
integer. I understand the point of it and that we'd be able to have
binary compatibility from it but is it necessary to track at the
per-tuple level..? What about possibly supporting >1GB objects at some
point (yes, I know there's a lot of other issues there, but still).
We've also got complexity around the size of the length integer already.

I am open to different suggestions, but I don't know of any realistic
ones.
Note that the 1GB limitation is already pretty heavily baked into
varlenas themselves (which is not the length we are talking about here!)
since we use the two remaining bits to discern between the 4 types of
varlenas we have:

* short (1b)
* short, pointing to a ondisk tuple (1b_e)
* long (4b)
* long compressed (4b_c)

Since long compressed ones always need to be convertible to long ones we
can't ever have a 'rawsize' (which is what's proposed to be used for
this) that's larger than 1GB.

So breaking the 1GB limit will not be stopped by this, but by limits hit
much, much earlier. And it will require a break in on-disk compatibility.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#17 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: extensible external toast tuple support

On 2013-05-31 23:42:51 -0400, Robert Haas wrote:

On Thu, May 30, 2013 at 7:42 AM, Andres Freund <andres@2ndquadrant.com> wrote:

In
http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de
I presented the need for 'indirect' toast tuples which point into memory
instead of a toast table. In the comments on that proposal, and in
off-list and in-person talks, the wish has been voiced to make this a
more general concept.

The previous patch used varattrib_1b_e.va_len_1be to discern between
different types of external tuples. That obviously only works if the
data sizes of all possibly stored datum types are distinct which isn't
nice. So what the newer patch now does is to rename that field into
'va_tag' and decide based on that what kind of Datum we have. To get the
actual length of that datum there now is a VARTAG_SIZE() macro which
maps the tags back to size.
To keep on-disk compatibility the size of an external toast tuple
containing a varatt_external is used as its tag value.

This should allow for fairly easy development of a new compression
scheme for out-of-line toast tuples. It will *not* work for compressed
inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
problem or that if it is, that it cannot be solved separately.

FWIW, in some quick microbenchmarks I couldn't find any performance
difference due to the slightly more complex size computation which I do
*not* find surprising.

Opinions?

Seems pretty sensible to me. The patch is obviously WIP but the
direction seems fine to me.

Here's the updated version. It shouldn't contain any obvious WIP pieces
anymore, although I think it needs some more documentation. I am just
not sure where to add it yet, postgres.h seems like a bad place :/

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-support-for-multiple-kinds-of-external-toast-dat.patch (text/x-patch; charset=us-ascii) +153 -33
#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#17)
Re: extensible external toast tuple support

Andres Freund wrote:

Here's the updated version. It shouldn't contain any obvious WIP pieces
anymore, although I think it needs some more documentation. I am just
not sure where to add it yet, postgres.h seems like a bad place :/

How about a new file, say src/include/access/toast.h?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#19Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#18)
Re: extensible external toast tuple support

On 2013-06-14 19:14:15 -0400, Alvaro Herrera wrote:

Andres Freund wrote:

Here's the updated version. It shouldn't contain any obvious WIP pieces
anymore, although I think it needs some more documentation. I am just
not sure where to add it yet, postgres.h seems like a bad place :/

How about a new file, say src/include/access/toast.h?

Well, the question is whether that buys us all that much; we need the
varlena definitions to be available pretty much everywhere. Except for
section 3 of postgres.h - which we have reduced to be pretty darn small
these days - pretty much all of it is concerned with Datums, a good part
of them being varlenas.
We could move section 1 into its own file and unconditionally include
it...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#20Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Andres Freund (#17)
Re: extensible external toast tuple support

On Fri, Jun 14, 2013 at 4:06 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Here's the updated version. It shouldn't contain any obvious WIP pieces
anymore, although I think it needs some more documentation. I am just
not sure where to add it yet, postgres.h seems like a bad place :/

I have a few comments and questions after reviewing this patch.

- Doesn't heap_tuple_fetch_attr need to be updated to reflect the ONDISK macro?
- I'm not sure if the plural for datum is good to use. Datum values?
- -1 from me on using an enum for the tag types, as I don't think it needs
to be one. This looks more like other "kind" macros such as relkind, but I
know there are pros/cons.
- Don't we need a cast for the tag value comparison in the VARTAG_SIZE
macro, since the tag is uint8 and enums are signed int?
- Is there a better way to represent the ONDISK size, instead of the magic
number 18? I'd suggest constructing it with sizeof(varatt_external).

Other than that, the patch applies fine and make check runs, though I don't
think the new code path is exercised well, as no one is creating INDIRECT
datums yet.

Also, I wonder how we could add more compression for datums, as well as how
we are going to add more compression options in the database. I'd love to
see pluggability here, as the core surely cannot support dozens of different
compression algorithms; but because the data on disk is critical, we cannot
rely on anything like user-defined functions. The algorithms should be
optional built-ins, so that once the system is initialized the plugin
cannot go away. Anyway, pluggable compression is off-topic here.

Thanks,
--
Hitoshi Harada

#21Andres Freund
andres@anarazel.de
In reply to: Hitoshi Harada (#20)
#22Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Andres Freund (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Hitoshi Harada (#22)
#24Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Andres Freund (#3)
#25Andres Freund
andres@anarazel.de
In reply to: Hitoshi Harada (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#25)
#27Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Robert Haas (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#29Andres Freund
andres@anarazel.de
In reply to: Hitoshi Harada (#27)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#30)
#32Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#32)
#34Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#34)
#36Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#36)