Physical append-only tables

Started by Magnus Haganderabout 9 years ago13 messages
#1Magnus Hagander
magnus@hagander.net

Scenario:

Large table, *mostly* insert+select only, either a sequence or a timestamp
column set to current time. This is an ideal usecase for a BRIN index.
Awesome.

But then consider the same table. Except rows are typically updated once or
twice when they are new, and *then* go read only. And we also have a
process that at some point deletes *some* old rows (but not all - in fact,
only a small portion).

In this case, the next INSERT once VACUUM has run is likely to stick a
"new" row somewhere very "far back" in the table, since there is now free
space there. This more or less completely ruins the BRIN index usability,
as the "old" blocks will now contain a single row from a "new" series.

For a scenario like this, would it make sense to have an option that could
be set on an individual table making it physical append only? Basically
VACUUM would run as normal and clean up the old space when rows are deleted
back in history, but when new space is needed for a row the system would
never look at the old blocks, and only append to the end.

Yes, this wastes space in the database. But not as much as trying to track
the deleted rows somewhere else and do an anti-join or something like that.
And it would obviously not be on by default.

Eventually the lost space might grow to a point where re-CLUSTERing the
table might be worthwhile. But given the cost of that, I can see many
scenarios where people are just willing to pay that extra overhead on large
tables that are more or less never deleted from.

I've run into a number of cases recently where this would've made the BRIN
indexes on huge tables *much* more efficient. At least, it seems to me they
would :)

Or am I missing something that would make this not work?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#2Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Magnus Hagander (#1)
Re: Physical append-only tables

On Mon, Nov 14, 2016 at 4:45 AM, Magnus Hagander <magnus@hagander.net> wrote:

For a scenario like this, would it make sense to have an option that could
be set on an individual table making it physical append only? Basically
VACUUM would run as normal and clean up the old space when rows are deleted
back in history, but when new space is needed for a row the system would
never look at the old blocks, and only append to the end.

One thing I have idly wondered about: if you had an append-only mode
that guarantees that we never write new tuples anywhere but the end of
the heap, you could create a bsearch access method that uses zero
storage and simply checks that every key inserted is >= the previous
high key before allowing the insertion to proceed. Then it could make
use of the guaranteed correlation with physical order to do scans
using a binary search of the heap. Maybe that'd be useful for some
kinds of write-only time-series data that needs to be searched by time
range. On the other hand, BRIN indexes are tiny, should be nearly as
good and are much less restrictive, so I haven't follow this thought
up.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Magnus Hagander (#1)
Re: Physical append-only tables

Magnus Hagander wrote:

But then consider the same table. Except rows are typically updated once or
twice when they are new, and *then* go read only. And we also have a
process that at some point deletes *some* old rows (but not all - in fact,
only a small portion).

In this case, the next INSERT once VACUUM has run is likely to stick a
"new" row somewhere very "far back" in the table, since there is now free
space there. This more or less completely ruins the BRIN index usability,
as the "old" blocks will now contain a single row from a "new" series.

Yeah. When we initially discussed BRIN, there was a mention of allowing
a BRIN index to guide new tuple location -- something like
auto-clustering, if you will. We haven't discussed the exact details
but I think something along those lines is worth considering.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Magnus Hagander
magnus@hagander.net
In reply to: Alvaro Herrera (#3)
Re: Physical append-only tables

On Mon, Nov 14, 2016 at 2:35 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Magnus Hagander wrote:

But then consider the same table. Except rows are typically updated once

or

twice when they are new, and *then* go read only. And we also have a
process that at some point deletes *some* old rows (but not all - in

fact,

only a small portion).

In this case, the next INSERT once VACUUM has run is likely to stick a
"new" row somewhere very "far back" in the table, since there is now free
space there. This more or less completely ruins the BRIN index usability,
as the "old" blocks will now contain a single row from a "new" series.

Yeah. When we initially discussed BRIN, there was a mention of allowing
a BRIN index to guide new tuple location -- something like
auto-clustering, if you will. We haven't discussed the exact details
but I think something along those lines is worth considering.

What I'm talking about is something that would be a lot simpler than
auto-clustering. I'm not even talking about trying to detect if the row was
about to go into the right place -- this might be expensive and certainly
more complicated. I'm only talking about a simple case where we *never* put
anything anywhere other than at the end of the table, period. That should
make the check both cheap and simple.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#4)
Re: Physical append-only tables

Magnus Hagander <magnus@hagander.net> writes:

What I'm talking about is something that would be a lot simpler than
auto-clustering. I'm not even talking about trying to detect if the row was
about to go into the right place -- this might be expensive and certainly
more complicated. I'm only talking about a simple case where we *never* put
anything anywhere other than at the end of the table, period. That should
make the check both cheap and simple.

It also makes it so much of a corner case that even a cheap check could be
a net performance degradation, especially for people whose usage pattern
doesn't match this.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#5)
Re: Physical append-only tables

On Mon, Nov 14, 2016 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

What I'm talking about is something that would be a lot simpler than
auto-clustering. I'm not even talking about trying to detect if the row

was

about to go into the right place -- this might be expensive and certainly
more complicated. I'm only talking about a simple case where we *never*

put

anything anywhere other than at the end of the table, period. That should
make the check both cheap and simple.

It also makes it so much of a corner case that even a cheap check could be
a net performance degradation, especially for people whose usage pattern
doesn't match this.

I agree that it definitely solves just one problem. But it seems to be a
fairly common problem, particularly for users of BRIN users.

Full auto-clustering would cover many more usecases, but would also be a
lot more expensive to maintain.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#7Greg Stark
stark@mit.edu
In reply to: Magnus Hagander (#1)
Re: Physical append-only tables

On Sun, Nov 13, 2016 at 3:45 PM, Magnus Hagander <magnus@hagander.net> wrote:

For a scenario like this, would it make sense to have an option that could
be set on an individual table making it physical append only? Basically
VACUUM would run as normal and clean up the old space when rows are deleted
back in history, but when new space is needed for a row the system would
never look at the old blocks, and only append to the end.

I don't think "appending" is the right way to think about this. It
happens to address the problem but only accidentally and only
partially. More generally what you have is two different kinds of data
with two different access patterns and storage requirements in the
same table. They're logically similar but have different practical
requirements.

If there was some way to teach the database that your table is made of
two different types of data and how to distinguish the two types then
when the update occurs it could move the row to the right section of
storage... This might be something the new partitioning could handle
or it might need something more low-level and implicit.

That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Magnus Hagander
magnus@hagander.net
In reply to: Greg Stark (#7)
Re: Physical append-only tables

On Mon, Nov 14, 2016 at 9:43 PM, Greg Stark <stark@mit.edu> wrote:

On Sun, Nov 13, 2016 at 3:45 PM, Magnus Hagander <magnus@hagander.net>
wrote:

For a scenario like this, would it make sense to have an option that

could

be set on an individual table making it physical append only? Basically
VACUUM would run as normal and clean up the old space when rows are

deleted

back in history, but when new space is needed for a row the system would
never look at the old blocks, and only append to the end.

I don't think "appending" is the right way to think about this. It
happens to address the problem but only accidentally and only
partially. More generally what you have is two different kinds of data
with two different access patterns and storage requirements in the
same table. They're logically similar but have different practical
requirements.

If there was some way to teach the database that your table is made of
two different types of data and how to distinguish the two types then
when the update occurs it could move the row to the right section of
storage... This might be something the new partitioning could handle
or it might need something more low-level and implicit.

Agreed, though in the cases I've looked at this has not been a static
thing. Some of it might be driven off dynamic data somewhere else, some of
it may be "this data has to be deleted for regulatory reasons". That can
show up on a case-by-case basis. I don't think the partitioning can really
be *that* flexible, though it might be able to pick up some of the issues.

The problem with the partitioning is also that it only works if you can
ensure the partitioning key is in every query, which often doesn't work out
against this.

That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

In the use-cases I've had it's really the DELETE that's the problem. In all
those cases UPDATEs only happen on fairly recent data, so it doesn't really
screw with the BRIN. It's DELETEs of old data that's the big issue.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#9Bruce Momjian
bruce@momjian.us
In reply to: Greg Stark (#7)
Re: Physical append-only tables

On Mon, Nov 14, 2016 at 08:43:12PM +0000, Greg Stark wrote:

That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

What if we used BRIN to find heap pages where the new row was in the
page BRIN min/max range, and the heap page had free space. Only if that
fails do we put is somewhere else in the heap.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Magnus Hagander
magnus@hagander.net
In reply to: Bruce Momjian (#9)
Re: Physical append-only tables

On Thu, Nov 24, 2016 at 2:26 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Mon, Nov 14, 2016 at 08:43:12PM +0000, Greg Stark wrote:

That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

What if we used BRIN to find heap pages where the new row was in the
page BRIN min/max range, and the heap page had free space. Only if that
fails do we put is somewhere else in the heap.

That would certainly be useful. You'd have to figure out what to do in the
case of multiple conflicting BRIN indexes (which you shouldn't have in the
first place, but that won't keep people from having them), but other than
that it would be quite good I think.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#11Bruce Momjian
bruce@momjian.us
In reply to: Magnus Hagander (#10)
Re: Physical append-only tables

On Thu, Nov 24, 2016 at 10:13:30AM +0100, Magnus Hagander wrote:

On Thu, Nov 24, 2016 at 2:26 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Mon, Nov 14, 2016 at 08:43:12PM +0000, Greg Stark wrote:

That said, I don't think the "maintain clustering a bit better using
BRIN" is a bad idea. It's just the bit about turning a table
append-only to deal with update-once data that I think is overreach.

What if we used BRIN to find heap pages where the new row was in the
page BRIN min/max range, and the heap page had free space.� Only if that
fails do we put is somewhere else in the heap.

That would certainly be useful. You'd have to figure out what to do in the case
of multiple conflicting BRIN indexes (which you shouldn't have in the first
place, but that won't keep people from having them), but other than that it
would be quite good I think.�

This idea is only possible because the BRIN index is so small and easy
to scan, i.e. this wouldn't work for a btree index.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#11)
Re: Physical append-only tables

On 11/24/16 8:18 AM, Bruce Momjian wrote:

What if we used BRIN to find heap pages where the new row was in the
page BRIN min/max range, and the heap page had free space. Only if that
fails do we put is somewhere else in the heap.

That would certainly be useful. You'd have to figure out what to do in the case
of multiple conflicting BRIN indexes (which you shouldn't have in the first
place, but that won't keep people from having them), but other than that it
would be quite good I think.

This idea is only possible because the BRIN index is so small and easy
to scan, i.e. this wouldn't work for a btree index.

ISTM a prerequisite for any of this is a way to override the default FSM
behavior. A simple strategy that forces append-only would presumably be
very cheap and easy to do after that. It could also be used to allow
better clustering. It would also make it far easier to recover from a
heavily bloated table that's too large to simply VACUUM FULL or CLUSTER,
without resorting to the contortions that pg_repack/pg_reorg have to.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Jim Nasby (#12)
Re: Physical append-only tables

On Mon, Nov 28, 2016 at 9:05 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 11/24/16 8:18 AM, Bruce Momjian wrote:

What if we used BRIN to find heap pages where the new row was in the
page BRIN min/max range, and the heap page had free space. Only if
that
fails do we put is somewhere else in the heap.

That would certainly be useful. You'd have to figure out what to do in
the case
of multiple conflicting BRIN indexes (which you shouldn't have in the
first
place, but that won't keep people from having them), but other than that
it
would be quite good I think.

This idea is only possible because the BRIN index is so small and easy
to scan, i.e. this wouldn't work for a btree index.

ISTM a prerequisite for any of this is a way to override the default FSM
behavior. A simple strategy that forces append-only would presumably be very
cheap and easy to do after that. It could also be used to allow better
clustering. It would also make it far easier to recover from a heavily
bloated table that's too large to simply VACUUM FULL or CLUSTER, without
resorting to the contortions that pg_repack/pg_reorg have to.

Since building BRIN index doesn't take a long time, it would be good
enough to support the online-clustering (or clustering with minimal
lock like pg_repack/pg_reorg does) in most cases. And I'm not sure
that there are a lot of users who build only BRIN index on the table.
I think that many users want to build other indexes on same table
other columns.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers