Preventing free space from being reused

Started by Noah Bergbauerabout 5 years ago5 messageshackers
Jump to latest
#1Noah Bergbauer
noah@statshelix.com

Hello,

I am working on a project where I do not want Postgres to reuse free space
in old pages (see
/messages/by-id/CABjy+RhbFu_Hs8ZEiOzaPaJSGB9jqFF0gDU5gtwCLiurG3NLjQ@mail.gmail.com
for details). I found that the HEAP_INSERT_SKIP_FSM flag accomplishes this.
For a long-term solution I see two options:

1. Introduce a reloption for this.
2. Implement it as a custom table access method in an extension.

As an experiment, I have created an extension which forwards all table
access functions to the builtin heap access method, but enables the
HEAP_INSERT_SKIP_FSM flag for heap_insert and heap_multi_insert. However,
the check in heap_getnext (
https://github.com/postgres/postgres/blob/REL_12_5/src/backend/access/heap/heapam.c#L1294-L1304)
is a showstopper. Because the custom access method uses a different
function table (so that I can override heap_insert and heap_multi_insert),
heap_getnext errors out with "only heap AM is supported". I am currently
hacking around this problem by duplicating all code up to and including
heap_getnext, with this check commented out. Clearly this is not ideal, as
changes to the heap code in future updates might cause incompatibilities.

Any ideas on how to proceed with this issue?

Thank you,
Noah Bergbauer

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Bergbauer (#1)
Re: Preventing free space from being reused

Noah Bergbauer <noah@statshelix.com> writes:

I am working on a project where I do not want Postgres to reuse free space
in old pages (see
/messages/by-id/CABjy+RhbFu_Hs8ZEiOzaPaJSGB9jqFF0gDU5gtwCLiurG3NLjQ@mail.gmail.com
for details). I found that the HEAP_INSERT_SKIP_FSM flag accomplishes this.
For a long-term solution I see two options:
1. Introduce a reloption for this.
2. Implement it as a custom table access method in an extension.

TBH, I can't believe that this is actually a good idea. If we introduce
a reloption that does that, we'll just be getting users complaining about
table bloat ... but probably only after they get to a state where it's
going to be horribly painful to get out of.

(My reaction to your previous thread was that it was simply a question
of blindly insisting on using BRIN indexes for a case that they're quite
badly adapted to. The better answer is to not use BRIN.)

regards, tom lane

#3Noah Bergbauer
noah@statshelix.com
In reply to: Tom Lane (#2)
Re: Preventing free space from being reused

(My reaction to your previous thread was that it was simply a question
of blindly insisting on using BRIN indexes for a case that they're quite
badly adapted to. The better answer is to not use BRIN.)

Apologies, perhaps I am completely misunderstanding the motivation for BRIN?

From the docs:

BRIN is designed for handling very large tables in which certain columns

have some natural correlation with their physical location within the table.

[...]
a table storing a store's sale orders might have a date column on which

each order was placed, and most of the time the entries for earlier orders
will appear earlier in the table

My table is very large, and the column in question has a strong natural
correlation with each tuple's physical location. It is, in fact, a date
column where entries with earlier timestamps will appear earlier in the
table. To be honest, if this isn't a use case for BRIN, then I don't know
what is. The only exception to this is a small proportion of tuples which
are slotted into random older pages due to their small size.

A btree index on the same column is 700x the size of BRIN, or 10% of
relation itself. It does not perform significantly better than BRIN. The
issue here is twofold: not only does slotting these tuples into older pages
significantly reduce the effectiveness of BRIN, it also causes
fragmentation on disk. Ultimately, this is why CLUSTER exists. One way to
look at this situation is that my data is inserted exactly in index order,
but Postgres keeps un-clustering it for reasons that are valid in general
(don't waste disk space) but don't apply at all in this case (the file
system uses compression, no space is wasted).

Any alternative ideas would of course be much appreciated! But at the
moment HEAP_INSERT_SKIP_FSM seems like the most practical solution to me.

On Fri, Feb 12, 2021 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Noah Bergbauer <noah@statshelix.com> writes:

I am working on a project where I do not want Postgres to reuse free

space

in old pages (see

/messages/by-id/CABjy+RhbFu_Hs8ZEiOzaPaJSGB9jqFF0gDU5gtwCLiurG3NLjQ@mail.gmail.com

for details). I found that the HEAP_INSERT_SKIP_FSM flag accomplishes

this.

For a long-term solution I see two options:
1. Introduce a reloption for this.
2. Implement it as a custom table access method in an extension.

TBH, I can't believe that this is actually a good idea. If we introduce
a reloption that does that, we'll just be getting users complaining about
table bloat ... but probably only after they get to a state where it's
going to be horribly painful to get out of.

(My reaction to your previous thread was that it was simply a question
of blindly insisting on using BRIN indexes for a case that they're quite
badly adapted to. The better answer is to not use BRIN.)

regards, tom lane

#4John Naylor
john.naylor@enterprisedb.com
In reply to: Noah Bergbauer (#3)
Re: Preventing free space from being reused

On Fri, Feb 12, 2021 at 6:21 PM Noah Bergbauer <noah@statshelix.com> wrote:

A btree index on the same column is 700x the size of BRIN, or 10% of

relation itself. It does not perform significantly better than BRIN. The
issue here is twofold: not only does slotting these tuples into older pages
significantly reduce the effectiveness of BRIN, it also causes
fragmentation on disk. Ultimately, this is why CLUSTER exists. One way to
look at this situation is that my data is inserted exactly in index order,
but Postgres keeps un-clustering it for reasons that are valid in general
(don't waste disk space) but don't apply at all in this case (the file
system uses compression, no space is wasted).

Any alternative ideas would of course be much appreciated! But at the

moment HEAP_INSERT_SKIP_FSM seems like the most practical solution to me.

I would suggest to take a look at the BRIN opclass multi-minmax currently
in development. It's designed to address that exact situation, and more
review would be welcome:

https://commitfest.postgresql.org/32/2523/

--
John Naylor
EDB: http://www.enterprisedb.com

#5Noah Bergbauer
noah@statshelix.com
In reply to: John Naylor (#4)
Re: Preventing free space from being reused

I would suggest to take a look at the BRIN opclass multi-minmax currently

in development.

Thank you, this does look like it could help a lot with BRIN performance in
this situation!

But again, if index performance alone was the only issue, then I would
simply accept the space overhead and switch to btree. However, the disk
fragmentation issue still remains and is significant. It is also amplified
in my use case due to using ZFS, mostly for compression. But it is worth
it: I am currently observing a 13x compression ratio (when comparing disk
space reported by du and select sum(octet_length(x)), so this does not
include the false gains from compressing padding). But in general, any
variable-sized append-only workload suffers from this fragmentation
problem. It's just that with filesystem compression, there is no longer a
good reason to fill up those holes and accept the fragmentation.

To be clear, the main reason why I even brought my questions to this
mailing list is that I don't know how to (correctly) get past the check in
heap_getnext (see my first email) when implementing the workaround as a
custom table access method. A reloption could theoretically disable free
space maps entirely for some added efficiency, but I'm inclined to agree
that this is not really needed.

On Sat, Feb 13, 2021 at 1:36 PM John Naylor <john.naylor@enterprisedb.com>
wrote:

Show quoted text

On Fri, Feb 12, 2021 at 6:21 PM Noah Bergbauer <noah@statshelix.com>
wrote:

A btree index on the same column is 700x the size of BRIN, or 10% of

relation itself. It does not perform significantly better than BRIN. The
issue here is twofold: not only does slotting these tuples into older pages
significantly reduce the effectiveness of BRIN, it also causes
fragmentation on disk. Ultimately, this is why CLUSTER exists. One way to
look at this situation is that my data is inserted exactly in index order,
but Postgres keeps un-clustering it for reasons that are valid in general
(don't waste disk space) but don't apply at all in this case (the file
system uses compression, no space is wasted).

Any alternative ideas would of course be much appreciated! But at the

moment HEAP_INSERT_SKIP_FSM seems like the most practical solution to me.

I would suggest to take a look at the BRIN opclass multi-minmax currently
in development. It's designed to address that exact situation, and more
review would be welcome:

https://commitfest.postgresql.org/32/2523/

--
John Naylor
EDB: http://www.enterprisedb.com