Improving vacuum/VM/etc

Started by Jim Nasby over 10 years ago · 4 messages
#1 Jim Nasby
Jim.Nasby@BlueTreble.com

I mentioned this idea in the "other"[1] vacuum thread [2], but I think
it got lost.

Kevin Grittner pointed out that there's a potentially huge number of
writes we incur over the life of a tuple [3]:

(1) WAL log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

He mentioned that a lot of these writes could be combined if they
happened close enough together. We can further add an all-visible state
in at 3.5.

Instead of simply adding all-frozen information to the VM we could
instead store 4 different page states and potentially improve a lot of
different cleanup woes at one time.

Unfortunately, the states I came up with using existing semantics don't
look hugely useful[4], but if we take Robert's idea and make all-visible
mean all-frozen, we can do much better:

0: Newly inserted tuples
Tracking this state allows us to aggressively set hint bits.

1: Newly deleted
There are tuples that have been deleted but not pruned. There may also
be newly inserted tuples that need hinting (state 0).

Similar to state 0, we'd want to be fairly aggressive with these pages,
because as soon as the deleting XID is committed and older than all
snapshots we can prune. Because we can prune without hitting indexes,
this is still a fairly cheap operation, though not as cheap as 0.

2: Fully hinted, not frozen
This is the really painful state to clean up, because we have to deal
with indexes. We must enter this state after being in 1.

3: All-visible-frozen
Every tuple on the page is visible and frozen. Pages in this state need
no maintenance at all. We might be able to enter this state directly
from state 0.
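
For concreteness, here's a toy sketch of what a 2-bits-per-heap-page map
could look like (names made up, not actual VM code; standalone so it
compiles on its own):

/*
 * Hypothetical sketch only: a 2-bits-per-heap-page state map, doubling
 * today's 1-bit visibility map.  All names are made up for illustration.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGESTATE_NEW_INSERTS  0    /* newly inserted, unhinted tuples */
#define PAGESTATE_NEW_DELETES  1    /* newly deleted, unpruned tuples  */
#define PAGESTATE_HINTED       2    /* fully hinted, not frozen        */
#define PAGESTATE_ALL_FROZEN   3    /* all-visible and all-frozen      */

#define STATES_PER_BYTE  4          /* 2 bits per heap page */

static uint8_t
get_page_state(const uint8_t *map, uint32_t heap_blkno)
{
    uint32_t byteno = heap_blkno / STATES_PER_BYTE;
    int      shift  = (heap_blkno % STATES_PER_BYTE) * 2;

    return (map[byteno] >> shift) & 0x03;
}

static void
set_page_state(uint8_t *map, uint32_t heap_blkno, uint8_t state)
{
    uint32_t byteno = heap_blkno / STATES_PER_BYTE;
    int      shift  = (heap_blkno % STATES_PER_BYTE) * 2;

    map[byteno] = (uint8_t) ((map[byteno] & ~(0x03 << shift)) |
                             ((state & 0x03) << shift));
}

int
main(void)
{
    uint8_t map[16] = {0};

    set_page_state(map, 5, PAGESTATE_ALL_FROZEN);
    printf("page 5 state = %u\n", get_page_state(map, 5));
    return 0;
}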

BENEFITS
This tracking should help at least 3 problems: the need to set hint bits
after insert, SELECT queries doing pruning (Simon's recent complaint),
and needing to scan an entire table for freezing.

The improvement in hinting and pruning is based on the idea that
normally there would not be a lot of pages in state 0 or 1, and pages
that were in those states are very likely to still be in disk cache (if
not shared buffers). That means we can have a background process (or 2)
that is very aggressive at targeting pages in these states.

Not needing to scan everything that's frozen is thanks to state 3. I
think it's OK (at least for now) if only vacuum puts pages into this
state, which means it can actually freeze the tuples when it does it
(thanks to 37484ad we won't lose forensic data doing this). That means
there's no extra work necessary by a foreground process that's dirtying
a page.

Because of 37484ad, I think as part of this we should also deprecate
vacuum_freeze_min_age, or at least change its behavior. AFAIK the only
objection to aggressive freezing was loss of forensic data, and that's
gone now. So vacuum (and presumably the bg process(es) that handle states
0 and 1) should freeze tuples if it would allow the whole page to be
frozen. Possibly it should just do it any time it's dirtying the page.
(We could actually do this right now; it would let us eliminate the GUC,
but I'm not sure there'd be other benefit without the rest of this.)
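
Roughly the decision I have in mind, as an untested sketch with made-up
names (the real logic would live wherever we already decide to prune and
freeze):

/* Untested sketch; all names are made up for illustration. */
#include <stdbool.h>
#include <stdio.h>

typedef struct PageFreezeSummary
{
    int   ntuples;          /* live tuples on the page                   */
    int   nfreezable;       /* tuples already safe to freeze             */
    bool  dirtying_anyway;  /* are we dirtying/writing the page already? */
} PageFreezeSummary;

/*
 * Freeze regardless of vacuum_freeze_min_age whenever it leaves the page
 * all-frozen (state 3), or whenever we're dirtying the page anyway, since
 * 37484ad means we no longer lose forensic data by freezing early.
 */
static bool
should_freeze_now(const PageFreezeSummary *s)
{
    if (s->nfreezable == 0)
        return false;
    if (s->nfreezable == s->ntuples)
        return true;                /* whole page becomes frozen */
    return s->dirtying_anyway;      /* piggyback on an existing write */
}

int
main(void)
{
    PageFreezeSummary s = {10, 10, false};

    printf("freeze whole page? %d\n", should_freeze_now(&s));
    return 0;
}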

DOWNSIDES
This does mean doubling the size of the VM. It would still be 32,000
times smaller than the heap with 8k pages (and 128,000 times smaller
with the common warehouse 32k page size), so I suspect this is a
non-issue, but it's worth mentioning. It might have some effect on an
almost entirely read-only system; but I suspect in most other cases the
other benefits will outweigh this.
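
(To spell out the math: each heap page needs 2 bits in the map, so the map
is 8192 * 8 / 2 = 32,768 times smaller than the heap with 8k blocks, and
32768 * 8 / 2 = 131,072 times smaller with 32k blocks; hence the rounded
"32,000" and "128,000" above.)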

This approach still does nothing to help the index related activity in
vacuum. My gut says state 2 should be further split; but I'm not sure
why. Perhaps if we had another state we could do something more
intelligent with index cleanup...

This might put a lot more read pressure on the VMs. We might want some
way to summarize per-table VMs (or ranges of VMs) so that we're not
constantly scanning them.

We'd still have to freeze, as opposed to what might be possible with
XID-LSN. OTOH, most of the changes to do this would be limited to
current VM code and callers. I don't think vacuum itself would need a
lot of changes, and I hope the BG code for states 0/1 wouldn't be that
complicated; it shouldn't need the complexity of autovacuum or vacuum.
So this should be much lower risk than something like XID-LSN.

So... what am I missing? :)

[1]: /messages/by-id/20140912135413.GK4701@eldon.alvh.no-ip.org
[2]: /messages/by-id/2011829201.2201963.1429726992897.JavaMail.yahoo@mail.yahoo.com
[3]: /messages/by-id/771351984.2266772.1429728671811.JavaMail.yahoo@mail.yahoo.com

[4]: 1a: All-visible
What we have today. Page still needs to be visited for freeze, but has
no newly inserted nor newly deleted tuples.

2a: All-frozen
Not only is the page all-visible, it's also all-frozen.

3a: "Other"
Either we couldn't mark the page all-visible after hinting everything
from step 0, or there are newly deleted tuples.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#2 Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#1)
Re: Improving vacuum/VM/etc

On Thu, Apr 23, 2015 at 3:09 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Unfortunately, the states I came up with using existing semantics don't look
hugely useful[4], but if we take Robert's idea and make all-visible mean
all-frozen, we can do much better:

0: Newly inserted tuples
Tracking this state allows us to aggressively set hint bits.

Who is "us"? And what do you mean by "aggressively"? As things
stand, any process that has to touch a tuple always sets any
applicable hint bits.

1: Newly deleted
There are tuples that have been deleted but not pruned. There may also be
newly inserted tuples that need hinting (state 0).

Similar to state 0, we'd want to be fairly aggressive with these pages,
because as soon as the deleting XID is committed and older than all
snapshots we can prune. Because we can prune without hitting indexes, this
is still a fairly cheap operation, though not as cheap as 0.

What behavior difference would you foresee between state 0 and state 1?

2: Fully hinted, not frozen
This is the really painful state to clean up, because we have to deal with
indexes. We must enter this state after being in 1.

Neither the fact that a page is fully hinted nor the fact that it is
or is not frozen implies anything about dealing with indexes. We need
to deal with indexes because the page contains either dead tuples (as
a result of an aborted insert, a committed delete, or an aborted or
committed update) or dead line pointers (as a result of pruning dead
tuples).

3: All-visible-frozen
Every tuple on the page is visible and frozen. Pages in this state need no
maintenance at all. We might be able to enter this state directly from state
0.

BENEFITS
This tracking should help at least 3 problems: the need to set hint bits
after insert, SELECT queries doing pruning (Simon's recent complaint), and
needing to scan an entire table for freezing.

The improvement in hinting and pruning is based on the idea that normally
there would not be a lot of pages in state 0 or 1, and pages that were in
those states are very likely to still be in disk cache (if not shared
buffers). That means we can have a background process (or 2) that is very
aggressive at targeting pages in these states.

OK, I agree that a background process could be useful. Whenever it
sees a dirty page, it could attempt to aggressively set hint bits,
prune, mark all-visible, and freeze the page before that page gets
evicted. However, that doesn't require the sort of state map you're
proposing here.

I think your statement about "pages that were in those states are
still likely to be in the disk cache" is not really true. I mean, if
we're doing OLTP, yes. But not if we're bulk-loading.

Not needing to scan everything that's frozen is thanks to state 3. I think
it's OK (at least for now) if only vacuum puts pages into this state, which
means it can actually freeze the tuples when it does it (thanks to 37484ad
we won't lose forensic data doing this). That means there's no extra work
necessary by a foreground process that's dirtying a page.

Did you notice the discussion on the other thread about this
increasing WAL volume by a factor of 113?

Because of 37484ad, I think as part of this we should also deprecate
vacuum_freeze_min_age, or at least change its behavior. AFAIK the only
objection to aggressive freezing was loss of forensic data, and that's gone
now. So vacuum (and presumably the bg process(es) that handle states 0 and 1)
should freeze tuples if it would allow the whole page to be frozen. Possibly
it should just do it any time it's dirtying the page. (We could actually do
this right now; it would let us eliminate the GUC, but I'm not sure there'd
be other benefit without the rest of this.)

Reducing vacuum_freeze_min_age certainly seems worth considering. I
don't know how to judge whether it's a good idea, though. You're
balancing less I/O later against a lot more WAL right now.

DOWNSIDES
This does mean doubling the size of the VM. It would still be 32,000 times
smaller than the heap with 8k pages (and 128,000 times smaller with the
common warehouse 32k page size), so I suspect this is a non-issue, but it's
worth mentioning. It might have some effect on an almost entirely read-only
system; but I suspect in most other cases the other benefits will outweigh
this.

I don't think that's a problem.

This approach still does nothing to help the index related activity in
vacuum. My gut says state 2 should be further split; but I'm not sure why.
Perhaps if we had another state we could do something more intelligent with
index cleanup...

I can't really follow why you've got these states to begin with. 0,
1, and 2 are all pretty much the same. The useful distinction AFAICS
is between not-all-visible, all-visible, and all-visible-plus-frozen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#2)
Re: Improving vacuum/VM/etc

On 4/24/15 6:50 AM, Robert Haas wrote:

Thanks for looking at this.

On Thu, Apr 23, 2015 at 3:09 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Unfortunately, the states I came up with using existing semantics don't look
hugely useful[4], but if we take Robert's idea and make all-visible mean
all-frozen, we can do much better:

0: Newly inserted tuples
Tracking this state allows us to aggressively set hint bits.

Who is "us"? And what do you mean by "aggressively"? As things
stand, any process that has to touch a tuple always sets any
applicable hint bits.

A background process that will proactively hint tuples before a
foreground process needs to. But see also below...

1: Newly deleted
There are tuples that have been deleted but not pruned. There may also be
newly inserted tuples that need hinting (state 0).

Similar to state 0, we'd want to be fairly aggressive with these pages,
because as soon as the deleting XID is committed and older than all
snapshots we can prune. Because we can prune without hitting indexes, this
is still a fairly cheap operation, though not as cheap as 0.

What behavior difference would you foresee between state 0 and state 1?

Below.

2: Fully hinted, not frozen
This is the really painful state to clean up, because we have to deal with
indexes. We must enter this state after being in 1.

Neither the fact that a page is fully hinted nor the fact that it is
or is not frozen implies anything about dealing with indexes. We need
to deal with indexes because the page contains either dead tuples (as
a result of an aborted insert, a committed delete, or an aborted or
committed update) or dead line pointers (as a result of pruning dead
tuples).

The idea I was shooting for is that the worst-case scenario in cleanup
is dealing with indexes, which we need to do any time a tuple becomes
dead. That's why I made 1 a separate state from 0, but it occurs to me
now that I wasn't very clear about this.

My goal here is that there are two separate "paths" for a page to be in:
either it needs index vacuuming at some point, or it doesn't. If a page
is in state 0, once we can make the page all-visible/frozen it can go
into state 3 and *we never have to clean it again*.

OTOH, if a tuple is marked dead (non-HOT), then we can be aggressive
about hinting (and pruning, if there were HOT updates as well), but no
matter what we must eventually include that page in index cleanup.

So once a page enters state 1 or 2, it may never move to state 0 or 3
without an index scan pass.
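
Put another way, the transition rule I'm proposing is something like this
(made-up names, just to pin the rule down):

/* Made-up names; just encodes the transition rule described above. */
#include <stdbool.h>
#include <stdio.h>

typedef enum
{
    PS_NEW_INSERTS = 0,         /* state 0 */
    PS_NEW_DELETES = 1,         /* state 1 */
    PS_HINTED_NOT_FROZEN = 2,   /* state 2 */
    PS_ALL_FROZEN = 3           /* state 3 */
} PageState;

static bool
transition_allowed(PageState from, PageState to, bool index_pass_done)
{
    /*
     * Pages on the "dead tuples" path (1 or 2) can only return to the
     * "clean" path (0 or 3) once an index vacuum pass has removed the
     * dead line pointers.
     */
    if ((from == PS_NEW_DELETES || from == PS_HINTED_NOT_FROZEN) &&
        (to == PS_NEW_INSERTS || to == PS_ALL_FROZEN))
        return index_pass_done;
    return true;
}

int
main(void)
{
    printf("%d\n", transition_allowed(PS_NEW_DELETES, PS_ALL_FROZEN, false)); /* 0 */
    printf("%d\n", transition_allowed(PS_NEW_DELETES, PS_ALL_FROZEN, true));  /* 1 */
    return 0;
}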

OK, I agree that a background process could be useful. Whenever it
sees a dirty page, it could attempt to aggressively set hint bits,
prune, mark all-visible, and freeze the page before that page gets
evicted. However, that doesn't require the sort of state map you're
proposing here.

I think your statement about "pages that were in those states are
still likely to be in the disk cache" is not really true. I mean, if
we're doing OLTP, yes. But not if we're bulk-loading.

Right, but at least we'd know we had a table with a load of unhinted or
newly dead tuples. That means there's cleanup work we can do without
needing an index pass.

Not needing to scan everything that's frozen is thanks to state 3. I think
it's OK (at least for now) if only vacuum puts pages into this state, which
means it can actually freeze the tuples when it does it (thanks to 37484ad
we won't lose forensic data doing this). That means there's no extra work
necessary by a foreground process that's dirtying a page.

Did you notice the discussion on the other thread about this
increasing WAL volume by a factor of 113?

Yeah, though I'd forgotten about it. :(

I wonder if there's some way we can reduce that. I just looked at what
we WAL log for a freeze and it appears to only be xl_heap_freeze_tuple,
which if my math is correct is 12 bytes (11 ignoring alignment). I don't
understand how that can be 113 times worse than a plain vacuum.
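
For reference, my arithmetic, using a standalone approximation of that
struct (typedefs substituted for the backend's so it compiles by itself;
field layout as I read it in the 9.4 headers):

/* Standalone approximation of the 9.4-era struct, for size-checking only. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;   /* normally from c.h   */
typedef uint16_t OffsetNumber;    /* normally from off.h */

typedef struct xl_heap_freeze_tuple
{
    TransactionId xmax;
    OffsetNumber  offset;
    uint16_t      infomask2;
    uint16_t      infomask;
    uint8_t       frzflags;
} xl_heap_freeze_tuple;

int
main(void)
{
    /* Prints 12 on typical platforms: 11 bytes of fields plus padding. */
    printf("sizeof(xl_heap_freeze_tuple) = %zu\n",
           sizeof(xl_heap_freeze_tuple));
    return 0;
}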

I can't really follow why you've got these states to begin with. 0,
1, and 2 are all pretty much the same. The useful distinction AFAICS
is between not-all-visible, all-visible, and all-visible-plus-frozen.

Index scanning is probably the most expensive part of cleanup, so it
seems like it would be useful to be able to track that as
visible/frozen. (What would probably be more useful is a way to directly
link a heap tuple to any index tuples pointing at it, but that would
certainly be a lot harder to do.)

There's also the idea of being proactive about hinting and pruning,
instead of foisting that onto later foreground processes or hoping that
vacuum comes along. Certainly the most obvious part is doing that before
buffers are evicted, but it's not uncommon for the OS cache to be 10x
larger (or more). Even if we can't hit these pages before they're all
the way on disk, if we at least know there's a pile of them we can do
something before a foreground process (or at least let the DBA know).
But if we are keeping the number of tuples in this state more limited
then perhaps a map isn't the right way to track it; I think that will
come down to a question of how that ties in with the long-lived states.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#4 Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#3)
Re: Improving vacuum/VM/etc

On Fri, Apr 24, 2015 at 4:11 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

[ lots of discussion of various states ]

This is pretty confused, because you are conflating whether a page is
hinted and/or frozen with whether there are dead tuples on it. Those
two things are entirely unrelated.

Did you notice the discussion on the other thread about this
increasing WAL volume by a factor of 113?

Yeah, though I'd forgotten about it. :(

I wonder if there's some way we can reduce that. I just looked at what we
WAL log for a freeze and it appears to only be xl_heap_freeze_tuple, which
if my math is correct is 12 bytes (11 ignoring alignment). I don't
understand how that can be 113 times worse than a plain vacuum.

Freezing requires full page writes. Setting the all-visible bit doesn't.

I can't really follow why you've got these states to begin with. 0,
1, and 2 are all pretty much the same. The useful distinction AFAICS
is between not-all-visible, all-visible, and all-visible-plus-frozen.

Index scanning is probably the most expensive part of cleanup, so it seems
like it would be useful to be able to track that as visible/frozen.

The existing visibility map already does *exactly* that.

(What
would probably be more useful is a way to directly link a heap tuple to any
index tuples pointing at it, but that would certainly be a lot harder to
do.)

That's a complete non-starter. Even if you could make it work, it
would eat up huge quantities of disk space, generate massive
additional disk writes every time an index tuple was moved (which
happens constantly), and generally destroy performance completely.

There's also the idea of being proactive about hinting and pruning, instead
of foisting that onto later foreground processes or hoping that vacuum comes
along. Certainly the most obvious part is doing that before buffers are
evicted, but it's not uncommon for the OS cache to be 10x larger (or more).
Even if we can't hit these pages before they're all the way on disk, if we
at least know there's a pile of them we can do something before a foreground
process (or at least let the DBA know). But if we are keeping the number of
tuples in this state more limited then perhaps a map isn't the right way to
track it; I think that will come down to a question of how that ties in with
the long-lived states.

I think having a background process that prunes before eviction is an
idea worth exploring, but it's a totally separate question from what
states we need to track in a visibility map or similar.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
