reloption to prevent VACUUM from truncating empty pages at the end of relation
Hi,
I'd like to propose adding $SUBJECT as a performance improvement.
When VACUUM tries to truncate the trailing empty pages, it scans shared_buffers
to invalidate the pages to be truncated while holding an AccessExclusiveLock on
the relation. So if shared_buffers is huge, other transactions need to wait for
a very long time before accessing the relation, which can cause
response-time spikes. For example, I observed such spikes several times on
a server with shared_buffers = 300GB while running a benchmark.
Therefore, I'm proposing $SUBJECT so that it can be enabled to avoid such spikes
for that relation.
Also, if other transactions need to extend the relation (i.e., need new
pages) as soon as VACUUM truncates the empty pages at the end, that
truncation is not helpful for performance. In this case, the truncation
and extension of the relation are unnecessarily repeated, which decreases
performance. $SUBJECT would also alleviate this situation.
Thoughts?
Regards,
--
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
When VACUUM tries to truncate the trailing empty pages, it scans shared_buffers
to invalidate the pages to be truncated while holding an AccessExclusiveLock on
the relation. So if shared_buffers is huge, other transactions need to wait for
a very long time before accessing the relation, which can cause
response-time spikes. For example, I observed such spikes several times on
a server with shared_buffers = 300GB while running a benchmark.
Therefore, I'm proposing $SUBJECT so that it can be enabled to avoid such spikes
for that relation.
I think that the real problem here is having to do a scan of all of shared
buffers. VACUUM's not the only thing that has to do that, there's also
e.g. DROP and TRUNCATE. So rather than a klugy solution that only fixes
VACUUM (and not very well, requiring user intervention and an unpleasant
tradeoff), we ought to look at ways to avoid needing a whole-pool scan to
find the pages belonging to one relation. In the past we've been able to
skate by without a decent solution for that because shared buffers were
customarily not all that big. But if we're going to start considering
huge buffer pools to be a case we want to have good performance for,
that's got to change.
regards, tom lane
On Tue, Apr 17, 2018 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So rather than a klugy solution that only fixes
VACUUM (and not very well, requiring user intervention and an unpleasant
tradeoff), we ought to look at ways to avoid needing a whole-pool scan to
find the pages belonging to one relation. In the past we've been able to
skate by without a decent solution for that because shared buffers were
customarily not all that big. But if we're going to start considering
huge buffer pools to be a case we want to have good performance for,
that's got to change.
Andres mentioned that he has prototyped an approach to buffer
management that uses a Radix tree, which is generally assumed to be
the right long-term fix.
--
Peter Geoghegan
Tom Lane wrote:
Fujii Masao <masao.fujii@gmail.com> writes:
When VACUUM tries to truncate the trailing empty pages, it scans shared_buffers
to invalidate the pages to be truncated while holding an AccessExclusiveLock on
the relation. So if shared_buffers is huge, other transactions need to wait for
a very long time before accessing the relation, which can cause
response-time spikes. For example, I observed such spikes several times on
a server with shared_buffers = 300GB while running a benchmark.
Therefore, I'm proposing $SUBJECT so that it can be enabled to avoid such spikes
for that relation.
I think that the real problem here is having to do a scan of all of shared
buffers. VACUUM's not the only thing that has to do that, there's also
e.g. DROP and TRUNCATE. So rather than a klugy solution that only fixes
VACUUM (and not very well, requiring user intervention and an unpleasant
tradeoff), we ought to look at ways to avoid needing a whole-pool scan to
find the pages belonging to one relation. In the past we've been able to
skate by without a decent solution for that because shared buffers were
customarily not all that big. But if we're going to start considering
huge buffer pools to be a case we want to have good performance for,
that's got to change.
Andres was working on a radix tree structure to fix this problem, but
that seems to be abandoned now, and it seems a major undertaking. While
I agree that the proposed solution is a wart, it seems much better than
no solution at all. Can we consider Fujii's proposal as a temporary
measure until we fix shared buffers? I'm +1 on it myself.
We've seen this problem also affecting a production workload pretty
severely, though shared_buffers is not as big.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Andres was working on a radix tree structure to fix this problem, but
that seems to be abandoned now, and it seems a major undertaking. While
I agree that the proposed solution is a wart, it seems much better than
no solution at all. Can we consider Fujii's proposal as a temporary
measure until we fix shared buffers? I'm +1 on it myself.
Once we've introduced a user-visible reloption it's going to be
practically impossible to get rid of it, so I'm -1. I'd much rather
see somebody put some effort into the radix-tree idea than introduce
a kluge that we'll be stuck with, and that doesn't even provide a
good user experience. Disabling vacuum truncation is *not* something
that I think we should recommend.
regards, tom lane
On 2018-04-17 15:09:18 -0400, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Andres was working on a radix tree structure to fix this problem, but
that seems to be abandoned now, and it seems a major undertaking.
I hope to re-ignite work on that later in the v12 cycle. But
realistically that means it's not going to be mergable for v12.
While I agree that the proposed solution is a wart, it seems much
better than no solution at all. Can we consider Fujii's proposal as
a temporary measure until we fix shared buffers? I'm +1 on it
myself.
Once we've introduced a user-visible reloption it's going to be
practically impossible to get rid of it, so I'm -1.
It's not much work to maintain though? And even the brief AEL lock can
cause troubles, leaving the scan aside. So I'm like +0.1 or such.
Greetings,
Andres Freund
On Tue, Apr 17, 2018 at 12:12:26PM -0700, Andres Freund wrote:
On 2018-04-17 15:09:18 -0400, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Andres was working on a radix tree structure to fix this problem, but
that seems to be abandoned now, and it seems a major undertaking.
I hope to re-ignite work on that later in the v12 cycle. But
realistically that means it's not going to be mergable for v12.
Need a push of man-hours for that?
While I agree that the proposed solution is a wart, it seems much
better than no solution at all. Can we consider Fujii's proposal as
a temporary measure until we fix shared buffers? I'm +1 on it
myself.
Once we've introduced a user-visible reloption it's going to be
practically impossible to get rid of it, so I'm -1.
It's not much work to maintain though? And even the brief AEL lock can
cause troubles, leaving the scan aside. So I'm like +0.1 or such.
I would say that if the radix tree patch can make it for the first
commit fest and has reviews, then there would likely be no need for this
reloption.
--
Michael
On April 17, 2018 6:00:59 PM PDT, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Apr 17, 2018 at 12:12:26PM -0700, Andres Freund wrote:
On 2018-04-17 15:09:18 -0400, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Andres was working on a radix tree structure to fix this problem, but
that seems to be abandoned now, and it seems a major undertaking.
I hope to re-ignite work on that later in the v12 cycle. But
realistically that means it's not going to be mergable for v12.
Need a push of man-hours for that?
Not sure what you mean?
While I agree that the proposed solution is a wart, it seems much
better than no solution at all. Can we consider Fujii's proposal as
a temporary measure until we fix shared buffers? I'm +1 on it
myself.
Once we've introduced a user-visible reloption it's going to be
practically impossible to get rid of it, so I'm -1.
It's not much work to maintain though? And even the brief AEL lock can
cause troubles, leaving the scan aside. So I'm like +0.1 or such.
I would say that if the radix tree patch can make it for the first
commit fest and has reviews, then there would likely be no need for this
reloption.
There's no way it can.
Andres
On Tue, Apr 17, 2018 at 06:13:31PM -0700, Andres Freund wrote:
Not sure what you mean?
Do you need help on it? I suggest that I could undertake the proposed
patch and submit it earlier in the development cycle of v12.
--
Michael
On 2018-04-18 10:46:51 +0900, Michael Paquier wrote:
On Tue, Apr 17, 2018 at 06:13:31PM -0700, Andres Freund wrote:
Not sure what you mean?
Do you need help on it? I suggest that I could undertake the proposed
patch and submit it earlier in the development cycle of v12.
I think it's at the very least two months of serious development work to
get it into a state ready for submission. And a good chunk of that not
even sketched out. Replacing the hashtable is the easy part; the memory
management (complicated due to lock-freeness; I'm thinking of using a
variant of epoch-based reclamation) isn't really there, and the management
of shared "open relations" state is the other hard part...
So yes, I could use help on it, but it'll be a lot of actual design and
investigatory work.
Greetings,
Andres Freund
On Wed, Apr 18, 2018 at 7:46 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-04-18 10:46:51 +0900, Michael Paquier wrote:
On Tue, Apr 17, 2018 at 06:13:31PM -0700, Andres Freund wrote:
Not sure what you mean?
Do you need help on it? I suggest that I could undertake the proposed
patch and submit it earlier in the development cycle of v12.
I think it's at the very least two months of serious development work to
get it into a state ready for submission. And a good chunk of that not
even sketched out. Replacing the hashtable is the easy part; the memory
management (complicated due to lock-freeness; I'm thinking of using a
variant of epoch-based reclamation) isn't really there, and the management
of shared "open relations" state is the other hard part...
So yes, I could use help on it, but it'll be a lot of actual design and
investigatory work.
I think it makes sense to pursue that approach, but it might be worth
considering some alternative till we have it. I remember that last time
(in 2015) we discussed another solution [1] to this problem
(or a similar one), and we left it unattended in the hope that we would
get a better solution, but we are still in the same situation. I
think in general it is better to go with the approach which can fix
the root cause of the problem, but if that is going to take a long
time, it is not terrible to provide some workable solution which can
help users.
[1]: /messages/by-id/CAA4eK1JPLGjpMeJ5YLNE7bpNBhP2EQe_rDR+Aw3atNfj9WkAGg@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 17, 2018 at 11:05 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Hi,
I'd like to propose adding $SUBJECT as a performance improvement.
When VACUUM tries to truncate the trailing empty pages, it scans shared_buffers
to invalidate the pages to be truncated while holding an AccessExclusiveLock on
the relation. So if shared_buffers is huge, other transactions need to wait for
a very long time before accessing the relation, which can cause
response-time spikes. For example, I observed such spikes several times on
a server with shared_buffers = 300GB while running a benchmark.
Therefore, I'm proposing $SUBJECT so that it can be enabled to avoid such spikes
for that relation.
Alvaro reminded me that we already have a mechanism in place which forces
VACUUM to give up the exclusive lock if another backend has been waiting on
the lock for more than a certain pre-defined duration. AFAICS we give up the
lock, but then retry truncation from where it previously left off.
What if we make that lock-wait duration configurable on a per-table basis?
And maybe a special value to never truncate (though that seems quite
excessive to me and a possible footgun).
I was actually thinking in the other direction. Between the time VACUUM
figures out it can possibly truncate the last K pages, some backend may insert
a tuple in some page and make the truncation impossible. What if we
truncate the FSM before starting the backward scan, so that new inserts go
into the pages prior to the truncation point if possible? That would
increase the chances of VACUUM being able to truncate all the empty pages,
though I think in some cases it might lead to unnecessary further extension
of the relation. Maybe we could use some heuristic based on the available
free space in the table prior to the truncation point?
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17 April 2018 at 20:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Andres was working on a radix tree structure to fix this problem, but
that seems to be abandoned now, and it seems a major undertaking. While
I agree that the proposed solution is a wart, it seems much better than
no solution at all. Can we consider Fujii's proposal as a temporary
measure until we fix shared buffers? I'm +1 on it myself.Once we've introduced a user-visible reloption it's going to be
practically impossible to get rid of it, so I'm -1. I'd much rather
see somebody put some effort into the radix-tree idea than introduce
a kluge that we'll be stuck with, and that doesn't even provide a
good user experience. Disabling vacuum truncation is *not* something
that I think we should recommend.
The truncation at the end of VACUUM takes an AccessExclusiveLock,
which is already user visible. Using a radix tree won't alter that.
ISTM the user might be interested in having the *lock* NOT happen, so
I am +1 for the suggestion regardless of whether radix tree ever
happens.
The lock itself can be cancelled, so the user would also be interested
in explicitly requesting a retry with a separate command/function.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 18, 2018 at 11:29 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Tue, Apr 17, 2018 at 11:05 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Hi,
I'd like to propose adding $SUBJECT as a performance improvement.
When VACUUM tries to truncate the trailing empty pages, it scans shared_buffers
to invalidate the pages to be truncated while holding an AccessExclusiveLock on
the relation. So if shared_buffers is huge, other transactions need to wait for
a very long time before accessing the relation, which can cause
response-time spikes. For example, I observed such spikes several times on
a server with shared_buffers = 300GB while running a benchmark.
Therefore, I'm proposing $SUBJECT so that it can be enabled to avoid such spikes
for that relation.
Alvaro reminded me that we already have a mechanism in place which forces
VACUUM to give up the exclusive lock if another backend has been waiting on
the lock for more than a certain pre-defined duration. AFAICS we give up the
lock, but then retry truncation from where it previously left off.
What if we make that lock-wait duration configurable on a per-table basis?
And maybe a special value to never truncate (though that seems quite
excessive to me and a possible footgun).
I'm not sure if it's safe to forcibly cancel VACUUM's truncation while it is
scanning shared_buffers. That scan happens after WAL-logging and before
the actual truncation.
I was actually thinking in the other direction. Between the time VACUUM
figures out it can possibly truncate the last K pages, some backend may insert
a tuple in some page and make the truncation impossible. What if we truncate
the FSM before starting the backward scan, so that new inserts go into the
pages prior to the truncation point if possible? That would increase the
chances of VACUUM being able to truncate all the empty pages, though I think
in some cases it might lead to unnecessary further extension of the
relation. Maybe we could use some heuristic based on the available free
space in the table prior to the truncation point?
Isn't this too complicated? I wonder what heuristic we can use here.
Regards,
--
Fujii Masao
On Wed, Apr 18, 2018 at 10:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I'm not sure if it's safe to forcibly cancel VACUUM's truncation while it is
scanning shared_buffers. That scan happens after WAL-logging and before
the actual truncation.
Ah ok. I misread your proposal. This is about the shared_buffers scan
in DropRelFileNodeBuffers() and we can't cancel that operation.
What if we remember the buffers as seen by count_nondeletable_pages() and
then just discard those specific buffers instead of scanning the entire
shared_buffers again? Surely we revisit all to-be-truncated blocks before
actual truncation. So we already know which buffers to discard. And we're
holding exclusive lock at that point, so nothing can change underneath. Of
course, we can't really remember a large number of buffers, so we can do
this in small chunks. Scan last K blocks, remember those K buffers, discard
those K buffers, truncate the relation and then try for next K blocks. If
another backend requests lock on the table, we give up or retry after a
while.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
What if we remember the buffers as seen by count_nondeletable_pages() and
then just discard those specific buffers instead of scanning the entire
shared_buffers again?
That's an idea.
Surely we revisit all to-be-truncated blocks before
actual truncation. So we already know which buffers to discard. And we're
holding exclusive lock at that point, so nothing can change underneath. Of
course, we can't really remember a large number of buffers, so we can do
this in small chunks.
Hm? We're deleting the last N consecutive blocks, so it seems like we
just need to think in terms of clearing that range. I think this can
just be a local logic change inside DropRelFileNodeBuffers().
You could optimize it fairly easily with some heuristic that compares
N to sizeof shared buffers; if it's too large a fraction, the existing
implementation will be cheaper than a bunch of hashtable probes.
regards, tom lane
I wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
What if we remember the buffers as seen by count_nondeletable_pages() and
then just discard those specific buffers instead of scanning the entire
shared_buffers again?
That's an idea.
BTW, before pushing too hard on any of this, we need to think about the
data-corruption hazard that MauMau just reminded us about. I'm afraid
what we're likely to end up with after the dust settles is worse
performance than today, not better :-(.
/messages/by-id/5BBC590AE8DF4ED1A170E4D48F1B53AC@tunaPC
regards, tom lane
On Wed, Apr 18, 2018 at 07:41:44PM +0530, Amit Kapila wrote:
I think it makes sense to pursue that approach, but it might be worth
considering some alternative till we have it. I remember last time
(in 2015) we have discussed some another solution [1] to this problem
(or similar) and we have left it unattended in the hope that we will
get a better solution, but we are still in the same situation. I
think in general it is better to go with the approach which can fix
the root cause of the problem, but if that is going to take a long
time, it is not terrible to provide some workable solution which can
help users.
Yeah, I can understand that feeling. When we talked about the
compression of FPWs back in 9.5, we discussed that if we had
double-writes then this would not be necessary, and we are still with
wal_compression but without double writes (actually, it happens that
compression of pages can also be used with double writes, but that's
enough hijacking of this thread...).
Then, let's consider the beginning of the first commit fest of v12 as the
judgement point. Implementing a radix tree for shared buffers is a long-term
project with no guarantee of getting merged, while a visibly simple
reloption helps in some cases...
--
Michael
From: Fujii Masao [mailto:masao.fujii@gmail.com]
a very long time before accessing the relation, which can cause
response-time spikes. For example, I observed such spikes several times on
a server with shared_buffers = 300GB while running a benchmark.
FYI, a long transaction took about 900 ms, while the average transaction response time was 150 ms or so. (I'm working with Fujii-san in this performance benchmark.)
Therefore, I'm proposing $SUBJECT so that it can be enabled to avoid such
spikes for that relation.
How about an integer variable to replace the following?
#define REL_TRUNCATE_FRACTION 16
Also, if other transactions need to extend the relation (i.e., need new
pages) as soon as VACUUM truncates the empty pages at the end, that
truncation is not helpful for performance. In this case, the truncation
and extension of the relation are unnecessarily repeated, which decreases
performance. $SUBJECT would also alleviate this situation.
I wonder if fillfactor=50 would alleviate this situation.
Regards
Takayuki Tsunakawa
Michael Paquier wrote:
Then, let's consider the beginning of the first commit fest of v12 as
judgement. Implementing radix tree for shared buffers is a long-term
project, which has no guarantee to get merged, while a visibly-simple
reloptions which helps in some cases...
In the scenario we studied, the truncations were causing periodic
hiccups which were quite severe. The truncations were completely
useless anyway because the table grew back to the original size daily (a
few dozen GBs I think). That was a lot of unnecessary work, and under
exclusive lock no less.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services