cost based vacuum (parallel)

Started by Amit Kapila, about 6 years ago, 44 messages
#1Amit Kapila
amit.kapila16@gmail.com

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

The initial version of the patch has a very rudimentary way of doing
it: each parallel vacuum worker operates independently w.r.t. vacuum
delay and cost. This will lead to more I/O in the system than the user
intended. Assume that the overall I/O allowed for a vacuum operation is
X, after which it will sleep for some time, reset the balance and
continue. In the patch, each worker is allowed to perform X before it
sleeps, and there is also no coordination with the master backend,
which would have done some I/O for the heap. So, in the worst-case
scenario, there can be n times more I/O, where n is the number of
workers doing the parallel operation. This is somewhat similar to a
memory usage problem with a parallel query, where each worker is
allowed to use up to work_mem of memory. We can say that users running
a parallel operation can expect more system resources to be used as
they want to get the operation done faster, so we are fine with this.
However, I am not sure that is the right thing, so we should try to
come up with some solution for it, and if the solution is too complex,
then we can probably think of documenting such behavior.

The two approaches to solve this problem being discussed in that
thread [1] are as follows:
(a) Allow the parallel workers and master backend to have a shared
view of vacuum cost related parameters (mainly VacuumCostBalance) and
allow each worker to update it and then based on that decide whether
it needs to sleep. Sawada-San has done the POC for this approach.
See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
drawback of this approach could be that we allow the worker to sleep
even though the I/O has been performed by some other worker.
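
To make this a bit more concrete, a rough sketch of how such a
shared-balance check could look is below. This is only an illustration;
the function name, the use of a pg_atomic_uint32 in shared memory, and
the exact bookkeeping are assumptions for the sake of the example, not
what the POC patch actually does:

#include "postgres.h"
#include "miscadmin.h"
#include "port/atomics.h"

/*
 * Illustrative sketch of approach (a): a single cost balance lives in
 * dynamic shared memory and every participant (leader and workers)
 * adds its locally accumulated cost to it.  Whichever backend pushes
 * the total past the limit is the one that sleeps.
 */
static void
parallel_vacuum_delay_point(pg_atomic_uint32 *shared_balance)
{
    uint32      total;

    /* publish the cost this backend accumulated since the last check */
    total = pg_atomic_add_fetch_u32(shared_balance, VacuumCostBalance);
    VacuumCostBalance = 0;

    if (total >= (uint32) VacuumCostLimit)
    {
        /* the backend that notices the overrun pays the sleep */
        pg_atomic_fetch_sub_u32(shared_balance, VacuumCostLimit);
        pg_usleep((long) (VacuumCostDelay * 1000L));
    }
}

The drawback mentioned above is visible directly in this sketch: the
backend that happens to push the shared total over the limit sleeps,
even if most of that cost was contributed by a different worker.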

(b) The other idea could be that we split the I/O among workers
something similar to what we do for auto vacuum workers (see
autovac_balance_cost). The basic idea would be that before launching
workers, we need to compute the remaining I/O (heap operation would
have used something) after which we need to sleep and split it equally
across workers. Here, we are primarily thinking of dividing
VacuumCostBalance and VacuumCostLimit parameters. Once the workers
are finished, they need to let master backend know how much I/O they
have consumed and then master backend can add it to it's current I/O
consumed. I think we also need to rebalance the cost of remaining
workers once some of the worker's exit. Dilip has prepared a POC
patch for this, see 0002-POC-divide-vacuum-cost-limit in email [3].
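
In code form, the division step for approach (b) could roughly look as
follows. Again, this is just a sketch for illustration; the struct and
function names are made up here, and the real POC patch may do this
differently:

#include "postgres.h"
#include "miscadmin.h"

/*
 * Illustrative sketch of approach (b): before launching nworkers
 * parallel workers, the leader splits its current cost limit and cost
 * balance among all participants, in the same spirit as
 * autovac_balance_cost() does for autovacuum workers.
 */
typedef struct ParallelVacuumCostShare
{
    int         cost_limit;     /* per-participant share of VacuumCostLimit */
    int         cost_balance;   /* per-participant share of the balance */
} ParallelVacuumCostShare;

static void
divide_vacuum_cost(ParallelVacuumCostShare *share, int nworkers)
{
    int         nparticipants = nworkers + 1;   /* the leader takes part too */

    share->cost_limit = Max(VacuumCostLimit / nparticipants, 1);
    share->cost_balance = VacuumCostBalance / nparticipants;
}

Each participant would then run the usual delay logic against its own
share, report the balance it actually consumed when it finishes, and
the leader would re-run the division for the remaining participants.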

I think approach-2 is better in throttling the system as it doesn't
have the drawback of the first approach, but it might be a bit tricky
to implement.

As of now, the POC for both the approaches has been developed and we
see similar results for both approaches, but we have tested simpler
cases where each worker has similar amount of I/O to perform.

Thoughts?

[1]: https://commitfest.postgresql.org/25/1774/
[2]: /messages/by-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD+XOh_Q@mail.gmail.com
[3]: /messages/by-id/CAFiTN-thU-z8f04jO7xGMu5yUUpTpsBTvBrFW6EhRf-jGvEz=g@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#2Darafei "Komяpa" Praliaskouski
me@komzpa.net
In reply to: Amit Kapila (#1)
Re: cost based vacuum (parallel)

This is somewhat similar to a memory usage problem with a
parallel query where each worker is allowed to use up to work_mem of
memory. We can say that the users using parallel operation can expect
more system resources to be used as they want to get the operation
done faster, so we are fine with this. However, I am not sure if that
is the right thing, so we should try to come up with some solution for
it and if the solution is too complex, then probably we can think of
documenting such behavior.

In cloud environments (Amazon + gp2) there's a budget on input/output
operations. If you exceed it for a long time, everything starts looking
like you're working with a floppy disk.

For ease of configuration, I would need a "max_vacuum_disk_iops" that
would limit the number of input/output operations by all of the vacuums in
the system. If I set it to less than the budget refill rate, I can be sure
that no vacuum runs fast enough to impact any sibling query.

There's also value in non-throttled VACUUM for smaller tables. On gp2 such
things will be consumed out of the surge budget, whose size is known to the
sysadmin. Let's call it "max_vacuum_disk_surge_iops": if a relation has
fewer blocks than this value and the situation is blocking in any way
(antiwraparound, interactive console, ...), go on and run without
throttling.

For how to balance the cost: if we know the number of vacuum processes that
were running in the previous second, we can just divide this iteration's
slot by that previous number.
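
For example (using the GUC I proposed above purely for illustration):
with max_vacuum_disk_iops = 1000 and three vacuum processes active
during the previous second, each of them would get a slot of roughly
333 operations for the current second.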

To correct for overshoots, we can subtract the previous second's overshoot
from the next one's. That would also allow accounting for surge budget
usage and letting it refill, pausing all autovacuum for some time after a
manual one.

Accounting at a granularity finer than limiting the count of operations
once a second isn't beneficial for this use case.

Please don't forget that processing one page can turn into several IOPS
(read, write, WAL).

Does this make sense? :)

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#1)
Re: cost based vacuum (parallel)

On Mon, Nov 4, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think approach-2 is better in throttling the system as it doesn't
have the drawback of the first approach, but it might be a bit tricky
to implement.

I might be missing something, but I think that the drawback of
approach-1 could exist even with approach-2, depending on the index pages
loaded in shared buffers and the vacuum delay settings. Is that right?

Regards,

---
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#3)
Re: cost based vacuum (parallel)

On Mon, Nov 4, 2019 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 4, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think approach-2 is better in throttling the system as it doesn't
have the drawback of the first approach, but it might be a bit tricky
to implement.

I might be missing something but I think that there could be the
drawback of the approach-1 even on approach-2 depending on index pages
loaded on the shared buffer and the vacuum delay setting.

Can you be a bit more specific about this?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Darafei "Komяpa" Praliaskouski (#2)
Re: cost based vacuum (parallel)

On Mon, Nov 4, 2019 at 1:03 PM Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:

This is somewhat similar to a memory usage problem with a
parallel query where each worker is allowed to use up to work_mem of
memory. We can say that the users using parallel operation can expect
more system resources to be used as they want to get the operation
done faster, so we are fine with this. However, I am not sure if that
is the right thing, so we should try to come up with some solution for
it and if the solution is too complex, then probably we can think of
documenting such behavior.

In cloud environments (Amazon + gp2) there's a budget on input/output operations. If you cross it for long time, everything starts looking like you work with a floppy disk.

For the ease of configuration, I would need a "max_vacuum_disk_iops" that would limit number of input-output operations by all of the vacuums in the system. If I set it to less than value of budget refill, I can be sure than that no vacuum runs too fast to impact any sibling query.

There's also value in non-throttled VACUUM for smaller tables. On gp2 such things will be consumed out of surge budget, and its size is known to sysadmin. Let's call it "max_vacuum_disk_surge_iops" - if a relation has less blocks than this value and it's a blocking in any way situation (antiwraparound, interactive console, ...) - go on and run without throttling.

I think the need for these things can be addressed by current
cost-based-vacuum parameters. See docs [1]. For example, if you set
vacuum_cost_delay as zero, it will allow the operation to be performed
without throttling.

For how to balance the cost: if we know a number of vacuum processes that were running in the previous second, we can just divide a slot for this iteration by that previous number.

To correct for overshots, we can subtract the previous second's overshot from next one's. That would also allow to account for surge budget usage and let it refill, pausing all autovacuum after a manual one for some time.

Precision of accounting limiting count of operations more than once a second isn't beneficial for this use case.

I think it is better if we find a way to rebalance the cost when some
worker exits rather than every second, as it won't change anyway unless
a worker exits.

[1]: https://www.postgresql.org/docs/devel/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-VACUUM-COST

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#6Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Kapila (#4)
Re: cost based vacuum (parallel)

On Mon, 4 Nov 2019 at 19:26, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 4, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think approach-2 is better in throttling the system as it doesn't
have the drawback of the first approach, but it might be a bit tricky
to implement.

I might be missing something but I think that there could be the
drawback of the approach-1 even on approach-2 depending on index pages
loaded on the shared buffer and the vacuum delay setting.

Can you be a bit more specific about this?

Suppose there are two indexes: one index is loaded in shared buffers
while another isn't. One vacuum worker that processes the former index
hits all pages in shared buffers, but another worker that processes
the latter index reads all pages from either the OS page cache or disk.
Even if both the cost limit and the cost balance are split evenly
among workers, because the costs of page hits and page misses are
different, it's possible that one vacuum worker sleeps while other
workers are doing I/O.
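
For example, with the default cost parameters (vacuum_cost_page_hit = 1,
vacuum_cost_page_miss = 10, vacuum_cost_limit = 200, just for
illustration), if the limit is split evenly between two workers (100
each), the worker reading the uncached index reaches its share after
only 10 page misses and has to sleep, while the worker on the cached
index can go through 100 pages between sleeps. So how strongly each
worker is throttled depends mostly on which index it happened to get.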

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#1)
Re: cost based vacuum (parallel)

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

Cheers,

Jeff

#8Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#1)
Re: cost based vacuum (parallel)

Hi,

On 2019-11-04 12:24:35 +0530, Amit Kapila wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

The initial version of the patch has a very rudimentary way of doing
it which means each parallel vacuum worker operates independently
w.r.t vacuum delay and cost.

Yea, that seems not ok for cases where vacuum delay is active.

There's also the question of when/why it is beneficial to use
parallelism when you're going to encounter IO limits in all likelihood.

This will lead to more I/O in the system
than the user has intended to do. Assume that the overall I/O allowed
for vacuum operation is X after which it will sleep for some time,
reset the balance and continue. In the patch, each worker will be
allowed to perform X before which it can sleep and also there is no
coordination for the same with master backend which would have done
some I/O for the heap. So, in the worst-case scenario, there can be n
times more I/O where n is the number of workers doing the parallel
operation. This is somewhat similar to a memory usage problem with a
parallel query where each worker is allowed to use up to work_mem of
memory. We can say that the users using parallel operation can expect
more system resources to be used as they want to get the operation
done faster, so we are fine with this. However, I am not sure if that
is the right thing, so we should try to come up with some solution for
it and if the solution is too complex, then probably we can think of
documenting such behavior.

I mean for parallel query the problem wasn't really introduced in
parallel query, it existed before - and does still - for non-parallel
queries. And there's a complex underlying planning issue. I don't think
this is a good excuse for VACUUM, where none of the complex "number of
paths considered" issues etc apply.

The two approaches to solve this problem being discussed in that
thread [1] are as follows:
(a) Allow the parallel workers and master backend to have a shared
view of vacuum cost related parameters (mainly VacuumCostBalance) and
allow each worker to update it and then based on that decide whether
it needs to sleep. Sawada-San has done the POC for this approach.
See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
drawback of this approach could be that we allow the worker to sleep
even though the I/O has been performed by some other worker.

I don't understand this drawback.

(b) The other idea could be that we split the I/O among workers
something similar to what we do for auto vacuum workers (see
autovac_balance_cost). The basic idea would be that before launching
workers, we need to compute the remaining I/O (heap operation would
have used something) after which we need to sleep and split it equally
across workers. Here, we are primarily thinking of dividing
VacuumCostBalance and VacuumCostLimit parameters. Once the workers
are finished, they need to let master backend know how much I/O they
have consumed and then master backend can add it to it's current I/O
consumed. I think we also need to rebalance the cost of remaining
workers once some of the worker's exit. Dilip has prepared a POC
patch for this, see 0002-POC-divide-vacuum-cost-limit in email [3].

(b) doesn't strike me as advantageous. It seems quite possible that you
end up with one worker that has a lot more IO than others, leading to
unnecessary sleeps, even though the actually available IO budget has not
been used up. Quite easy to see how that'd lead to parallel VACUUM
having a lower throughput than a single threaded one.
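
To put rough numbers on it (using the defaults vacuum_cost_limit = 200
and vacuum_cost_page_dirty = 20 purely for illustration): split across
two workers, each gets a budget of 100, so a worker dirtying the pages
of one big index sleeps roughly every 5 pages, while the other worker,
handling a small and mostly clean index, finishes early and leaves most
of its 100 unused. A single backend with the full 200 budget would have
slept about half as often for the same total amount of I/O.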

Greetings,

Andres Freund

#9Andres Freund
andres@anarazel.de
In reply to: Jeff Janes (#7)
Re: cost based vacuum (parallel)

Hi,

On 2019-11-04 12:59:02 -0500, Jeff Janes wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

I am wondering the same - but to be fair, it's pretty easy to run into
cases where VACUUM is CPU bound. E.g. because most pages are in
shared_buffers, and compared to the size of the indexes number of tids
that need to be pruned is fairly small (also [1]). That means a lot of
pages need to be scanned, without a whole lot of IO going on. The
problem with that is just that the defaults for vacuum throttling will
also apply here, I've never seen anybody tune vacuum_cost_page_hit = 0,
vacuum_cost_page_dirty=0 or such (in contrast, the latter is the highest
cost currently). Nor do we reduce the cost of vacuum_cost_page_dirty
for unlogged tables.

So while it doesn't seem unreasonable to want to use cost limiting to
protect against vacuum unexpectedly causing too much, especially read,
IO, I'm doubtful it has current practical relevance.

I'm wondering how much of the benefit of parallel vacuum really is just
to work around vacuum ringbuffers often massively hurting performance
(see e.g. [2]). Surely not all, but I'd be very unsurprised if it were a
large fraction.

Greetings,

Andres Freund

[1]: I don't think the patch addresses this, IIUC it's only running index
vacuums in parallel, but it's very easy to run into being CPU
bottlenecked when vacuuming a busily updated table. heap_hot_prune
can be really expensive, especially with longer update chains (I
think it may have an O(n^2) worst case even).
[2]: /messages/by-id/20160406105716.fhk2eparljthpzp6@alap3.anarazel.de

#10Stephen Frost
sfrost@snowman.net
In reply to: Jeff Janes (#7)
Re: cost based vacuum (parallel)

Greetings,

* Jeff Janes (jeff.janes@gmail.com) wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

With parallelization across indexes, you could have a situation where
the individual indexes are on different tablespaces with independent
i/o, therefore the parallelization ends up giving you an increase in i/o
throughput, not just additional CPU time.

Thanks,

Stephen

#11Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#10)
Re: cost based vacuum (parallel)

Hi,

On 2019-11-04 14:06:19 -0500, Stephen Frost wrote:

* Jeff Janes (jeff.janes@gmail.com) wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

With parallelization across indexes, you could have a situation where
the individual indexes are on different tablespaces with independent
i/o, therefore the parallelization ends up giving you an increase in i/o
throughput, not just additional CPU time.

How's that related to IO throttling being active or not?

Greetings,

Andres Freund

#12Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#11)
Re: cost based vacuum (parallel)

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2019-11-04 14:06:19 -0500, Stephen Frost wrote:

* Jeff Janes (jeff.janes@gmail.com) wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

With parallelization across indexes, you could have a situation where
the individual indexes are on different tablespaces with independent
i/o, therefore the parallelization ends up giving you an increase in i/o
throughput, not just additional CPU time.

How's that related to IO throttling being active or not?

You might find that you have to throttle the IO down when operating
exclusively against one IO channel, but if you have multiple IO channels
then the acceptable IO utilization could be higher as it would be
spread across the different IO channels.

In other words, the overall i/o allowance for a given operation might be
able to be higher if it's spread across multiple i/o channels, as it
wouldn't completely consume the i/o resources of any of them, whereas
with a higher allowance and a single i/o channel, there would likely be
an impact to other operations.

As for whether this is really relevant only when it comes to parallel
operations, that's a bit of an interesting question - these considerations
might not require actual parallel operations, as a single process might
be able to go through multiple indexes concurrently and still hit the
i/o limit that was set for it overall across the tablespaces. I don't
know that it would actually be interesting or useful to spend the effort
to make that work though, so, from a practical perspective, it's
probably only interesting to think about this when talking about
parallel vacuum.

I've been wondering if the accounting system should consider the cost
per tablespace when there's multiple tablespaces involved, instead of
throttling the overall process without consideration for the
per-tablespace utilization.

Thanks,

Stephen

#13Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#12)
Re: cost based vacuum (parallel)

Hi,

On 2019-11-04 14:33:41 -0500, Stephen Frost wrote:

* Andres Freund (andres@anarazel.de) wrote:

On 2019-11-04 14:06:19 -0500, Stephen Frost wrote:

With parallelization across indexes, you could have a situation where
the individual indexes are on different tablespaces with independent
i/o, therefore the parallelization ends up giving you an increase in i/o
throughput, not just additional CPU time.

How's that related to IO throttling being active or not?

You might find that you have to throttle the IO down when operating
exclusively against one IO channel, but if you have multiple IO channels
then the acceptable IO utilization could be higher as it would be
spread across the different IO channels.

In other words, the overall i/o allowance for a given operation might be
able to be higher if it's spread across multiple i/o channels, as it
wouldn't completely consume the i/o resources of any of them, whereas
with a higher allowance and a single i/o channel, there would likely be
an impact to other operations.

As for if this is really relevant only when it comes to parallel
operations is a bit of an interesting question- these considerations
might not require actual parallel operations as a single process might
be able to go through multiple indexes concurrently and still hit the
i/o limit that was set for it overall across the tablespaces. I don't
know that it would actually be interesting or useful to spend the effort
to make that work though, so, from a practical perspective, it's
probably only interesting to think about this when talking about
parallel vacuum.

But you could just apply different budgets for different tablespaces?
That's quite doable independent of parallelism, as we don't have tables
or indexes spanning more than one tablespace. True, you could then make
the processing of an individual vacuum faster by allowing to utilize
multiple tablespace budgets at the same time.

I've been wondering if the accounting system should consider the cost
per tablespace when there's multiple tablespaces involved, instead of
throttling the overall process without consideration for the
per-tablespace utilization.

This all seems like a feature proposal, or two, independent of the
patch/question at hand. I think there's a good argument to be had that
we should severely overhaul the current vacuum cost limiting - it's way
way too hard to understand the bandwidth that it's allowed to
consume. But unless one of the proposals makes that measurably harder or
easier, I think we don't gain anything by entangling an already complex
patchset with something new.

Greetings,

Andres Freund

#14Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#13)
Re: cost based vacuum (parallel)

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2019-11-04 14:33:41 -0500, Stephen Frost wrote:

* Andres Freund (andres@anarazel.de) wrote:

On 2019-11-04 14:06:19 -0500, Stephen Frost wrote:

With parallelization across indexes, you could have a situation where
the individual indexes are on different tablespaces with independent
i/o, therefore the parallelization ends up giving you an increase in i/o
throughput, not just additional CPU time.

How's that related to IO throttling being active or not?

You might find that you have to throttle the IO down when operating
exclusively against one IO channel, but if you have multiple IO channels
then the acceptable IO utilization could be higher as it would be
spread across the different IO channels.

In other words, the overall i/o allowance for a given operation might be
able to be higher if it's spread across multiple i/o channels, as it
wouldn't completely consume the i/o resources of any of them, whereas
with a higher allowance and a single i/o channel, there would likely be
an impact to other operations.

As for if this is really relevant only when it comes to parallel
operations is a bit of an interesting question- these considerations
might not require actual parallel operations as a single process might
be able to go through multiple indexes concurrently and still hit the
i/o limit that was set for it overall across the tablespaces. I don't
know that it would actually be interesting or useful to spend the effort
to make that work though, so, from a practical perspective, it's
probably only interesting to think about this when talking about
parallel vacuum.

But you could just apply different budgets for different tablespaces?

Yes, that would be one approach to addressing this, though it would
change the existing meaning of those cost parameters. I'm not sure if
we think that's an issue or not - if we only have this in the case of a
parallel vacuum then it's probably fine; I'm less sure whether it'd be
alright to change that on an upgrade.

That's quite doable independent of parallelism, as we don't have tables
or indexes spanning more than one tablespace. True, you could then make
the processing of an individual vacuum faster by allowing to utilize
multiple tablespace budgets at the same time.

Yes, it's possible to do independent of parallelism, but what I was
trying to get at above is that it might not be worth the effort. When
it comes to parallel vacuum though, I'm not sure that you can just punt
on this question since you'll naturally end up spanning multiple
tablespaces concurrently, at least if the heap+indexes are spread across
multiple tablespaces and you're operating against more than one of those
relations at a time (which, I admit, I'm not 100% sure is actually
happening with this proposed patch set- if it isn't, then this isn't
really an issue, though that would be pretty unfortunate as then you
can't leverage multiple i/o channels concurrently and therefore Jeff's
question about why you'd be doing parallel vacuum with IO throttling is
a pretty good one).

Thanks,

Stephen

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#8)
Re: cost based vacuum (parallel)

On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote:

The two approaches to solve this problem being discussed in that
thread [1] are as follows:
(a) Allow the parallel workers and master backend to have a shared
view of vacuum cost related parameters (mainly VacuumCostBalance) and
allow each worker to update it and then based on that decide whether
it needs to sleep. Sawada-San has done the POC for this approach.
See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
drawback of this approach could be that we allow the worker to sleep
even though the I/O has been performed by some other worker.

I don't understand this drawback.

I think the problem could be that the system is not properly throttled
when it is supposed to be. Let me try a simple example: say we
have two workers w-1 and w-2. The w-2 is primarily doing the I/O and
w-1 is doing very little I/O, but unfortunately whenever w-1 checks it
finds that the cost_limit has been exceeded and it goes to sleep, but w-2
still continues. Now, in such a situation, even though we have made one
of the workers sleep for the required time, ideally the worker which
was doing the I/O should have slept. The aim is to make the system stop
doing I/O whenever the limit has been exceeded, so that might not work in
the above situation.

(b) The other idea could be that we split the I/O among workers
something similar to what we do for auto vacuum workers (see
autovac_balance_cost). The basic idea would be that before launching
workers, we need to compute the remaining I/O (heap operation would
have used something) after which we need to sleep and split it equally
across workers. Here, we are primarily thinking of dividing
VacuumCostBalance and VacuumCostLimit parameters. Once the workers
are finished, they need to let master backend know how much I/O they
have consumed and then master backend can add it to it's current I/O
consumed. I think we also need to rebalance the cost of remaining
workers once some of the worker's exit. Dilip has prepared a POC
patch for this, see 0002-POC-divide-vacuum-cost-limit in email [3].

(b) doesn't strike me as advantageous. It seems quite possible that you
end up with one worker that has a lot more IO than others, leading to
unnecessary sleeps, even though the actually available IO budget has not
been used up.

Yeah, this is possible, but to an extent, this is possible in the
current design as well where we balance the cost among autovacuum
workers. Now, it is quite possible that the current design itself is
not good and we don't want to do the same thing at another place, but
at least we will be consistent and can explain the overall behavior.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#9)
Re: cost based vacuum (parallel)

On Mon, Nov 4, 2019 at 11:58 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-11-04 12:59:02 -0500, Jeff Janes wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

I am wondering the same - but to be fair, it's pretty easy to run into
cases where VACUUM is CPU bound. E.g. because most pages are in
shared_buffers, and compared to the size of the indexes number of tids
that need to be pruned is fairly small (also [1]). That means a lot of
pages need to be scanned, without a whole lot of IO going on. The
problem with that is just that the defaults for vacuum throttling will
also apply here, I've never seen anybody tune vacuum_cost_page_hit = 0,
vacuum_cost_page_dirty=0 or such (in contrast, the latter is the highest
cost currently). Nor do we reduce the cost of vacuum_cost_page_dirty
for unlogged tables.

So while it doesn't seem unreasonable to want to use cost limiting to
protect against vacuum unexpectedly causing too much, especially read,
IO, I'm doubtful it has current practical relevance.

IIUC, you mean to say that it is not of much practical use to do a
parallel vacuum if I/O throttling is enabled for an operation, is that
right?

I'm wondering how much of the benefit of parallel vacuum really is just
to work around vacuum ringbuffers often massively hurting performance
(see e.g. [2]).

Yeah, it is a good thing to check, but if anything, I think a parallel
vacuum will further improve the performance with larger ring buffers
as it will make it more CPU bound.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#13)
Re: cost based vacuum (parallel)

On Tue, Nov 5, 2019 at 1:12 AM Andres Freund <andres@anarazel.de> wrote:

On 2019-11-04 14:33:41 -0500, Stephen Frost wrote:

I've been wondering if the accounting system should consider the cost
per tablespace when there's multiple tablespaces involved, instead of
throttling the overall process without consideration for the
per-tablespace utilization.

This all seems like a feature proposal, or two, independent of the
patch/question at hand. I think there's a good argument to be had that
we should severely overhaul the current vacuum cost limiting - it's way
way too hard to understand the bandwidth that it's allowed to
consume. But unless one of the proposals makes that measurably harder or
easier, I think we don't gain anything by entangling an already complex
patchset with something new.

+1. I think even if we want something related to per-tablespace
costing for vacuum (parallel), it should be done as a separate patch.
It is a whole new area where we need to define what is the appropriate
way to achieve. It is going to change the current vacuum costing
system in a big way which I don't think is reasonable to do as part of
a parallel vacuum patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#18Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#16)
Re: cost based vacuum (parallel)

On Tue, Nov 5, 2019 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 11:58 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-11-04 12:59:02 -0500, Jeff Janes wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

I am wondering the same - but to be fair, it's pretty easy to run into
cases where VACUUM is CPU bound. E.g. because most pages are in
shared_buffers, and compared to the size of the indexes number of tids
that need to be pruned is fairly small (also [1]). That means a lot of
pages need to be scanned, without a whole lot of IO going on. The
problem with that is just that the defaults for vacuum throttling will
also apply here, I've never seen anybody tune vacuum_cost_page_hit = 0,
vacuum_cost_page_dirty=0 or such (in contrast, the latter is the highest
cost currently). Nor do we reduce the cost of vacuum_cost_page_dirty
for unlogged tables.

So while it doesn't seem unreasonable to want to use cost limiting to
protect against vacuum unexpectedly causing too much, especially read,
IO, I'm doubtful it has current practical relevance.

IIUC, you mean to say that it is of not much practical use to do
parallel vacuum if I/O throttling is enabled for an operation, is that
right?

I'm wondering how much of the benefit of parallel vacuum really is just
to work around vacuum ringbuffers often massively hurting performance
(see e.g. [2]).

Yeah, it is a good thing to check, but if anything, I think a parallel
vacuum will further improve the performance with larger ring buffers
as it will make it more CPU bound.

I have tested the same, and the results show that with an increased ring
buffer size we can see a performance gain. And the gain is much
bigger with the parallel vacuum.

Test case:
create table test(a int, b int, c int, d int, e int, f int, g int, h int);
create index idx1 on test(a);
create index idx2 on test(b);
create index idx3 on test(c);
create index idx4 on test(d);
create index idx5 on test(e);
create index idx6 on test(f);
create index idx7 on test(g);
create index idx8 on test(h);
insert into test select i,i,i,i,i,i,i,i from generate_series(1,1000000) as i;
delete from test where a < 300000;

(I have tested the parallel vacuum and non-parallel vacuum with
different ring buffer sizes.)

8 indexes
ring buffer size 246kB -> non-parallel: 7.6 seconds, parallel (2 workers): 3.9 seconds
ring buffer size 256MB -> non-parallel: 6.1 seconds, parallel (2 workers): 3.2 seconds

4 indexes
ring buffer size 246kB -> non-parallel: 4.8 seconds, parallel (2 workers): 3.2 seconds
ring buffer size 256MB -> non-parallel: 3.8 seconds, parallel (2 workers): 2.6 seconds

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#19Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#18)
Re: cost based vacuum (parallel)

Hi,

On November 5, 2019 7:16:41 AM PST, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 5, 2019 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 11:58 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-11-04 12:59:02 -0500, Jeff Janes wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

I am wondering the same - but to be fair, it's pretty easy to run into
cases where VACUUM is CPU bound. E.g. because most pages are in
shared_buffers, and compared to the size of the indexes number of tids
that need to be pruned is fairly small (also [1]). That means a lot of
pages need to be scanned, without a whole lot of IO going on. The
problem with that is just that the defaults for vacuum throttling will
also apply here, I've never seen anybody tune vacuum_cost_page_hit = 0,
vacuum_cost_page_dirty=0 or such (in contrast, the latter is the highest
cost currently). Nor do we reduce the cost of vacuum_cost_page_dirty
for unlogged tables.

So while it doesn't seem unreasonable to want to use cost limiting to
protect against vacuum unexpectedly causing too much, especially read,
IO, I'm doubtful it has current practical relevance.

IIUC, you mean to say that it is of not much practical use to do
parallel vacuum if I/O throttling is enabled for an operation, is that
right?

I'm wondering how much of the benefit of parallel vacuum really is just
to work around vacuum ringbuffers often massively hurting performance
(see e.g. [2]).

Yeah, it is a good thing to check, but if anything, I think a parallel
vacuum will further improve the performance with larger ring buffers
as it will make it more CPU bound.

I have tested the same and the results prove that increasing the ring
buffer size we can see the performance gain. And, the gain is much
more with the parallel vacuum.

Test case:
create table test(a int, b int, c int, d int, e int, f int, g int, h int);
create index idx1 on test(a);
create index idx2 on test(b);
create index idx3 on test(c);
create index idx4 on test(d);
create index idx5 on test(e);
create index idx6 on test(f);
create index idx7 on test(g);
create index idx8 on test(h);
insert into test select i,i,i,i,i,i,i,i from generate_series(1,1000000) as i;
delete from test where a < 300000;

(I have tested the parallel vacuum and non-parallel vacuum with
different ring buffer size)

Thanks!

8 index
ring buffer size 246kb -> non-parallel: 7.6 seconds parallel (2 worker): 3.9 seconds
ring buffer size 256mb -> non-parallel: 6.1 seconds parallel (2 worker): 3.2 seconds

4 index
ring buffer size 246kb -> non-parallel: 4.8 seconds parallel (2 worker): 3.2 seconds
ring buffer size 256mb -> non-parallel: 3.8 seconds parallel (2 worker): 2.6 seconds

What about the case of just disabling the ring buffer logic?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#20Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#19)
Re: cost based vacuum (parallel)

On Tue, Nov 5, 2019 at 8:49 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On November 5, 2019 7:16:41 AM PST, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 5, 2019 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 11:58 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-11-04 12:59:02 -0500, Jeff Janes wrote:

On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

For parallel vacuum [1], we were discussing what is the best way to
divide the cost among parallel workers but we didn't get many inputs
apart from people who are very actively involved in patch development.
I feel that we need some more inputs before we finalize anything, so
starting a new thread.

Maybe a I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

I am wondering the same - but to be fair, it's pretty easy to run into
cases where VACUUM is CPU bound. E.g. because most pages are in
shared_buffers, and compared to the size of the indexes number of tids
that need to be pruned is fairly small (also [1]). That means a lot of
pages need to be scanned, without a whole lot of IO going on. The
problem with that is just that the defaults for vacuum throttling will
also apply here, I've never seen anybody tune vacuum_cost_page_hit = 0,
vacuum_cost_page_dirty=0 or such (in contrast, the latter is the highest
cost currently). Nor do we reduce the cost of vacuum_cost_page_dirty
for unlogged tables.

So while it doesn't seem unreasonable to want to use cost limiting to
protect against vacuum unexpectedly causing too much, especially read,
IO, I'm doubtful it has current practical relevance.

IIUC, you mean to say that it is of not much practical use to do
parallel vacuum if I/O throttling is enabled for an operation, is that
right?

I'm wondering how much of the benefit of parallel vacuum really is just
to work around vacuum ringbuffers often massively hurting performance
(see e.g. [2]).

Yeah, it is a good thing to check, but if anything, I think a parallel
vacuum will further improve the performance with larger ring buffers
as it will make it more CPU bound.

I have tested the same and the results prove that increasing the ring
buffer size we can see the performance gain. And, the gain is much
more with the parallel vacuum.

Test case:
create table test(a int, b int, c int, d int, e int, f int, g int, h int);
create index idx1 on test(a);
create index idx2 on test(b);
create index idx3 on test(c);
create index idx4 on test(d);
create index idx5 on test(e);
create index idx6 on test(f);
create index idx7 on test(g);
create index idx8 on test(h);
insert into test select i,i,i,i,i,i,i,i from generate_series(1,1000000) as i;
delete from test where a < 300000;

(I have tested the parallel vacuum and non-parallel vacuum with
different ring buffer size)

Thanks!

8 index
ring buffer size 246kb -> non-parallel: 7.6 seconds parallel (2 worker): 3.9 seconds
ring buffer size 256mb -> non-parallel: 6.1 seconds parallel (2 worker): 3.2 seconds

4 index
ring buffer size 246kb -> non-parallel: 4.8 seconds parallel (2 worker): 3.2 seconds
ring buffer size 256mb -> non-parallel: 3.8 seconds parallel (2 worker): 2.6 seconds

What about the case of just disabling the ring buffer logic?

I repeated the same test after disabling the ring buffer logic. The results
are almost the same as with the increased ring buffer size.

Tested with 4GB shared buffers:

8 indexes
use shared buffers -> non-parallel: 6.2 seconds, parallel (2 workers): 3.3 seconds

4 indexes
use shared buffers -> non-parallel: 3.8 seconds, parallel (2 workers): 2.7 seconds

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#14)
Re: cost based vacuum (parallel)

On Tue, Nov 5, 2019 at 1:42 AM Stephen Frost <sfrost@snowman.net> wrote:

* Andres Freund (andres@anarazel.de) wrote:

That's quite doable independent of parallelism, as we don't have tables
or indexes spanning more than one tablespace. True, you could then make
the processing of an individual vacuum faster by allowing to utilize
multiple tablespace budgets at the same time.

Yes, it's possible to do independent of parallelism, but what I was
trying to get at above is that it might not be worth the effort. When
it comes to parallel vacuum though, I'm not sure that you can just punt
on this question since you'll naturally end up spanning multiple
tablespaces concurrently, at least if the heap+indexes are spread across
multiple tablespaces and you're operating against more than one of those
relations at a time

Each parallel worker operates on a separate index. It might be worth
exploring per-tablespace vacuum throttling, but that should not be a
requirement for the currently proposed patch.

As per the feedback in this thread, it seems that, for now, it is better
if we allow a parallel vacuum only when I/O throttling is not
enabled. We can later extend it based on feedback from the field once
the feature starts getting used.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#22Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#21)
Re: cost based vacuum (parallel)

Hi,

On 2019-11-06 07:53:09 +0530, Amit Kapila wrote:

As per feedback in this thread, it seems that for now, it is better,
if we can allow a parallel vacuum only when I/O throttling is not
enabled. We can later extend it based on feedback from the field once
the feature starts getting used.

That's not my read on this thread. I don't think we should introduce
this feature without a solution for the throttling.

Greetings,

Andres Freund

#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#22)
Re: cost based vacuum (parallel)

On Wed, Nov 6, 2019 at 7:55 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-11-06 07:53:09 +0530, Amit Kapila wrote:

As per feedback in this thread, it seems that for now, it is better,
if we can allow a parallel vacuum only when I/O throttling is not
enabled. We can later extend it based on feedback from the field once
the feature starts getting used.

That's not my read on this thread. I don't think we should introduce
this feature without a solution for the throttling.

Okay, then I misunderstood your response to Jeff's email [1]. Anyway,
we have already explored two different approaches, as mentioned in the
initial email, which show somewhat similar results in initial tests.
So, we can explore more along those lines. Do you have any preference or
any other idea?

[1]: /messages/by-id/20191104182829.57bkz64qn5k3uwc3@alap3.anarazel.de

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#24Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#21)
Re: cost based vacuum (parallel)

Greetings,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Tue, Nov 5, 2019 at 1:42 AM Stephen Frost <sfrost@snowman.net> wrote:

* Andres Freund (andres@anarazel.de) wrote:

That's quite doable independent of parallelism, as we don't have tables
or indexes spanning more than one tablespace. True, you could then make
the processing of an individual vacuum faster by allowing to utilize
multiple tablespace budgets at the same time.

Yes, it's possible to do independent of parallelism, but what I was
trying to get at above is that it might not be worth the effort. When
it comes to parallel vacuum though, I'm not sure that you can just punt
on this question since you'll naturally end up spanning multiple
tablespaces concurrently, at least if the heap+indexes are spread across
multiple tablespaces and you're operating against more than one of those
relations at a time

Each parallel worker operates on a separate index. It might be worth
exploring per-tablespace vacuum throttling, but that should not be a
requirement for the currently proposed patch.

Right, that each operates on a separate index in parallel is what I had
figured was probably happening, and that's why I brought up the question
of "well, what does IO throttling mean when you've got multiple
tablespaces involved with presumably independent IO channels...?" (or,
at least, that's what I was trying to go for).

This isn't a question with the current system and way the code works
within a single vacuum operation, as we're never operating on more than
one relation concurrently in that case.

Of course, we don't currently do anything to manage IO utilization
across tablespaces when there are multiple autovacuum workers running
concurrently, which I suppose goes to Andres' point that we aren't
really doing anything to deal with this today and therefore this is
perhaps not all that new of an issue just with the addition of
parallel vacuum. I'd still argue that it becomes a lot more apparent
when you're talking about one parallel vacuum, but ultimately we should
probably be thinking about how to manage the resources across all the
vacuums and tablespaces and queries and such.

In an ideal world, we'd track the i/o from front-end queries, have some
idea of the total i/o possible for each IO channel, and allow vacuum and
whatever other background processes need to run to scale up and down,
with enough buffer to avoid ever being maxed out on i/o, but keeping up
a consistent rate of i/o that lets everything finish as quickly as
possible.

Thanks,

Stephen

#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#15)
Re: cost based vacuum (parallel)

On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote:

The two approaches to solve this problem being discussed in that
thread [1] are as follows:
(a) Allow the parallel workers and master backend to have a shared
view of vacuum cost related parameters (mainly VacuumCostBalance) and
allow each worker to update it and then based on that decide whether
it needs to sleep. Sawada-San has done the POC for this approach.
See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
drawback of this approach could be that we allow the worker to sleep
even though the I/O has been performed by some other worker.

I don't understand this drawback.

I think the problem could be that the system is not properly throttled
when it is supposed to be. Let me try by a simple example, say we
have two workers w-1 and w-2. The w-2 is primarily doing the I/O and
w-1 is doing very less I/O but unfortunately whenever w-1 checks it
finds that cost_limit has exceeded and it goes for sleep, but w-1
still continues.

Typo in the above sentence. /but w-1 still continues/but w-2 still continues.

Now in such a situation even though we have made one
of the workers slept for a required time but ideally the worker which
was doing I/O should have slept. The aim is to make the system stop
doing I/O whenever the limit has exceeded, so that might not work in
the above situation.

One idea to fix this drawback is that if we somehow avoid letting the
workers that have done less or no I/O compared to other workers sleep,
then we can, to a good extent, ensure that workers which are doing more
I/O will be throttled more. What we can do is to allow a worker to
sleep only if it has performed I/O above a certain threshold and the
overall balance is more than the cost_limit set by the system. Then we
will allow the worker to sleep in proportion to the work done by it and
reduce VacuumSharedCostBalance by the amount consumed by the current
worker. Something like:

if (VacuumSharedCostBalance >= VacuumCostLimit &&
    MyCostBalance > threshold * VacuumCostLimit / workers)
{
    /* take only this worker's contribution out of the shared balance */
    VacuumSharedCostBalance -= MyCostBalance;
    /* sleep in proportion to the work done by this worker */
    Sleep(delay * MyCostBalance / VacuumSharedCostBalance);
}

Assume the threshold is 0.5; what that means is, if the worker has done
more than 50% of the work expected from it and the overall shared
cost balance is exceeded, then we will consider this worker for sleeping.
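
For instance, with two workers, threshold = 0.5 and a cost limit of 200
(just using the default vacuum_cost_limit for illustration), a worker
becomes a candidate for sleeping only once its own balance exceeds
0.5 * 200 / 2 = 50; a worker that has accumulated a balance of, say, 10
while the other has accumulated 190 would skip the sleep even though the
shared balance has reached the limit.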

What do you guys think?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#26Dilip Kumar
dilipbalaut@gmail.com
In reply to: Stephen Frost (#24)
Re: cost based vacuum (parallel)

On Wed, Nov 6, 2019 at 9:21 AM Stephen Frost <sfrost@snowman.net> wrote:

Greetings,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Tue, Nov 5, 2019 at 1:42 AM Stephen Frost <sfrost@snowman.net> wrote:

* Andres Freund (andres@anarazel.de) wrote:

That's quite doable independent of parallelism, as we don't have tables
or indexes spanning more than one tablespace. True, you could then make
the processing of an individual vacuum faster by allowing to utilize
multiple tablespace budgets at the same time.

Yes, it's possible to do independent of parallelism, but what I was
trying to get at above is that it might not be worth the effort. When
it comes to parallel vacuum though, I'm not sure that you can just punt
on this question since you'll naturally end up spanning multiple
tablespaces concurrently, at least if the heap+indexes are spread across
multiple tablespaces and you're operating against more than one of those
relations at a time

Each parallel worker operates on a separate index. It might be worth
exploring per-tablespace vacuum throttling, but that should not be a
requirement for the currently proposed patch.

Right, that each operates on a separate index in parallel is what I had
figured was probably happening, and that's why I brought up the question
of "well, what does IO throttling mean when you've got multiple
tablespaces involved with presumably independent IO channels...?" (or,
at least, that's what I was trying to go for).

This isn't a question with the current system and way the code works
within a single vacuum operation, as we're never operating on more than
one relation concurrently in that case.

Of course, we don't currently do anything to manage IO utilization
across tablespaces when there are multiple autovacuum workers running
concurrently, which I suppose goes to Andres' point that we aren't
really doing anything to deal with this today and therefore this is
perhaps not all that new of an issue just with the addition of
parallel vacuum. I'd still argue that it becomes a lot more apparent
when you're talking about one parallel vacuum, but ultimately we should
probably be thinking about how to manage the resources across all the
vacuums and tablespaces and queries and such.

In an ideal world, we'd track the i/o from front-end queries, have some
idea of the total i/o possible for each IO channel, and allow vacuum and
whatever other background processes need to run to scale up and down,
with enough buffer to avoid ever being maxed out on i/o, but keeping up
a consistent rate of i/o that lets everything finish as quickly as
possible.

IMHO, suppose in the future we improve I/O throttling per tablespace,
maybe by maintaining an independent balance for the relation and for
each of its indexes, or a combined balance for the indexes that are on
the same tablespace, and checking that balance against the tablespace's
I/O limit. If we get such a mechanism in the future, it seems it would
be easily extendable to parallel vacuum, wouldn't it? Because across
workers we can also track a tablespace-wise shared balance (if we go
with the shared costing approach, for example).
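
Just to make that concrete, a per-tablespace shared balance might be
laid out roughly as in the sketch below. This is purely illustrative;
the struct and field names are hypothetical and not from any posted
patch.

    /* Hypothetical sketch: one shared cost balance per tablespace. */
    typedef struct VacuumTsCostEntry
    {
        Oid                 tsoid;          /* tablespace OID */
        int                 ts_cost_limit;  /* cost limit for this tablespace */
        pg_atomic_uint32    balance;        /* shared balance, updated by all workers */
    } VacuumTsCostEntry;

    typedef struct VacuumTsCostState
    {
        int                 nentries;
        VacuumTsCostEntry   entries[FLEXIBLE_ARRAY_MEMBER];
    } VacuumTsCostState;

Whenever a worker accrues cost for a page, it would add the cost to the
entry for the tablespace of the relation or index it is processing, and
vacuum_delay_point() would check that entry's balance against that
tablespace's limit instead of a single global limit.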

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#27Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Kapila (#25)
Re: cost based vacuum (parallel)

On Wed, 6 Nov 2019 at 15:45, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote:

The two approaches to solve this problem being discussed in that
thread [1] are as follows:
(a) Allow the parallel workers and master backend to have a shared
view of vacuum cost related parameters (mainly VacuumCostBalance) and
allow each worker to update it and then based on that decide whether
it needs to sleep. Sawada-San has done the POC for this approach.
See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
drawback of this approach could be that we allow the worker to sleep
even though the I/O has been performed by some other worker.

I don't understand this drawback.

I think the problem could be that the system is not properly throttled
when it is supposed to be. Let me try by a simple example, say we
have two workers w-1 and w-2. The w-2 is primarily doing the I/O and
w-1 is doing very less I/O but unfortunately whenever w-1 checks it
finds that cost_limit has exceeded and it goes for sleep, but w-2
still continues.

Now in such a situation even though we have made one
of the workers slept for a required time but ideally the worker which
was doing I/O should have slept. The aim is to make the system stop
doing I/O whenever the limit has exceeded, so that might not work in
the above situation.

One idea to fix this drawback is that if we somehow avoid letting the
workers sleep which has done less or no I/O as compared to other
workers, then we can to a good extent ensure that workers which are
doing more I/O will be throttled more. What we can do is to allow any
worker sleep only if it has performed the I/O above a certain
threshold and the overall balance is more than the cost_limit set by
the system. Then we will allow the worker to sleep proportional to
the work done by it and reduce the VacuumSharedCostBalance by the
amount which is consumed by the current worker. Something like:

If ( VacuumSharedCostBalance >= VacuumCostLimit &&
MyCostBalance > (threshold) VacuumCostLimit / workers)
{
VacuumSharedCostBalance -= MyCostBalance;
Sleep (delay * MyCostBalance/VacuumSharedCostBalance)
}

Assume threshold be 0.5, what that means is, if it has done work more
than 50% of what is expected from this worker and the overall share
cost balance is exceeded, then we will consider this worker to sleep.

What do you guys think?

I think the idea that the workers consuming more I/O sleep for a
longer time seems good. It doesn't seem to have the drawback of
approach (b), which can unnecessarily delay vacuum if some indexes are
very small or if bulk-deletion of an index does almost nothing, as
with brin. But on the other hand it's possible that workers don't
sleep even if the shared cost balance already exceeds the limit,
because sleeping requires the local balance to exceed the worker's
share of the limit (the limit divided by the number of workers). For
example, one worker is scheduled, does I/O and exceeds the limit
substantially while the other two workers do little I/O. Then those
two workers are scheduled and consume I/O. The total cost balance
already exceeds the limit, but they will not sleep as long as their
local balance is less than (limit / # of workers).
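
To put purely illustrative numbers on that: with the default
vacuum_cost_limit of 200 and 3 workers, the per-worker threshold is
200 / 3 ~= 66. If one worker has accumulated a local balance of 180
while the other two have 30 each, the shared balance is 240 and already
over the limit, yet the two light workers skip the sleep because
30 < 66; throttling only happens once the heavy worker next reaches
vacuum_delay_point().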

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#27)
Re: cost based vacuum (parallel)

On Fri, Nov 8, 2019 at 8:18 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Wed, 6 Nov 2019 at 15:45, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote:

The two approaches to solve this problem being discussed in that
thread [1] are as follows:
(a) Allow the parallel workers and master backend to have a shared
view of vacuum cost related parameters (mainly VacuumCostBalance) and
allow each worker to update it and then based on that decide whether
it needs to sleep. Sawada-San has done the POC for this approach.
See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
drawback of this approach could be that we allow the worker to sleep
even though the I/O has been performed by some other worker.

I don't understand this drawback.

I think the problem could be that the system is not properly throttled
when it is supposed to be. Let me try by a simple example, say we
have two workers w-1 and w-2. The w-2 is primarily doing the I/O and
w-1 is doing very less I/O but unfortunately whenever w-1 checks it
finds that cost_limit has exceeded and it goes for sleep, but w-2
still continues.

Now in such a situation even though we have made one
of the workers slept for a required time but ideally the worker which
was doing I/O should have slept. The aim is to make the system stop
doing I/O whenever the limit has exceeded, so that might not work in
the above situation.

One idea to fix this drawback is that if we somehow avoid letting the
workers sleep which has done less or no I/O as compared to other
workers, then we can to a good extent ensure that workers which are
doing more I/O will be throttled more. What we can do is to allow any
worker sleep only if it has performed the I/O above a certain
threshold and the overall balance is more than the cost_limit set by
the system. Then we will allow the worker to sleep proportional to
the work done by it and reduce the VacuumSharedCostBalance by the
amount which is consumed by the current worker. Something like:

If ( VacuumSharedCostBalance >= VacuumCostLimit &&
MyCostBalance > (threshold) VacuumCostLimit / workers)
{
VacuumSharedCostBalance -= MyCostBalance;
Sleep (delay * MyCostBalance/VacuumSharedCostBalance)
}

Assume threshold be 0.5, what that means is, if it has done work more
than 50% of what is expected from this worker and the overall share
cost balance is exceeded, then we will consider this worker to sleep.

What do you guys think?

I think the idea that the more consuming I/O they sleep more longer
time seems good. There seems not to be the drawback of approach(b)
that is to unnecessarily delay vacuum if some indexes are very small
or bulk-deletions of indexes does almost nothing such as brin. But on
the other hand it's possible that workers don't sleep even if shared
cost balance already exceeds the limit because it's necessary for
sleeping that local balance exceeds the worker's limit divided by the
number of workers. For example, a worker is scheduled doing I/O and
exceeds the limit substantially while other 2 workers do less I/O. And
then the 2 workers are scheduled and consume I/O. The total cost
balance already exceeds the limit but the workers will not sleep as
long as the local balance is less than (limit / # of workers).

Right, this is the reason I suggested keeping some threshold for the
local balance (say 50% of (limit / # of workers)). I think we need to
do some experiments to see what works best.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#29Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#28)
2 attachment(s)
Re: cost based vacuum (parallel)

On Fri, Nov 8, 2019 at 8:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 8, 2019 at 8:18 AM Masahiko Sawada

<masahiko.sawada@2ndquadrant.com> wrote:

On Wed, 6 Nov 2019 at 15:45, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote:

The two approaches to solve this problem being discussed in that
thread [1] are as follows:
(a) Allow the parallel workers and master backend to have a shared
view of vacuum cost related parameters (mainly VacuumCostBalance) and
allow each worker to update it and then based on that decide whether
it needs to sleep. Sawada-San has done the POC for this approach.
See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
drawback of this approach could be that we allow the worker to sleep
even though the I/O has been performed by some other worker.

I don't understand this drawback.

I think the problem could be that the system is not properly throttled
when it is supposed to be. Let me try by a simple example, say we
have two workers w-1 and w-2. The w-2 is primarily doing the I/O and
w-1 is doing very less I/O but unfortunately whenever w-1 checks it
finds that cost_limit has exceeded and it goes for sleep, but w-2
still continues.

Now in such a situation even though we have made one
of the workers slept for a required time but ideally the worker which
was doing I/O should have slept. The aim is to make the system stop
doing I/O whenever the limit has exceeded, so that might not work in
the above situation.

One idea to fix this drawback is that if we somehow avoid letting the
workers sleep which has done less or no I/O as compared to other
workers, then we can to a good extent ensure that workers which are
doing more I/O will be throttled more. What we can do is to allow any
worker sleep only if it has performed the I/O above a certain
threshold and the overall balance is more than the cost_limit set by
the system. Then we will allow the worker to sleep proportional to
the work done by it and reduce the VacuumSharedCostBalance by the
amount which is consumed by the current worker. Something like:

If ( VacuumSharedCostBalance >= VacuumCostLimit &&
MyCostBalance > (threshold) VacuumCostLimit / workers)
{
VacuumSharedCostBalance -= MyCostBalance;
Sleep (delay * MyCostBalance/VacuumSharedCostBalance)
}

Assume threshold be 0.5, what that means is, if it has done work more
than 50% of what is expected from this worker and the overall share
cost balance is exceeded, then we will consider this worker to sleep.

What do you guys think?

I think the idea that the more consuming I/O they sleep more longer
time seems good. There seems not to be the drawback of approach(b)
that is to unnecessarily delay vacuum if some indexes are very small
or bulk-deletions of indexes does almost nothing such as brin. But on
the other hand it's possible that workers don't sleep even if shared
cost balance already exceeds the limit because it's necessary for
sleeping that local balance exceeds the worker's limit divided by the
number of workers. For example, a worker is scheduled doing I/O and
exceeds the limit substantially while other 2 workers do less I/O. And
then the 2 workers are scheduled and consume I/O. The total cost
balance already exceeds the limit but the workers will not sleep as
long as the local balance is less than (limit / # of workers).

Right, this is the reason I told to keep some threshold for local
balance(say 50% of (limit / # of workers)). I think we need to do
some experiments to see what is the best thing to do.

I have done some experiments along this line. I have first produced a
case where we can show the problem with the existing shared costing
patch (a worker which is doing less I/O might pay the penalty on behalf
of a worker which is doing more I/O). I have also hacked the shared
costing patch of Sawada-san so that a worker only goes for sleep if the
shared balance has crossed the limit and its local balance has crossed
some threshold[1].

Test setup: I have created 4 indexes on the table, of which 3 indexes
have a lot of pages to process but need to dirty only a few pages,
whereas the 4th index has to process very few pages but needs to dirty
all of them. I have attached the test script with the mail. For each
worker I have shown the delay time, the total I/O[1], and the page hit,
page miss and page dirty counts.

[1] total I/O = _nhit * VacuumCostPageHit + _nmiss * VacuumCostPageMiss
+ _ndirty * VacuumCostPageDirty

patch 1: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit)
worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603
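
(As a sanity check on the footnoted formula with the default cost
parameters, VacuumCostPageHit = 1, VacuumCostPageMiss = 10 and
VacuumCostPageDirty = 20: worker 3's line works out to
4318 * 1 + 0 * 10 + 603 * 20 = 16378, and worker 0's to
17891 * 1 + 2 * 20 = 17931, matching the reported totals.)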

Observation1: I think it's clearly visible here that worker 3 is
doing the least total I/O but is delaying for the maximum amount of
time. OTOH, worker 1 is delaying for very little time compared to how
much I/O it is doing. To solve this problem, I have added a small
tweak to the patch: the worker will only sleep if its local balance
has crossed some threshold. And we can see that with that change the
problem is solved to quite an extent.

patch 2: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance >
VacuumCostLimit/number of workers)
worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603

Observation2: This patch solves the problem discussed with patch1, but
in some extreme cases there is a possibility that the shared balance
can grow to about twice the cost limit and still no worker goes for
the delay. There could be multiple ideas for solving that:
a) Set a max limit on the shared balance, e.g. 1.5 * VacuumCostLimit,
after which we delay whoever tries to do the I/O next, irrespective of
its local balance.
b) Set a somewhat lower value for the local threshold, e.g. 50% of the
local limit.

Here I have changed patch2 as per (b): if the local balance reaches
50% of the local limit and the shared balance hits the vacuum cost
limit, then go for the delay.

patch 3: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5
* VacuumCostLimit/number of workers)
worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603

Observation3: I think patch3 doesn't completely solve the issue
discussed for patch1, but it is far better than patch1. But patch2
might have another problem, as discussed in observation2.

I think I need to do some more analysis and experiments before we can
reach a conclusion. But one point is clear: we need to do something to
solve the problem observed with patch1 if we are going with the shared
costing approach.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

test.sh (text/x-sh)
test.sql (application/octet-stream)
#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#29)
Re: cost based vacuum (parallel)

On Fri, Nov 8, 2019 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have done some experiments on this line. I have first produced a
case where we can show the problem with the existing shared costing
patch (worker which is doing less I/O might pay the penalty on behalf
of the worker who is doing more I/O). I have also hacked the shared
costing patch of Swada-san so that worker only go for sleep if the
shared balance has crossed the limit and it's local balance has
crossed some threadshold[1].

Test setup: I have created 4 indexes on the table. Out of which 3
indexes will have a lot of pages to process but need to dirty a few
pages whereas the 4th index will have to process a very less number of
pages but need to dirty all of them. I have attached the test script
along with the mail. I have shown what is the delay time each worker
have done. What is total I/O[1] each worker and what is the page hit,
page miss and page dirty count?
[1] total I/O = _nhit * VacuumCostPageHit + _nmiss *
VacuumCostPageMiss + _ndirty * VacuumCostPageDirty

patch 1: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit)
worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603

Observation1: I think here it's clearly visible that worker 3 is
doing the least total I/O but delaying for maximum amount of time.
OTOH, worker 1 is delaying for very little time compared to how much
I/O it is doing. So for solving this problem, I have add a small
tweak to the patch. Wherein the worker will only sleep if its local
balance has crossed some threshold. And, we can see that with that
change the problem is solved up to quite an extent.

patch 2: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance >
VacuumCostLimit/number of workers)
worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603

Observation2: This patch solves the problem discussed with patch1 but
in some extreme cases there is a possibility that the shared limit can
become twice as much as local limit and still no worker goes for the
delay. For solving that there could be multiple ideas a) Set the max
limit on shared balance e.g. 1.5 * VacuumCostLimit after that we will
give the delay whoever tries to do the I/O irrespective of its local
balance.
b) Set a little lower value for the local threshold e.g 50% of the local limit

Here I have changed the patch2 as per (b) If local balance reaches to
50% of the local limit and shared balance hit the vacuum cost limit
then go for the delay.

patch 3: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5
* VacuumCostLimit/number of workers)
worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603

Observation3: I think patch3 doesn't completely solve the issue
discussed in patch1 but its far better than patch1.

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold, and
another possibility is to sleep in proportion to the amount of I/O
done by the worker.

Thanks for doing these experiments, but I think it would be better if
you could share the modified patches so that others can also reproduce
what you are seeing. There is no need to post the entire parallel
vacuum patch-set, but the costing-related patch can be posted with a
reference to which patches from the parallel vacuum thread are
required. Another option is to move this discussion to the parallel
vacuum thread, but I think it is better to decide the costing model
here.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#31Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#30)
4 attachment(s)
Re: cost based vacuum (parallel)

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 8, 2019 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have done some experiments on this line. I have first produced a
case where we can show the problem with the existing shared costing
patch (worker which is doing less I/O might pay the penalty on behalf
of the worker who is doing more I/O). I have also hacked the shared
costing patch of Swada-san so that worker only go for sleep if the
shared balance has crossed the limit and it's local balance has
crossed some threadshold[1].

Test setup: I have created 4 indexes on the table. Out of which 3
indexes will have a lot of pages to process but need to dirty a few
pages whereas the 4th index will have to process a very less number of
pages but need to dirty all of them. I have attached the test script
along with the mail. I have shown what is the delay time each worker
have done. What is total I/O[1] each worker and what is the page hit,
page miss and page dirty count?
[1] total I/O = _nhit * VacuumCostPageHit + _nmiss *
VacuumCostPageMiss + _ndirty * VacuumCostPageDirty

patch 1: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit)
worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603

Observation1: I think here it's clearly visible that worker 3 is
doing the least total I/O but delaying for maximum amount of time.
OTOH, worker 1 is delaying for very little time compared to how much
I/O it is doing. So for solving this problem, I have add a small
tweak to the patch. Wherein the worker will only sleep if its local
balance has crossed some threshold. And, we can see that with that
change the problem is solved up to quite an extent.

patch 2: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance >
VacuumCostLimit/number of workers)
worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603

Observation2: This patch solves the problem discussed with patch1 but
in some extreme cases there is a possibility that the shared limit can
become twice as much as local limit and still no worker goes for the
delay. For solving that there could be multiple ideas a) Set the max
limit on shared balance e.g. 1.5 * VacuumCostLimit after that we will
give the delay whoever tries to do the I/O irrespective of its local
balance.
b) Set a little lower value for the local threshold e.g 50% of the local limit

Here I have changed the patch2 as per (b) If local balance reaches to
50% of the local limit and shared balance hit the vacuum cost limit
then go for the delay.

patch 3: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5
* VacuumCostLimit/number of workers)
worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603

Observation3: I think patch3 doesn't completely solve the issue
discussed in patch1 but its far better than patch1.

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have made 2 more changes on
top of patch3:
a) Only reduce the local balance from the total shared balance
whenever the worker is applying the delay.
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the workers
quite well compared to the other approaches.

Thanks for doing these experiments, but I think it is better if you
can share the modified patches so that others can also reproduce what
you are seeing. There is no need to post the entire parallel vacuum
patch-set, but the costing related patch can be posted with a
reference to what all patches are required from parallel vacuum
thread. Another option is to move this discussion to the parallel
vacuum thread, but I think it is better to decide the costing model
here.

I have attached the POC patches I have for testing. Steps for testing:
1. First, apply the parallel vacuum base patch and the shared costing patch[1].
2. Apply 0001-vacuum_costing_test.patch attached to this mail.
3. Run the script shared in the previous mail[2] --> this will give the
results for patch 1 shared upthread[2].
4. Apply shared_costing_plus_patch 2, 3 or 4 to see the results with
the different approaches explained in the mail.

[1] /messages/by-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD+XOh_Q@mail.gmail.com
[2] /messages/by-id/CAFiTN-tFLN=vdu5Ra-23E9_7Z1JXkk5MkRY3Bkj2zAoWK7fULA@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-vacuum_costing_test.patch (application/octet-stream)
From f7e70196d8d16515bd06af0f18a55786846b0d2b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 7 Nov 2019 11:08:37 +0530
Subject: [PATCH] vacuum_costing_test

---
 src/backend/access/heap/vacuumlazy.c | 73 ++++++++++++++++++++++++++++++++++++
 src/backend/commands/vacuum.c        |  3 ++
 src/backend/storage/buffer/bufmgr.c  | 12 ++++++
 src/backend/utils/init/globals.c     |  6 ++-
 src/include/miscadmin.h              |  4 ++
 5 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a9d9f31..1fb11e6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -137,6 +137,7 @@
 #define PARALLEL_VACUUM_KEY_SHARED			1
 #define PARALLEL_VACUUM_KEY_DEAD_TUPLES		2
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT		3
+#define PARALLEL_VACUUM_KEY_COST_DELAY		4
 
 /*
  * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION disables the leader's
@@ -257,6 +258,21 @@ typedef struct LVSharedIndStats
 #define GetIndexBulkDeleteResult(s) \
 	((IndexBulkDeleteResult *)((char *)(s) + sizeof(LVSharedIndStats)))
 
+typedef struct LVDelayStats
+{
+	double	time;
+	int		hit;
+	int		miss;
+	int		dirty;
+} LVDelayStats;
+
+typedef struct LVCostDelay
+{
+	pg_atomic_uint32	nslot;
+	LVDelayStats 		stats[FLEXIBLE_ARRAY_MEMBER];
+} LVCostDelay;
+#define SizeOfLVCostDelay offsetof(LVCostDelay, stats) + sizeof(LVDelayStats)
+
 /* Struct for parallel lazy vacuum */
 typedef struct LVParallelState
 {
@@ -265,6 +281,8 @@ typedef struct LVParallelState
 	/* Shared information among parallel vacuum workers */
 	LVShared		*lvshared;
 
+	/* Shared cost delay. */
+	LVCostDelay		*lvcostdelay;
 	/*
 	 * Always true except for a debugging case where
 	 * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION are defined.
@@ -757,6 +775,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		parallel_workers = compute_parallel_workers(Irel, nindexes,
 													params->nworkers);
 
+	VacuumCostTotalDelay = 0;
 	if (parallel_workers > 0)
 	{
 		/*
@@ -1967,6 +1986,10 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 										int nindexes, IndexBulkDeleteResult **stats,
 										LVParallelState *lps)
 {
+	int		i;
+	int		io;
+	double	costdelay;
+
 	Assert(!IsParallelWorker());
 	Assert(ParallelVacuumIsActive(lps));
 	Assert(nindexes > 0);
@@ -1994,6 +2017,9 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 								 lps->pcxt->nworkers_launched),
 						lps->pcxt->nworkers_launched, lps->pcxt->nworkers)));
 
+	costdelay = VacuumCostTotalDelay;
+	VacuumCostTotalDelay = 0;
+	_nhit=_nmiss=_ndirty=0;
 	/*
 	 * Join as parallel workers. The leader process alone does that in case where
 	 * no workers launched.
@@ -2011,6 +2037,30 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	/* Continue to use the shared balance value */
 	VacuumCostBalance = pg_atomic_read_u32(&(lps->lvshared->cost_balance));
 
+	io = _nhit * VacuumCostPageHit + _nmiss * VacuumCostPageMiss + _ndirty * VacuumCostPageDirty;
+	elog(WARNING, "worker 0 delay=%lf total io=%d hit=%d miss=%d dirty=%d",
+				  VacuumCostTotalDelay, io, _nhit, _nmiss, _ndirty);
+	/*
+	 * Collect all the delay from workers and add to total delay. The delay from leader
+	 * is already included in VacuumCostTotalDelay.
+	 */
+	for (i = 0; i < lps->pcxt->nworkers_launched; i++)
+	{
+		VacuumCostTotalDelay += lps->lvcostdelay->stats[i].time;
+		_nhit += lps->lvcostdelay->stats[i].hit;
+		_nmiss += lps->lvcostdelay->stats[i].miss;
+		_ndirty += lps->lvcostdelay->stats[i].dirty;
+		io = lps->lvcostdelay->stats[i].hit * VacuumCostPageHit +
+			 lps->lvcostdelay->stats[i].miss * VacuumCostPageMiss +
+			 lps->lvcostdelay->stats[i].dirty * VacuumCostPageDirty;
+		elog(WARNING, "worker %d delay=%lf total io=%d hit=%d miss=%d dirty=%d",
+					  i+1, lps->lvcostdelay->stats[i].time, io,
+					  lps->lvcostdelay->stats[i].hit,
+					  lps->lvcostdelay->stats[i].miss,
+					  lps->lvcostdelay->stats[i].dirty);	
+	}
+	VacuumCostTotalDelay += costdelay;
+
 	/*
 	 * We need to reinitialize the parallel context as no more index vacuuming and
 	 * index cleanup will be performed after that.
@@ -2968,10 +3018,12 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	ParallelContext *pcxt;
 	LVShared		*shared;
 	LVDeadTuples	*dead_tuples;
+	LVCostDelay		*costdelay;
 	long	maxtuples;
 	char	*sharedquery;
 	Size	est_shared;
 	Size	est_deadtuples;
+	Size	est_costdelay;
 	int		querylen;
 	int		i;
 
@@ -3043,6 +3095,14 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	sharedquery[querylen] = '\0';
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
 
+	/* Vacuum cost balance. */
+	est_costdelay = MAXALIGN(add_size(SizeOfLVCostDelay,
+								   mul_size(sizeof(LVDelayStats), nrequested)));
+	costdelay = (LVCostDelay *) shm_toc_allocate(pcxt->toc, est_costdelay);
+	pg_atomic_init_u32(&(costdelay->nslot), 0);
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_COST_DELAY, costdelay);
+	lps->lvcostdelay = costdelay;
+
 	return lps;
 }
 
@@ -3171,8 +3231,10 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	Relation	*indrels;
 	LVShared	*lvshared;
 	LVDeadTuples	*dead_tuples;
+	LVCostDelay		*costdelay;
 	int			nindexes;
 	char		*sharedquery;
+	int			slot;
 	IndexBulkDeleteResult **stats;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3207,6 +3269,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 												  PARALLEL_VACUUM_KEY_DEAD_TUPLES,
 												  false);
 
+	costdelay = (LVCostDelay *) shm_toc_lookup(toc,
+												   PARALLEL_VACUUM_KEY_COST_DELAY,
+												   false);
+	slot = pg_atomic_fetch_add_u32(&(costdelay->nslot), 1);
+
 	/* Set cost-based vacuum delay */
 	VacuumCostActive = (VacuumCostDelay > 0);
 	VacuumCostBalance = 0;
@@ -3214,6 +3281,7 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumPageMiss = 0;
 	VacuumPageDirty = 0;
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
+	_nhit = _nmiss = _ndirty = 0;
 
 	stats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
@@ -3225,6 +3293,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacuum_or_cleanup_indexes_worker(indrels, nindexes, stats, lvshared,
 									 dead_tuples);
 
+	/* update the total delay in the shared location. */
+	costdelay->stats[slot].time = VacuumCostTotalDelay;
+	costdelay->stats[slot].hit = _nhit;
+	costdelay->stats[slot].miss = _nmiss;
+	costdelay->stats[slot].dirty = _ndirty;
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
 	table_close(onerel, ShareUpdateExclusiveLock);
 }
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 905d173..bb07a5a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -412,6 +412,7 @@ vacuum(List *relations, VacuumParams *params,
 		VacuumPageHit = 0;
 		VacuumPageMiss = 0;
 		VacuumPageDirty = 0;
+		_nhit = _nmiss = _ndirty = 0;
 		VacuumSharedCostBalance = NULL;
 
 		/*
@@ -2046,6 +2047,8 @@ vacuum_delay_point(void)
 			/* update balance values for workers */
 			AutoVacuumUpdateDelay();
 
+			VacuumCostTotalDelay += msec;
+
 			/* Might have gotten an interrupt while sleeping */
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad1073..1ba71c8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,7 +770,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			VacuumPageHit++;
 
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageHit;
+				_nhit++;
+			}
 
 			TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 											  smgr->smgr_rnode.node.spcNode,
@@ -959,7 +962,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	VacuumPageMiss++;
 	if (VacuumCostActive)
+	{
 		VacuumCostBalance += VacuumCostPageMiss;
+		_nmiss++;
+	}
 
 	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 									  smgr->smgr_rnode.node.spcNode,
@@ -1500,7 +1506,10 @@ MarkBufferDirty(Buffer buffer)
 		VacuumPageDirty++;
 		pgBufferUsage.shared_blks_dirtied++;
 		if (VacuumCostActive)
+		{
 			VacuumCostBalance += VacuumCostPageDirty;
+			_ndirty++;
+		}
 	}
 }
 
@@ -3556,7 +3565,10 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			VacuumPageDirty++;
 			pgBufferUsage.shared_blks_dirtied++;
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageDirty;
+				_ndirty++;
+			}
 		}
 	}
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de..de214f3 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -139,11 +139,15 @@ int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
 int			VacuumCostLimit = 200;
 double		VacuumCostDelay = 0;
-
+double		VacuumCostTotalDelay = 0;
 int			VacuumPageHit = 0;
 int			VacuumPageMiss = 0;
 int			VacuumPageDirty = 0;
 
+int	_nhit = 0;
+int _nmiss = 0;
+int _ndirty = 0;
+
 int			VacuumCostBalance = 0;	/* working state for vacuum */
 bool		VacuumCostActive = false;
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03f..8d95b6e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -251,6 +251,10 @@ extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
 extern int	VacuumCostLimit;
 extern double VacuumCostDelay;
+extern double VacuumCostTotalDelay;
+extern int	_nhit;
+extern int _nmiss;
+extern int _ndirty;
 
 extern int	VacuumPageHit;
 extern int	VacuumPageMiss;
-- 
1.8.3.1

shared_costing_plus_patch2.patch (application/octet-stream)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1fb11e6..f86aac0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2020,6 +2020,8 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	costdelay = VacuumCostTotalDelay;
 	VacuumCostTotalDelay = 0;
 	_nhit=_nmiss=_ndirty=0;
+	VacuumCostBalanceLocal = 0;
+
 	/*
 	 * Join as parallel workers. The leader process alone does that in case where
 	 * no workers launched.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb07a5a..a4725ec 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1989,6 +1989,7 @@ void
 vacuum_delay_point(void)
 {
 	double	msec = 0;
+	int		nworker = 4; /* Fixed value for POC */
 
 	/* Always check for interrupts */
 	CHECK_FOR_INTERRUPTS();
@@ -2011,12 +2012,14 @@ vacuum_delay_point(void)
 				/* compute new balance by adding the local value */
 				shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
 				new_balance = shared_balance + VacuumCostBalance;
-
-				if (new_balance >= VacuumCostLimit)
+				VacuumCostBalanceLocal += VacuumCostBalance;
+				if ((new_balance >= VacuumCostLimit) &&
+					(VacuumCostBalanceLocal > VacuumCostLimit/nworker))
 				{
 					/* compute sleep time based on the shared cost balance */
 					msec = VacuumCostDelay * new_balance / VacuumCostLimit;
 					new_balance %= VacuumCostLimit;
+					VacuumCostBalanceLocal = 0;
 				}
 
 				if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index de214f3..f7752a4 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int _nmiss = 0;
 int _ndirty = 0;
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
+int			VacuumCostBalanceLocal = 0;
 bool		VacuumCostActive = false;
 
 double		vacuum_cleanup_index_scale_factor;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 8d95b6e..55f19ec 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -261,6 +261,7 @@ extern int	VacuumPageMiss;
 extern int	VacuumPageDirty;
 
 extern int	VacuumCostBalance;
+extern int	VacuumCostBalanceLocal;
 extern bool VacuumCostActive;
 
 extern double vacuum_cleanup_index_scale_factor;
shared_costing_plus_patch4.patch (application/octet-stream)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1fb11e6..83640b3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2020,6 +2020,7 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	costdelay = VacuumCostTotalDelay;
 	VacuumCostTotalDelay = 0;
 	_nhit=_nmiss=_ndirty=0;
+	VacuumCostBalanceLocal = 0;
 	/*
 	 * Join as parallel workers. The leader process alone does that in case where
 	 * no workers launched.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb07a5a..935f3e3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1989,6 +1989,7 @@ void
 vacuum_delay_point(void)
 {
 	double	msec = 0;
+	int		nworker = 4; /* Fixed value for POC */
 
 	/* Always check for interrupts */
 	CHECK_FOR_INTERRUPTS();
@@ -2011,12 +2012,14 @@ vacuum_delay_point(void)
 				/* compute new balance by adding the local value */
 				shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
 				new_balance = shared_balance + VacuumCostBalance;
-
-				if (new_balance >= VacuumCostLimit)
+				VacuumCostBalanceLocal += VacuumCostBalance;
+				if ((new_balance >= VacuumCostLimit) &&
+					(VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))
 				{
-					/* compute sleep time based on the shared cost balance */
-					msec = VacuumCostDelay * new_balance / VacuumCostLimit;
-					new_balance %= VacuumCostLimit;
+					/* compute sleep time based on the local cost balance */
+					msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
+					new_balance = shared_balance - VacuumCostBalanceLocal;
+					VacuumCostBalanceLocal = 0;
 				}
 
 				if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index de214f3..f7752a4 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int _nmiss = 0;
 int _ndirty = 0;
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
+int			VacuumCostBalanceLocal = 0;
 bool		VacuumCostActive = false;
 
 double		vacuum_cleanup_index_scale_factor;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 8d95b6e..55f19ec 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -261,6 +261,7 @@ extern int	VacuumPageMiss;
 extern int	VacuumPageDirty;
 
 extern int	VacuumCostBalance;
+extern int	VacuumCostBalanceLocal;
 extern bool VacuumCostActive;
 
 extern double vacuum_cleanup_index_scale_factor;
shared_costing_plus_patch3.patch (application/octet-stream)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1fb11e6..83640b3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2020,6 +2020,7 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	costdelay = VacuumCostTotalDelay;
 	VacuumCostTotalDelay = 0;
 	_nhit=_nmiss=_ndirty=0;
+	VacuumCostBalanceLocal = 0;
 	/*
 	 * Join as parallel workers. The leader process alone does that in case where
 	 * no workers launched.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb07a5a..50ecc5c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1989,6 +1989,7 @@ void
 vacuum_delay_point(void)
 {
 	double	msec = 0;
+	int		nworker = 4; /* Fixed value for POC */
 
 	/* Always check for interrupts */
 	CHECK_FOR_INTERRUPTS();
@@ -2011,12 +2012,14 @@ vacuum_delay_point(void)
 				/* compute new balance by adding the local value */
 				shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
 				new_balance = shared_balance + VacuumCostBalance;
-
-				if (new_balance >= VacuumCostLimit)
+				VacuumCostBalanceLocal += VacuumCostBalance;
+				if ((new_balance >= VacuumCostLimit) &&
+					(VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))
 				{
 					/* compute sleep time based on the shared cost balance */
 					msec = VacuumCostDelay * new_balance / VacuumCostLimit;
 					new_balance %= VacuumCostLimit;
+					VacuumCostBalanceLocal = 0;
 				}
 
 				if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index de214f3..f7752a4 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int _nmiss = 0;
 int _ndirty = 0;
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
+int			VacuumCostBalanceLocal = 0;
 bool		VacuumCostActive = false;
 
 double		vacuum_cleanup_index_scale_factor;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 8d95b6e..55f19ec 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -261,6 +261,7 @@ extern int	VacuumPageMiss;
 extern int	VacuumPageDirty;
 
 extern int	VacuumCostBalance;
+extern int	VacuumCostBalanceLocal;
 extern bool VacuumCostActive;
 
 extern double vacuum_cleanup_index_scale_factor;
#32Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#31)
2 attachment(s)
Re: cost based vacuum (parallel)

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 8, 2019 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have done some experiments on this line. I have first produced a
case where we can show the problem with the existing shared costing
patch (worker which is doing less I/O might pay the penalty on behalf
of the worker who is doing more I/O). I have also hacked the shared
costing patch of Swada-san so that worker only go for sleep if the
shared balance has crossed the limit and it's local balance has
crossed some threadshold[1].

Test setup: I have created 4 indexes on the table. Out of which 3
indexes will have a lot of pages to process but need to dirty a few
pages whereas the 4th index will have to process a very less number of
pages but need to dirty all of them. I have attached the test script
along with the mail. I have shown what is the delay time each worker
have done. What is total I/O[1] each worker and what is the page hit,
page miss and page dirty count?
[1] total I/O = _nhit * VacuumCostPageHit + _nmiss *
VacuumCostPageMiss + _ndirty * VacuumCostPageDirty

patch 1: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit)
worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603

Observation1: I think here it's clearly visible that worker 3 is
doing the least total I/O but delaying for maximum amount of time.
OTOH, worker 1 is delaying for very little time compared to how much
I/O it is doing. So for solving this problem, I have add a small
tweak to the patch. Wherein the worker will only sleep if its local
balance has crossed some threshold. And, we can see that with that
change the problem is solved up to quite an extent.

patch 2: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance >
VacuumCostLimit/number of workers)
worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603

Observation2: This patch solves the problem discussed with patch1 but
in some extreme cases there is a possibility that the shared limit can
become twice as much as local limit and still no worker goes for the
delay. For solving that there could be multiple ideas a) Set the max
limit on shared balance e.g. 1.5 * VacuumCostLimit after that we will
give the delay whoever tries to do the I/O irrespective of its local
balance.
b) Set a little lower value for the local threshold e.g 50% of the local limit

Here I have changed the patch2 as per (b) If local balance reaches to
50% of the local limit and shared balance hit the vacuum cost limit
then go for the delay.

patch 3: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5
* VacuumCostLimit/number of workers)
worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603

Observation3: I think patch3 doesn't completely solve the issue
discussed in patch1 but its far better than patch1.

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have done another 2 changes on
top op patch3
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the worker quite
well compared to other approaches

Thanks for doing these experiments, but I think it is better if you
can share the modified patches so that others can also reproduce what
you are seeing. There is no need to post the entire parallel vacuum
patch-set, but the costing related patch can be posted with a
reference to what all patches are required from parallel vacuum
thread. Another option is to move this discussion to the parallel
vacuum thread, but I think it is better to decide the costing model
here.

I have attached the POC patches I have for testing. Step for testing
1. First, apply the parallel vacuum base patch and the shared costing patch[1].
2. Apply 0001-vacuum_costing_test.patch attached in the mail
3. Run the script shared in previous mail [2]. --> this will give the
results for patch 1 shared upthread[2]
4. Apply patch shared_costing_plus_patch[2] or [3] or [4] to see the
results with different approaches explained in the mail.

[1] /messages/by-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD+XOh_Q@mail.gmail.com
[2] /messages/by-id/CAFiTN-tFLN=vdu5Ra-23E9_7Z1JXkk5MkRY3Bkj2zAoWK7fULA@mail.gmail.com

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to the other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to the other patches.

patch1:
worker 0 delay=120.000000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=170.000000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=210.000000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=263.400000 total io=44322 hit=8352 miss=1199 dirty=1199

patch2:
worker 0 delay=190.645000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=160.090000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=170.775000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=243.180000 total io=44322 hit=8352 miss=1199 dirty=1199

patch3:
worker 0 delay=191.765000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=180.935000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=201.305000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=192.770000 total io=44322 hit=8352 miss=1199 dirty=1199

patch4:
worker 0 delay=175.290000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=174.135000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=175.560000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=212.100000 total io=44322 hit=8352 miss=1199 dirty=1199

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

test1.sh (text/x-sh)
test1.sql (application/octet-stream)
#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#32)
Re: cost based vacuum (parallel)

On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have done another 2 changes on
top op patch3
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the worker quite
well compared to other approaches

..

I have tested the same with some other workload(test file attached).
I can see the same behaviour with this workload as well that with the
patch 4 the distribution of the delay is better compared to other
patches i.e. worker with more I/O have more delay and with equal IO
have alsomost equal delay. Only thing is that the total delay with
the patch 4 is slightly less compared to other pacthes.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per the discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think
you can change this and try again. Also, please try with different
values of the threshold (0.3, 0.5, 0.7, etc.).
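
To make the difference concrete: with the default vacuum_cost_limit of
200 and the nworker = 4 hard-coded in the POC, the condition as written
only fires once the local balance exceeds 200 / (0.5 * 4) = 100,
whereas the intended threshold is 0.5 * 200 / 4 = 25. The local
threshold is therefore four times higher than intended (the factor of
four holds for any worker count), so workers qualify for a sleep much
less often, which could explain the lower total delay.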

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#34Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#33)
Re: cost based vacuum (parallel)

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have done another 2 changes on
top op patch3
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the worker quite
well compared to other approaches

..

I have tested the same with some other workload(test file attached).
I can see the same behaviour with this workload as well that with the
patch 4 the distribution of the delay is better compared to other
patches i.e. worker with more I/O have more delay and with equal IO
have alsomost equal delay. Only thing is that the total delay with
the patch 4 is slightly less compared to other pacthes.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker".

My Bad
I think

you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

Okay, I will retest with both patch3 and patch4 for both the scenarios.
I will also try with different multipliers.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#34)
Re: cost based vacuum (parallel)

On Mon, Nov 11, 2019 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker".

My Bad
I think

you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

Okay, I will retest with both patch3 and patch4 for both the scenarios.
I will also try with different multipliers.

One more thing: I think we should also test these cases with a varying
number of indexes (say 2, 6, 8, etc.) and then, probably, we should
test with a varying number of workers where the number of workers is
less than the number of indexes. You can do these after finishing your
previous experiments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#36Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#33)
1 attachment(s)
Re: cost based vacuum (parallel)

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have made another 2 changes on
top of patch3:
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the workers quite
well compared to other approaches.

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think
you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

I have modified patch4 and ran with different values. But I don't see
much difference in the values with patch4. In fact, I removed the
condition for the local balancing check completely and still the delays
are the same. I think this is because with patch4 the workers are only
reducing their own balance and also delaying only as much as their
local balance. So maybe the second condition will not have much
impact.
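
To illustrate why, here is a rough standalone simulation of only the
local-balance part of patch4 (a sketch of my own; the shared-balance
check and all the real bookkeeping are ignored, and the constants are
made up): because the sleep is proportional to the local balance and
the balance is then reset to zero, the summed sleep stays close to
VacuumCostDelay * total cost / VacuumCostLimit for very different
threshold fractions.

#include <stdio.h>

static double
simulate(double threshold_frac)
{
    int     vacuum_cost_limit = 2000;   /* made-up values, illustration only */
    double  vacuum_cost_delay = 10.0;   /* msec */
    int     nworker = 4;
    int     npages = 100000;
    int     local = 0;
    double  total_sleep = 0.0;

    for (int i = 0; i < npages; i++)
    {
        local += 1;                     /* pretend every page costs 1 (a hit) */
        if (local > threshold_frac * (vacuum_cost_limit / nworker))
        {
            /* sleep in proportion to the local balance, then reset it */
            total_sleep += vacuum_cost_delay * local / vacuum_cost_limit;
            local = 0;
        }
    }
    return total_sleep;
}

int
main(void)
{
    printf("threshold 0.1: total sleep %.1f msec\n", simulate(0.1));
    printf("threshold 0.7: total sleep %.1f msec\n", simulate(0.7));
    return 0;
}

Both runs print roughly 500 msec in this toy setup, i.e. the threshold
mostly changes when the sleeps happen, not how much is slept in total.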

Patch4 (test.sh)
0
worker 0 delay=82.380000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.370000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.645000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=79.150000 total io=16378 hit=4318 miss=0 dirty=603

0.1
worker 0 delay=89.295000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.675000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.840000 total io=16378 hit=4318 miss=0 dirty=603

0.3
worker 0 delay=85.915000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=85.180000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.760000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.975000 total io=16378 hit=4318 miss=0 dirty=603

0.5
worker 0 delay=81.635000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=87.490000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.425000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=82.050000 total io=16378 hit=4318 miss=0 dirty=603

0.7
worker 0 delay=85.185000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=88.835000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=86.005000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=76.160000 total io=16378 hit=4318 miss=0 dirty=603

Patch4 (test1.sh)
0
worker 0 delay=179.005000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.900000 total io=44322 hit=8352 miss=1199 dirty=1199

0.1
worker 0 delay=177.840000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=179.465000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=179.255000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=222.695000 total io=44322 hit=8352 miss=1199 dirty=1199

0.3
worker 0 delay=178.295000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.720000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=178.270000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=220.420000 total io=44322 hit=8352 miss=1199 dirty=1199

0.5
worker 0 delay=178.415000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.385000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=173.805000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.605000 total io=44322 hit=8352 miss=1199 dirty=1199

0.7
worker 0 delay=175.330000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=177.890000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=167.540000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=216.725000 total io=44322 hit=8352 miss=1199 dirty=1199

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

shared_costing_plus_patch4.patch (application/octet-stream)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1fb11e6..83640b3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2020,6 +2020,7 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	costdelay = VacuumCostTotalDelay;
 	VacuumCostTotalDelay = 0;
 	_nhit=_nmiss=_ndirty=0;
+	VacuumCostBalanceLocal = 0;
 	/*
 	 * Join as parallel workers. The leader process alone does that in case where
 	 * no workers launched.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb07a5a..5db0e9c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1989,6 +1989,7 @@ void
 vacuum_delay_point(void)
 {
 	double	msec = 0;
+	int		nworker = 4; /* Fixed value for POC */
 
 	/* Always check for interrupts */
 	CHECK_FOR_INTERRUPTS();
@@ -2011,12 +2012,14 @@ vacuum_delay_point(void)
 				/* compute new balance by adding the local value */
 				shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
 				new_balance = shared_balance + VacuumCostBalance;
-
-				if (new_balance >= VacuumCostLimit)
+				VacuumCostBalanceLocal += VacuumCostBalance;
+				if ((new_balance >= VacuumCostLimit) &&
+					(VacuumCostBalanceLocal > 0.5 * (VacuumCostLimit/nworker)))
 				{
-					/* compute sleep time based on the shared cost balance */
-					msec = VacuumCostDelay * new_balance / VacuumCostLimit;
-					new_balance %= VacuumCostLimit;
+					/* compute sleep time based on the local cost balance */
+					msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
+					new_balance = shared_balance - VacuumCostBalanceLocal;
+					VacuumCostBalanceLocal = 0;
 				}
 
 				if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index de214f3..f7752a4 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int _nmiss = 0;
 int _ndirty = 0;
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
+int			VacuumCostBalanceLocal = 0;
 bool		VacuumCostActive = false;
 
 double		vacuum_cleanup_index_scale_factor;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 8d95b6e..55f19ec 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -261,6 +261,7 @@ extern int	VacuumPageMiss;
 extern int	VacuumPageDirty;
 
 extern int	VacuumCostBalance;
+extern int	VacuumCostBalanceLocal;
 extern bool VacuumCostActive;
 
 extern double vacuum_cleanup_index_scale_factor;
#37Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#36)
1 attachment(s)
Re: cost based vacuum (parallel)

On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have made another 2 changes on
top of patch3:
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the workers quite
well compared to other approaches.

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think
you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

I have modified patch4 and ran with different values. But I don't see
much difference in the values with patch4. In fact, I removed the
condition for the local balancing check completely and still the delays
are the same. I think this is because with patch4 the workers are only
reducing their own balance and also delaying only as much as their
local balance. So maybe the second condition will not have much
impact.

Patch4 (test.sh)
0
worker 0 delay=82.380000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.370000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.645000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=79.150000 total io=16378 hit=4318 miss=0 dirty=603

0.1
worker 0 delay=89.295000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.675000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.840000 total io=16378 hit=4318 miss=0 dirty=603

0.3
worker 0 delay=85.915000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=85.180000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.760000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.975000 total io=16378 hit=4318 miss=0 dirty=603

0.5
worker 0 delay=81.635000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=87.490000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.425000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=82.050000 total io=16378 hit=4318 miss=0 dirty=603

0.7
worker 0 delay=85.185000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=88.835000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=86.005000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=76.160000 total io=16378 hit=4318 miss=0 dirty=603

Patch4 (test1.sh)
0
worker 0 delay=179.005000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.900000 total io=44322 hit=8352 miss=1199 dirty=1199

0.1
worker 0 delay=177.840000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=179.465000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=179.255000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=222.695000 total io=44322 hit=8352 miss=1199 dirty=1199

0.3
worker 0 delay=178.295000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.720000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=178.270000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=220.420000 total io=44322 hit=8352 miss=1199 dirty=1199

0.5
worker 0 delay=178.415000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.385000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=173.805000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.605000 total io=44322 hit=8352 miss=1199 dirty=1199

0.7
worker 0 delay=175.330000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=177.890000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=167.540000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=216.725000 total io=44322 hit=8352 miss=1199 dirty=1199

I have revised patch4 so that it doesn't depend upon a fixed number of
workers; instead, I have dynamically updated the worker count.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

shared_costing_plus_patch4_v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1fb11e6..7989fa5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -220,6 +220,13 @@ typedef struct LVShared
 	pg_atomic_uint32 cost_balance;
 
 	/*
+	 * Number of active parallel workers.  This is used for computing the
+	 * minimum threshold of the vacuum cost balance for a worker to go
+	 * for the delay.
+	 */
+	pg_atomic_uint32 nworkers;
+
+	/*
 	 * Variables to control parallel index vacuuming.  Index statistics
 	 * returned from ambulkdelete and amvacuumcleanup is nullable variable
 	 * length.  'offset' is NULL bitmap. Note that a 0 indicates a null,
@@ -241,6 +248,9 @@ typedef struct LVShared
 /* Global variable for shared cost-based vacuum delay */
 pg_atomic_uint32	*VacuumSharedCostBalance = NULL;
 
+/* Active worker count for shared cost-based vacuum delay */
+pg_atomic_uint32	*VacuumActiveWorkers = NULL;
+
 /*
  * Struct for an index bulk-deletion statistic used for parallel lazy
  * vacuum. This is allocated in the DSM segment.  IndexBulkDeleteResult
@@ -1999,9 +2009,10 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	 * balance.
 	 */
 	pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
+	pg_atomic_write_u32(&(lps->lvshared->nworkers), 0);
 	VacuumCostBalance = 0;
 	VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
-
+	VacuumActiveWorkers = &(lps->lvshared->nworkers);
 	LaunchParallelWorkers(lps->pcxt);
 
 	if (lps->lvshared->for_cleanup)
@@ -2020,13 +2031,24 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	costdelay = VacuumCostTotalDelay;
 	VacuumCostTotalDelay = 0;
 	_nhit=_nmiss=_ndirty=0;
+	VacuumCostBalanceLocal = 0;
 	/*
 	 * Join as parallel workers. The leader process alone does that in case where
 	 * no workers launched.
 	 */
 	if (lps->leaderparticipates || lps->pcxt->nworkers_launched == 0)
+	{
+		/* Increment the active worker count. */
+		(void) pg_atomic_add_fetch_u32(VacuumActiveWorkers, 1);
+		
 		vacuum_or_cleanup_indexes_worker(Irel, nindexes, stats, lps->lvshared,
 										 vacrelstats->dead_tuples);
+		/*
+		 * We have completed the index vacuum so decrement the active worker
+		 * count.
+		 */
+		(void) pg_atomic_sub_fetch_u32(VacuumActiveWorkers, 1);		
+	}
 
 	/* Wait for all vacuum workers to finish */
 	WaitForParallelWorkersToFinish(lps->pcxt);
@@ -3077,6 +3099,7 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	prepare_index_statistics(shared, Irel, nindexes);
 	pg_atomic_init_u32(&(shared->nprocessed), 0);
 	pg_atomic_init_u32(&(shared->cost_balance), 0);
+	pg_atomic_init_u32(&(shared->nworkers), 0);
 
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
 	lps->lvshared = shared;
@@ -3289,10 +3312,20 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	if (lvshared->maintenance_work_mem_worker > 0)
 		maintenance_work_mem = lvshared->maintenance_work_mem_worker;
 
+	VacuumActiveWorkers = &(lvshared->nworkers);
+
+	/* Increment the active worker count. */
+	(void) pg_atomic_add_fetch_u32(VacuumActiveWorkers, 1);
+
 	/* Do either vacuuming indexes or cleaning indexes */
 	vacuum_or_cleanup_indexes_worker(indrels, nindexes, stats, lvshared,
 									 dead_tuples);
 
+	/*
+	 * We have completed the index vacuum so decrement the active worker count.
+	 */
+	(void) pg_atomic_sub_fetch_u32(VacuumActiveWorkers, 1);
+
 	/* update the total delay in the shared location. */
 	costdelay->stats[slot].time = VacuumCostTotalDelay;
 	costdelay->stats[slot].hit = _nhit;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb07a5a..2a2cae2 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2001,6 +2001,8 @@ vacuum_delay_point(void)
 		 */
 		if (VacuumSharedCostBalance != NULL)
 		{
+			int nworkers = pg_atomic_read_u32(VacuumActiveWorkers);
+
 			while (true)
 			{
 				uint32 shared_balance;
@@ -2011,12 +2013,14 @@ vacuum_delay_point(void)
 				/* compute new balance by adding the local value */
 				shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
 				new_balance = shared_balance + VacuumCostBalance;
-
-				if (new_balance >= VacuumCostLimit)
+				VacuumCostBalanceLocal += VacuumCostBalance;
+				if ((new_balance >= VacuumCostLimit) &&
+					(VacuumCostBalanceLocal > 0.5 * (VacuumCostLimit/nworkers)))
 				{
-					/* compute sleep time based on the shared cost balance */
-					msec = VacuumCostDelay * new_balance / VacuumCostLimit;
-					new_balance %= VacuumCostLimit;
+					/* compute sleep time based on the local cost balance */
+					msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
+					new_balance = shared_balance - VacuumCostBalanceLocal;
+					VacuumCostBalanceLocal = 0;
 				}
 
 				if (pg_atomic_compare_exchange_u32(VacuumSharedCostBalance,
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index de214f3..f7752a4 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int _nmiss = 0;
 int _ndirty = 0;
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
+int			VacuumCostBalanceLocal = 0;
 bool		VacuumCostActive = false;
 
 double		vacuum_cleanup_index_scale_factor;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ac883f6..81223ed 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -193,6 +193,7 @@ extern Size SyncScanShmemSize(void);
 
 /* in heap/vacuumlazy.c */
 extern pg_atomic_uint32	*VacuumSharedCostBalance;
+extern pg_atomic_uint32	*VacuumActiveWorkers;
 struct VacuumParams;
 extern void heap_vacuum_rel(Relation onerel,
 							struct VacuumParams *params, BufferAccessStrategy bstrategy);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 8d95b6e..55f19ec 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -261,6 +261,7 @@ extern int	VacuumPageMiss;
 extern int	VacuumPageDirty;
 
 extern int	VacuumCostBalance;
+extern int	VacuumCostBalanceLocal;
 extern bool VacuumCostActive;
 
 extern double vacuum_cleanup_index_scale_factor;
#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#37)
Re: cost based vacuum (parallel)

On Tue, Nov 12, 2019 at 3:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have made another 2 changes on
top of patch3:
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the workers quite
well compared to other approaches.

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think
you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

I have modified patch4 and ran with different values. But I don't see
much difference in the values with patch4. In fact, I removed the
condition for the local balancing check completely and still the delays
are the same. I think this is because with patch4 the workers are only
reducing their own balance and also delaying only as much as their
local balance. So maybe the second condition will not have much
impact.

Yeah, but I suspect the condition you mentioned (try to perform the
delay only when the local balance exceeds a certain threshold) can
have an impact in some other scenarios. So, it is better to retain
it. I feel the overall results look sane and the approach seems
reasonable to me.

I have revised patch4 so that it doesn't depend upon a fixed number of
workers; instead, I have dynamically updated the worker count.

Thanks. Sawada-San, by any chance, can you try some of the tests done
by Dilip or some similar tests just to rule out any sort of
machine-specific dependency?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#39Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Kapila (#38)
Re: cost based vacuum (parallel)

On Tue, 12 Nov 2019 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 12, 2019 at 3:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have made another 2 changes on
top of patch3:
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the workers quite
well compared to other approaches.

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think
you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

I have modified patch4 and ran with different values. But I don't see
much difference in the values with patch4. In fact, I removed the
condition for the local balancing check completely and still the delays
are the same. I think this is because with patch4 the workers are only
reducing their own balance and also delaying only as much as their
local balance. So maybe the second condition will not have much
impact.

Yeah, but I suspect the condition you mentioned (try to perform the
delay only when the local balance exceeds a certain threshold) can
have an impact in some other scenarios. So, it is better to retain
it. I feel the overall results look sane and the approach seems
reasonable to me.

I have revised patch4 so that it doesn't depend upon a fixed number of
workers; instead, I have dynamically updated the worker count.

Thanks. Sawada-San, by any chance, can you try some of the tests done
by Dilip or some similar tests just to rule out any sort of
machine-specific dependency?

Sure. I'll try it tomorrow.

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#40Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Masahiko Sawada (#39)
Re: cost based vacuum (parallel)

On Tue, 12 Nov 2019 at 20:22, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Tue, 12 Nov 2019 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 12, 2019 at 3:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think it is difficult to get the exact balance, but we can try
to be as close as possible. We can try to play with the threshold and
another possibility is to try to sleep in proportion to the amount of
I/O done by the worker.

I have done another experiment where I have made another 2 changes on
top of patch3:
a) Only reduce the local balance from the total shared balance
whenever it's applying delay
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the workers quite
well compared to other approaches.

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think
you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

I have modified patch4 and ran with different values. But I don't see
much difference in the values with patch4. In fact, I removed the
condition for the local balancing check completely and still the delays
are the same. I think this is because with patch4 the workers are only
reducing their own balance and also delaying only as much as their
local balance. So maybe the second condition will not have much
impact.

Yeah, but I suspect the condition you mentioned (try to perform the
delay only when the local balance exceeds a certain threshold) can
have an impact in some other scenarios. So, it is better to retain
it. I feel the overall results look sane and the approach seems
reasonable to me.

I have revised patch4 so that it doesn't depend upon a fixed number of
workers; instead, I have dynamically updated the worker count.

Thanks. Sawada-San, by any chance, can you try some of the tests done
by Dilip or some similar tests just to rule out any sort of
machine-specific dependency?

Sure. I'll try it tomorrow.

I've done some tests while changing the shared buffer size, delays, and
number of workers. The overall results have a similar tendency to the
results shared by Dilip and look reasonable to me.

* test.sh

shared_buffers = '4GB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 1;
worker 0 delay=89.315000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=88.860000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.290000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.805000 total io=16378 hit=4318 miss=0 dirty=603

shared_buffers = '1GB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 1;
worker 0 delay=89.210000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.325000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.870000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.735000 total io=16378 hit=4318 miss=0 dirty=603

shared_buffers = '512MB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 1;
worker 0 delay=88.480000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=88.635000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.600000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.660000 total io=16378 hit=4318 miss=0 dirty=603

shared_buffers = '512MB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 5;
worker 0 delay=447.725000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=445.850000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=445.125000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=409.025000 total io=16378 hit=4318 miss=0 dirty=603

shared_buffers = '512MB';
max_parallel_maintenance_workers = 2;
vacuum_cost_delay = 5;
worker 0 delay=854.750000 total io=34309 hit=22209 miss=0 dirty=605
worker 1 delay=446.500000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=444.175000 total io=17931 hit=17891 miss=0 dirty=2

---
* test1.sh

shared_buffers = '4GB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 1;
worker 0 delay=178.205000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.550000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=178.660000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.280000 total io=44322 hit=8352 miss=1199 dirty=1199

shared_buffers = '1GB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 1;
worker 0 delay=178.035000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.535000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=178.585000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.465000 total io=44322 hit=8352 miss=1199 dirty=1199

shared_buffers = '512MB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 1;
worker 0 delay=1795.900000 total io=357911 hit=1 miss=35787 dirty=2
worker 1 delay=1790.700000 total io=357911 hit=1 miss=35787 dirty=2
worker 2 delay=179.000000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.355000 total io=44322 hit=8352 miss=1199 dirty=1199

shared_buffers = '512MB';
max_parallel_maintenance_workers = 6;
vacuum_cost_delay = 5;
worker 0 delay=8958.500000 total io=357911 hit=1 miss=35787 dirty=2
worker 1 delay=8950.000000 total io=357911 hit=1 miss=35787 dirty=2
worker 2 delay=894.150000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=1106.400000 total io=44322 hit=8352 miss=1199 dirty=1199

shared_buffers = '512MB';
max_parallel_maintenance_workers = 2;
vacuum_cost_delay = 5;
worker 0 delay=8956.500000 total io=357911 hit=1 miss=35787 dirty=2
worker 1 delay=8955.050000 total io=357893 hit=3 miss=35785 dirty=2
worker 2 delay=2002.825000 total io=80150 hit=44140 miss=1199 dirty=1201

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#41Mahendra Singh
mahi6run@gmail.com
In reply to: Amit Kapila (#35)
Re: cost based vacuum (parallel)

On Mon, 11 Nov 2019 at 17:56, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker".

My Bad
I think

you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

Okay, I will retest with both patch3 and patch4 for both the scenarios.
I will also try with different multipliers.

One more thing: I think we should also test these cases with a varying
number of indexes (say 2, 6, 8, etc.) and then, probably, we should
test with a varying number of workers where the number of workers is
less than the number of indexes. You can do these after finishing your
previous experiments.

On top of the parallel vacuum patch, I applied Dilip's patch
(0001-vacuum_costing_test.patch). I have tested by varying the number
of indexes and the number of workers, and compared the shared costing
base patch (0001-vacuum_costing_test.patch) vs the latest shared
costing patch (shared_costing_plus_patch4_v1.patch).
With the shared costing base patch, I can see that the delay is not in
sync with the I/O, which is resolved by applying
shared_costing_plus_patch4_v1.patch. I have also observed that the
total delay is slightly reduced with the shared_costing_plus_patch4_v1.patch patch.

Below is the full testing summary:
*Test setup:*
step1) Apply parallel vacuum patch
step2) Apply the 0001-vacuum_costing_test.patch patch (on top of this
patch, the delay is not in sync with the I/O)
step3) Apply shared_costing_plus_patch4_v1.patch (the delay is in sync
with the I/O)

*Configuration settings:*
autovacuum = off
max_parallel_workers = 30
shared_buffers = 2GB
max_parallel_maintenance_workers = 20
vacuum_cost_limit = 2000
vacuum_cost_delay = 10

*Test 1: Vary indexes (2,4,6,8) but parallel workers are fixed at 4:*

Case 1) When indexes are 2:
*Without shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2

*With shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=87.780000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=87.995000 total io=17931 hit=17891 miss=0 dirty=2

Case 2) When indexes are 4:
*Without shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2

*With shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=87.430000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=87.175000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=86.340000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=88.020000 total io=17931 hit=17891 miss=0 dirty=2

Case 3) When indexes are 6:
*Without shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=110.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=160.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=90.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 4 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2

*With shared_costing_plus_patch4_v1.patch*:
WARNING: worker 0 delay=173.195000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=88.715000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=87.710000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=86.460000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 4 delay=89.435000 total io=17931 hit=17891 miss=0 dirty=2

Case 4) When indexes are 8:
*Without shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=130.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=190.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=110.000000 total io=35862 hit=35782 miss=0 dirty=4

*With shared_costing_plus_patch4_v1.patch*:
WARNING: worker 0 delay=174.700000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=177.880000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 2 delay=89.460000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=177.320000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=86.810000 total io=17931 hit=17891 miss=0 dirty=2

*Test 2: Indexes are 16 but parallel workers are 2,4,8:*

Case 1) When 2 parallel workers:
*Without shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=1513.230000 total io=307197 hit=85167 miss=22179 dirty=12
WARNING: worker 1 delay=1543.385000 total io=326553 hit=63133 miss=26322 dirty=10
WARNING: worker 2 delay=1633.625000 total io=302199 hit=65839 miss=23616 dirty=10

*With shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=1539.475000 total io=308175 hit=65175 miss=24280 dirty=10
WARNING: worker 1 delay=1251.200000 total io=250692 hit=71562 miss=17893 dirty=10
WARNING: worker 2 delay=1143.690000 total io=228987 hit=93857 miss=13489 dirty=12

Case 2) When 4 parallel workers:
*Without shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=1182.430000 total io=213567 hit=16037 miss=19745 dirty=4
WARNING: worker 1 delay=1202.710000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=210.000000 total io=89655 hit=89455 miss=0 dirty=10
WARNING: worker 3 delay=270.000000 total io=71724 hit=71564 miss=0 dirty=8
WARNING: worker 4 delay=851.825000 total io=188229 hit=58619 miss=12945 dirty=8

*With shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=1136.875000 total io=227679 hit=14469 miss=21313 dirty=4
WARNING: worker 1 delay=973.745000 total io=196881 hit=17891 miss=17891 dirty=4
WARNING: worker 2 delay=447.410000 total io=89655 hit=89455 miss=0 dirty=10
WARNING: worker 3 delay=833.235000 total io=168228 hit=40958 miss=12715 dirty=6
WARNING: worker 4 delay=683.200000 total io=136488 hit=64368 miss=7196 dirty=8

Case 3) When 8 parallel workers:
*Without shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=1022.300000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 1 delay=1072.770000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=140.035000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 5 delay=200.000000 total io=53802 hit=53672 miss=1 dirty=6
WARNING: worker 6 delay=130.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 7 delay=150.000000 total io=53793 hit=53673 miss=0 dirty=6

*With shared_costing_plus_patch4_v1.patch:*
WARNING: worker 0 delay=872.800000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 1 delay=885.950000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=175.680000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=259.560000 total io=53793 hit=53673 miss=0 dirty=6
WARNING: worker 4 delay=169.945000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 5 delay=613.845000 total io=125100 hit=45750 miss=7923 dirty=6
WARNING: worker 6 delay=171.895000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 7 delay=176.505000 total io=35862 hit=35782 miss=0 dirty=4

Thanks and Regards
Mahendra Thalor
EnterpriseDB: http://www.enterprisedb.com

#42Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#40)
Re: cost based vacuum (parallel)

On Wed, Nov 13, 2019 at 10:02 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

I've done some tests while changing the shared buffer size, delays, and
number of workers. The overall results have a similar tendency to the
results shared by Dilip and look reasonable to me.

Thanks, Sawada-san, for repeating the tests. I can see from your,
Dilip's, and Mahendra's testing that the delay is distributed depending
upon the I/O done by a particular worker and that the total I/O is also
as expected in various kinds of scenarios. So, I think this is the
better approach. Do you agree, or do you think we should still
investigate the other approach some more?

I would like to summarize this approach. The basic idea for parallel
vacuum is to allow the parallel workers and master backend to have a
shared view of the vacuum cost related parameters (mainly
VacuumCostBalance) and allow each worker to update it and then, based
on that, decide whether it needs to sleep. With this basic idea, we
found that in some cases the throttling is not accurate, as explained
with an example in my email above [1]/messages/by-id/CAA4eK1JvxBTWTPqHGx1X7in7j42ZYwuKOZUySzH3YMwTNRE-2Q@mail.gmail.com and then in the tests performed by
Dilip and others in the following emails (in short, the workers doing
more I/O can be throttled less). Then, as discussed in an email later
[2]/messages/by-id/CAA4eK1K9kCqLKbVA9KUuuarjj+sNYqrmf6UAFok5VTgZ8evWoA@mail.gmail.com,
we tried to avoid letting those workers sleep which have done
less or no I/O as compared to other workers. This ensured that
workers who are doing more I/O got throttled more. The idea is to
allow any worker to sleep only if it has performed I/O above a
certain threshold and the overall balance is more than the cost_limit
set by the system. Then we allow the worker to sleep in proportion to
the work done by it and reduce the VacuumSharedCostBalance by the
amount consumed by the current worker. This scheme leads to the
desired throttling of different workers based on the work done by the
individual worker.
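
For reference, the core of that logic, condensed from the
shared_costing_plus_patch4_v1.patch attached earlier in this thread
(the compare-and-swap retry loop and the rest of vacuum_delay_point()
are omitted here), looks like this:

    int     nworkers = pg_atomic_read_u32(VacuumActiveWorkers);
    uint32  shared_balance = pg_atomic_read_u32(VacuumSharedCostBalance);
    uint32  new_balance = shared_balance + VacuumCostBalance;

    VacuumCostBalanceLocal += VacuumCostBalance;

    /* sleep only if the overall limit is crossed and this worker did its share */
    if ((new_balance >= VacuumCostLimit) &&
        (VacuumCostBalanceLocal > 0.5 * (VacuumCostLimit / nworkers)))
    {
        /* sleep in proportion to the I/O this worker did since its last sleep */
        msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
        new_balance = shared_balance - VacuumCostBalanceLocal;
        VacuumCostBalanceLocal = 0;
    }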

We have tested this idea with various kinds of workloads, for example
by varying the shared buffer size, delays, and number of workers. We
have also tried with different numbers of indexes and workers. In all
the tests, we found that the workers are throttled in proportion to the
I/O being done by each particular worker.

[1]: /messages/by-id/CAA4eK1JvxBTWTPqHGx1X7in7j42ZYwuKOZUySzH3YMwTNRE-2Q@mail.gmail.com
[2]: /messages/by-id/CAA4eK1K9kCqLKbVA9KUuuarjj+sNYqrmf6UAFok5VTgZ8evWoA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#43Dilip Kumar
dilipbalaut@gmail.com
In reply to: Mahendra Singh (#41)
1 attachment(s)
Re: cost based vacuum (parallel)

On Thu, Nov 14, 2019 at 5:02 PM Mahendra Singh <mahi6run@gmail.com> wrote:

On Mon, 11 Nov 2019 at 17:56, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 11, 2019 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

..

I have tested the same with some other workload (test file attached).
I can see the same behaviour with this workload as well: with patch 4
the distribution of the delay is better compared to other patches,
i.e. workers with more I/O have more delay and workers with equal I/O
have almost equal delay. The only thing is that the total delay with
patch 4 is slightly less compared to other patches.

I see one problem with the formula you have used in the patch, maybe
that is causing the value of total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+ (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per discussion, the second part of the condition should be
"VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker".

My Bad
I think

you can once change this and try again. Also, please try with the
different values of threshold (0.3, 0.5, 0.7, etc.).

Okay, I will retest with both patch3 and patch4 for both the scenarios.
I will also try with different multipliers.

One more thing: I think we should also test these cases with a varying
number of indexes (say 2, 6, 8, etc.) and then, probably, we should
test with a varying number of workers where the number of workers is
less than the number of indexes. You can do these after finishing your
previous experiments.

On top of the parallel vacuum patch, I applied Dilip's patch (0001-vacuum_costing_test.patch). I have tested by varying the number of indexes and the number of workers, and compared the shared costing base patch (0001-vacuum_costing_test.patch) vs the latest shared costing patch (shared_costing_plus_patch4_v1.patch).
With the shared costing base patch, I can see that the delay is not in sync with the I/O, which is resolved by applying shared_costing_plus_patch4_v1.patch. I have also observed that the total delay is slightly reduced with the shared_costing_plus_patch4_v1.patch patch.

Below is the full testing summary:
Test setup:
step1) Apply parallel vacuum patch
step2) Apply the 0001-vacuum_costing_test.patch patch (on top of this patch, the delay is not in sync with the I/O)
step3) Apply shared_costing_plus_patch4_v1.patch (the delay is in sync with the I/O)

Configuration settings:
autovacuum = off
max_parallel_workers = 30
shared_buffers = 2GB
max_parallel_maintenance_workers = 20
vacuum_cost_limit = 2000
vacuum_cost_delay = 10

Test 1: Vary indexes (2,4,6,8) but parallel workers are fixed at 4:

Case 1) When indexes are 2:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2

With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=87.780000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=87.995000 total io=17931 hit=17891 miss=0 dirty=2

Case 2) When indexes are 4:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2

With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=87.430000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=87.175000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=86.340000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=88.020000 total io=17931 hit=17891 miss=0 dirty=2

Case 3) When indexes are 6:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=110.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=160.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=90.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 4 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2

With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=173.195000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=88.715000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=87.710000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=86.460000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 4 delay=89.435000 total io=17931 hit=17891 miss=0 dirty=2

Case 4) When indexes are 8:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=130.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=190.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=110.000000 total io=35862 hit=35782 miss=0 dirty=4

With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=174.700000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=177.880000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 2 delay=89.460000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=177.320000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=86.810000 total io=17931 hit=17891 miss=0 dirty=2

Test 2: Indexes are 16 but parallel workers are 2,4,8:

Case 1) When 2 parallel workers:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1513.230000 total io=307197 hit=85167 miss=22179 dirty=12
WARNING: worker 1 delay=1543.385000 total io=326553 hit=63133 miss=26322 dirty=10
WARNING: worker 2 delay=1633.625000 total io=302199 hit=65839 miss=23616 dirty=10

With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1539.475000 total io=308175 hit=65175 miss=24280 dirty=10
WARNING: worker 1 delay=1251.200000 total io=250692 hit=71562 miss=17893 dirty=10
WARNING: worker 2 delay=1143.690000 total io=228987 hit=93857 miss=13489 dirty=12

Case 2) When 4 parallel workers:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1182.430000 total io=213567 hit=16037 miss=19745 dirty=4
WARNING: worker 1 delay=1202.710000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=210.000000 total io=89655 hit=89455 miss=0 dirty=10
WARNING: worker 3 delay=270.000000 total io=71724 hit=71564 miss=0 dirty=8
WARNING: worker 4 delay=851.825000 total io=188229 hit=58619 miss=12945 dirty=8

With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1136.875000 total io=227679 hit=14469 miss=21313 dirty=4
WARNING: worker 1 delay=973.745000 total io=196881 hit=17891 miss=17891 dirty=4
WARNING: worker 2 delay=447.410000 total io=89655 hit=89455 miss=0 dirty=10
WARNING: worker 3 delay=833.235000 total io=168228 hit=40958 miss=12715 dirty=6
WARNING: worker 4 delay=683.200000 total io=136488 hit=64368 miss=7196 dirty=8

Case 3) When 8 parallel workers:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1022.300000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 1 delay=1072.770000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=140.035000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 5 delay=200.000000 total io=53802 hit=53672 miss=1 dirty=6
WARNING: worker 6 delay=130.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 7 delay=150.000000 total io=53793 hit=53673 miss=0 dirty=6

With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=872.800000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 1 delay=885.950000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=175.680000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=259.560000 total io=53793 hit=53673 miss=0 dirty=6
WARNING: worker 4 delay=169.945000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 5 delay=613.845000 total io=125100 hit=45750 miss=7923 dirty=6
WARNING: worker 6 delay=171.895000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 7 delay=176.505000 total io=35862 hit=35782 miss=0 dirty=4

It seems that the bigger delay difference (8%-9%), observed with the
higher number of indexes, is due to an I/O difference: for example, in
case 3 the total page misses are 35780 without the patch but 43703
with it. With more indexes your data no longer fits in shared buffers,
so the page hits/misses vary from run to run, and that causes the
variance in the total delay. The other issue, where the delay with the
patch is 2-3% less, is really a problem of the
"0001-vacuum_costing_test" patch, because that patch only reports the
delay during the index vacuuming phase, not the total delay. If we
look at the total delay, it should be the same. A modified version of
0001-vacuum_costing_test that prints the total delay is attached.
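
For reference, the "total io" figure in these WARNING lines is the
cost-weighted sum maintained by the instrumentation (see the
ReadBuffer_common/MarkBufferDirty hunks in the attached patch):
io = hit * VacuumCostPageHit + miss * VacuumCostPageMiss +
dirty * VacuumCostPageDirty. A tiny standalone sketch of that
arithmetic, assuming the cost GUCs were left at their defaults
(vacuum_cost_page_hit = 1, vacuum_cost_page_miss = 10,
vacuum_cost_page_dirty = 20):

#include <stdio.h>

int
main(void)
{
	/* e.g. worker 0 of Test 1 above: hit=35782 miss=0 dirty=4 */
	int		hit = 35782, miss = 0, dirty = 4;
	int		cost_hit = 1, cost_miss = 10, cost_dirty = 20;	/* assumed default costs */
	int		io = hit * cost_hit + miss * cost_miss + dirty * cost_dirty;

	printf("total io=%d\n", io);	/* prints 35862, matching the log line */
	return 0;
}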

In my test.sh, I can see the total delay is almost the same.

Non-parallel vacuum
WARNING: VacuumCostTotalDelay = 11332.170000

Parallel vacuum with shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=89.230000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=85.205000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=87.290000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=78.365000 total io=16378 hit=4318 miss=0 dirty=603

WARNING: VacuumCostTotalDelay = 11331.690000

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-vacuum_costing_test_total_delay.patch (application/octet-stream)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a9d9f31..50b7b1d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -137,6 +137,7 @@
 #define PARALLEL_VACUUM_KEY_SHARED			1
 #define PARALLEL_VACUUM_KEY_DEAD_TUPLES		2
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT		3
+#define PARALLEL_VACUUM_KEY_COST_DELAY		4
 
 /*
  * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION disables the leader's
@@ -257,6 +258,21 @@ typedef struct LVSharedIndStats
 #define GetIndexBulkDeleteResult(s) \
 	((IndexBulkDeleteResult *)((char *)(s) + sizeof(LVSharedIndStats)))
 
+typedef struct LVDelayStats
+{
+	double	time;
+	int		hit;
+	int		miss;
+	int		dirty;
+} LVDelayStats;
+
+typedef struct LVCostDelay
+{
+	pg_atomic_uint32	nslot;
+	LVDelayStats 		stats[FLEXIBLE_ARRAY_MEMBER];
+} LVCostDelay;
+#define SizeOfLVCostDelay offsetof(LVCostDelay, stats) + sizeof(LVDelayStats)
+
 /* Struct for parallel lazy vacuum */
 typedef struct LVParallelState
 {
@@ -265,6 +281,8 @@ typedef struct LVParallelState
 	/* Shared information among parallel vacuum workers */
 	LVShared		*lvshared;
 
+	/* Shared cost delay. */
+	LVCostDelay		*lvcostdelay;
 	/*
 	 * Always true except for a debugging case where
 	 * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION are defined.
@@ -757,6 +775,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		parallel_workers = compute_parallel_workers(Irel, nindexes,
 													params->nworkers);
 
+	VacuumCostTotalDelay = 0;
 	if (parallel_workers > 0)
 	{
 		/*
@@ -1732,6 +1751,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					tups_vacuumed, num_tuples,
 					vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
+	elog(WARNING, "VacuumCostTotalDelay=%lf", VacuumCostTotalDelay);
 	pfree(buf.data);
 }
 
@@ -1967,6 +1987,10 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 										int nindexes, IndexBulkDeleteResult **stats,
 										LVParallelState *lps)
 {
+	int		i;
+	int		io;
+	double	costdelay;
+
 	Assert(!IsParallelWorker());
 	Assert(ParallelVacuumIsActive(lps));
 	Assert(nindexes > 0);
@@ -1994,6 +2018,9 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 								 lps->pcxt->nworkers_launched),
 						lps->pcxt->nworkers_launched, lps->pcxt->nworkers)));
 
+	costdelay = VacuumCostTotalDelay;
+	VacuumCostTotalDelay = 0;
+	_nhit=_nmiss=_ndirty=0;
 	/*
 	 * Join as parallel workers. The leader process alone does that in case where
 	 * no workers launched.
@@ -2011,6 +2038,30 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	/* Continue to use the shared balance value */
 	VacuumCostBalance = pg_atomic_read_u32(&(lps->lvshared->cost_balance));
 
+	io = _nhit * VacuumCostPageHit + _nmiss * VacuumCostPageMiss + _ndirty * VacuumCostPageDirty;
+	elog(WARNING, "worker 0 delay=%lf total io=%d hit=%d miss=%d dirty=%d",
+				  VacuumCostTotalDelay, io, _nhit, _nmiss, _ndirty);
+	/*
+	 * Collect all the delay from workers and add to total delay. The delay from leader
+	 * is already included in VacuumCostTotalDelay.
+	 */
+	for (i = 0; i < lps->pcxt->nworkers_launched; i++)
+	{
+		VacuumCostTotalDelay += lps->lvcostdelay->stats[i].time;
+		_nhit += lps->lvcostdelay->stats[i].hit;
+		_nmiss += lps->lvcostdelay->stats[i].miss;
+		_ndirty += lps->lvcostdelay->stats[i].dirty;
+		io = lps->lvcostdelay->stats[i].hit * VacuumCostPageHit +
+			 lps->lvcostdelay->stats[i].miss * VacuumCostPageMiss +
+			 lps->lvcostdelay->stats[i].dirty * VacuumCostPageDirty;
+		elog(WARNING, "worker %d delay=%lf total io=%d hit=%d miss=%d dirty=%d",
+					  i+1, lps->lvcostdelay->stats[i].time, io,
+					  lps->lvcostdelay->stats[i].hit,
+					  lps->lvcostdelay->stats[i].miss,
+					  lps->lvcostdelay->stats[i].dirty);	
+	}
+	VacuumCostTotalDelay += costdelay;
+
 	/*
 	 * We need to reinitialize the parallel context as no more index vacuuming and
 	 * index cleanup will be performed after that.
@@ -2968,10 +3019,12 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	ParallelContext *pcxt;
 	LVShared		*shared;
 	LVDeadTuples	*dead_tuples;
+	LVCostDelay		*costdelay;
 	long	maxtuples;
 	char	*sharedquery;
 	Size	est_shared;
 	Size	est_deadtuples;
+	Size	est_costdelay;
 	int		querylen;
 	int		i;
 
@@ -3043,6 +3096,14 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	sharedquery[querylen] = '\0';
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
 
+	/* Vacuum cost balance. */
+	est_costdelay = MAXALIGN(add_size(SizeOfLVCostDelay,
+								   mul_size(sizeof(LVDelayStats), nrequested)));
+	costdelay = (LVCostDelay *) shm_toc_allocate(pcxt->toc, est_costdelay);
+	pg_atomic_init_u32(&(costdelay->nslot), 0);
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_COST_DELAY, costdelay);
+	lps->lvcostdelay = costdelay;
+
 	return lps;
 }
 
@@ -3171,8 +3232,10 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	Relation	*indrels;
 	LVShared	*lvshared;
 	LVDeadTuples	*dead_tuples;
+	LVCostDelay		*costdelay;
 	int			nindexes;
 	char		*sharedquery;
+	int			slot;
 	IndexBulkDeleteResult **stats;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3207,6 +3270,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 												  PARALLEL_VACUUM_KEY_DEAD_TUPLES,
 												  false);
 
+	costdelay = (LVCostDelay *) shm_toc_lookup(toc,
+												   PARALLEL_VACUUM_KEY_COST_DELAY,
+												   false);
+	slot = pg_atomic_fetch_add_u32(&(costdelay->nslot), 1);
+
 	/* Set cost-based vacuum delay */
 	VacuumCostActive = (VacuumCostDelay > 0);
 	VacuumCostBalance = 0;
@@ -3214,6 +3282,7 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumPageMiss = 0;
 	VacuumPageDirty = 0;
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
+	_nhit = _nmiss = _ndirty = 0;
 
 	stats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
@@ -3225,6 +3294,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacuum_or_cleanup_indexes_worker(indrels, nindexes, stats, lvshared,
 									 dead_tuples);
 
+	/* update the total delay in the shared location. */
+	costdelay->stats[slot].time = VacuumCostTotalDelay;
+	costdelay->stats[slot].hit = _nhit;
+	costdelay->stats[slot].miss = _nmiss;
+	costdelay->stats[slot].dirty = _ndirty;
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
 	table_close(onerel, ShareUpdateExclusiveLock);
 }
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 905d173..bb07a5a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -412,6 +412,7 @@ vacuum(List *relations, VacuumParams *params,
 		VacuumPageHit = 0;
 		VacuumPageMiss = 0;
 		VacuumPageDirty = 0;
+		_nhit = _nmiss = _ndirty = 0;
 		VacuumSharedCostBalance = NULL;
 
 		/*
@@ -2046,6 +2047,8 @@ vacuum_delay_point(void)
 			/* update balance values for workers */
 			AutoVacuumUpdateDelay();
 
+			VacuumCostTotalDelay += msec;
+
 			/* Might have gotten an interrupt while sleeping */
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad1073..1ba71c8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,7 +770,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			VacuumPageHit++;
 
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageHit;
+				_nhit++;
+			}
 
 			TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 											  smgr->smgr_rnode.node.spcNode,
@@ -959,7 +962,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	VacuumPageMiss++;
 	if (VacuumCostActive)
+	{
 		VacuumCostBalance += VacuumCostPageMiss;
+		_nmiss++;
+	}
 
 	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 									  smgr->smgr_rnode.node.spcNode,
@@ -1500,7 +1506,10 @@ MarkBufferDirty(Buffer buffer)
 		VacuumPageDirty++;
 		pgBufferUsage.shared_blks_dirtied++;
 		if (VacuumCostActive)
+		{
 			VacuumCostBalance += VacuumCostPageDirty;
+			_ndirty++;
+		}
 	}
 }
 
@@ -3556,7 +3565,10 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			VacuumPageDirty++;
 			pgBufferUsage.shared_blks_dirtied++;
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageDirty;
+				_ndirty++;
+			}
 		}
 	}
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de..de214f3 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -139,11 +139,15 @@ int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
 int			VacuumCostLimit = 200;
 double		VacuumCostDelay = 0;
-
+double		VacuumCostTotalDelay = 0;
 int			VacuumPageHit = 0;
 int			VacuumPageMiss = 0;
 int			VacuumPageDirty = 0;
 
+int	_nhit = 0;
+int _nmiss = 0;
+int _ndirty = 0;
+
 int			VacuumCostBalance = 0;	/* working state for vacuum */
 bool		VacuumCostActive = false;
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03f..8d95b6e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -251,6 +251,10 @@ extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
 extern int	VacuumCostLimit;
 extern double VacuumCostDelay;
+extern double VacuumCostTotalDelay;
+extern int	_nhit;
+extern int _nmiss;
+extern int _ndirty;
 
 extern int	VacuumPageHit;
 extern int	VacuumPageMiss;
#44Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Kapila (#42)
Re: cost based vacuum (parallel)

On Fri, 15 Nov 2019 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 13, 2019 at 10:02 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

I've done some tests while changing the shared buffer size, delays and
number of workers. The overall results have a similar tendency to the
results shared by Dilip and look reasonable to me.

Thanks, Sawada-san, for repeating the tests. I can see from your,
Dilip's and Mahendra's testing that the delay is distributed according
to the I/O done by a particular worker, and that the total I/O is also
as expected in various kinds of scenarios. So, I think this is a
better approach. Do you agree, or do you think we should still
investigate the other approach further?

I would like to summarize this approach. The basic idea for parallel
vacuum is to let the parallel workers and the master backend share a
view of the vacuum cost related parameters (mainly VacuumCostBalance),
allow each worker to update it, and then decide based on that whether
it needs to sleep. With this basic idea alone, we found that in some
cases the throttling is not accurate, as explained with an example in
my email above [1] and confirmed by the tests performed by Dilip and
others in the following emails (in short, the workers doing more I/O
can end up being throttled less). Then, as discussed in a later email
[2], we tried a way to avoid putting to sleep those workers that have
done little or no I/O compared to the other workers. This ensures that
workers doing more I/O get throttled more. The idea is to allow a
worker to sleep only if it has performed I/O above a certain threshold
and the overall balance is more than the cost_limit set by the system.
The worker then sleeps in proportion to the work done by it and
reduces VacuumSharedCostBalance by the amount it has consumed. This
scheme leads to the desired throttling of the different workers based
on the work done by each individual worker.
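
To make this concrete, here is a minimal sketch of how the
proportional sleep could look in the delay check. It is only a sketch
under assumptions: VacuumSharedCostBalance is the pg_atomic_uint32 in
shared memory that the POC already uses, while the function name
compute_parallel_vacuum_delay, the per-worker accumulator
VacuumCostBalanceLocal and the 0.5 threshold factor are illustrative
and not taken from the patch.

/* Sketch only: proportional sleeping against the shared cost balance. */
static int	VacuumCostBalanceLocal = 0;	/* hypothetical per-worker accumulator */

static double
compute_parallel_vacuum_delay(int nworkers)
{
	double	msec = 0;
	uint32	shared_balance;

	/* Fold the cost accrued since the last check into the shared balance */
	shared_balance = pg_atomic_add_fetch_u32(VacuumSharedCostBalance,
											 VacuumCostBalance);
	VacuumCostBalanceLocal += VacuumCostBalance;
	VacuumCostBalance = 0;

	/*
	 * Sleep only if the overall balance has crossed the limit and this
	 * worker has itself done a fair share of the I/O; the sleep time is
	 * proportional to the worker's own contribution.
	 */
	if (shared_balance >= (uint32) VacuumCostLimit &&
		VacuumCostBalanceLocal > 0.5 * ((double) VacuumCostLimit / nworkers))
	{
		msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;

		/* Give back only what this worker has consumed */
		pg_atomic_sub_fetch_u32(VacuumSharedCostBalance,
								(uint32) VacuumCostBalanceLocal);
		VacuumCostBalanceLocal = 0;
	}

	return msec;
}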

We have tested this idea with various kinds of workloads, e.g. by
varying the shared buffer size, delays and number of workers. We have
also tried different numbers of indexes and workers. In all the
tests, we found that the workers are throttled in proportion to the
I/O done by each individual worker.

Thank you for summarizing!

I agree with this approach.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services