Recovery performance of standby for multiple concurrent truncates on large tables
Hello hackers,
Recently, the problem of slow performance when executing multiple DROP/TRUNCATE TABLE commands in a single transaction with large shared_buffers (as shown below) was solved by commit b416691.
BEGIN;
truncate tbl001;
...
truncate tbl050;
COMMIT;
However, we have a customer who needs to execute multiple concurrent TRUNCATEs (40-50) on different large tables (as shown below) in a shorter amount of time. That case is not covered by the previous commit's improvement.
BEGIN;
truncate tbl001;
COMMIT;
...
BEGIN;
truncate tbl050;
COMMIT;
[Problem]
Currently, when the standby replays the WAL for TRUNCATE/DROP TABLE, each record triggers a separate sequential scan of the whole shared buffer pool to check whether the table being deleted is cached there. As a result, if shared_buffers is large (e.g. 300GB) and the primary server fails while such records are being replayed, it takes a long time for the standby to complete recovery.
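To make the cost concrete, here is a rough sketch (purely illustrative; the types and names are simplified stand-ins, not the actual buffer manager code) of what replay effectively does today:

/* Illustrative sketch only: simplified stand-ins for the buffer manager's
 * structures, sized to show where the time goes. */
typedef struct SketchBufferTag
{
    unsigned int relNode;     /* relation file node the cached page belongs to */
    unsigned int blockNum;    /* block number within that relation */
} SketchBufferTag;

extern SketchBufferTag sketch_buffer_tags[];   /* one tag per shared buffer */
extern int             sketch_nbuffers;        /* ~39 million for 300GB */

static void
sketch_drop_relation_buffers(unsigned int relNode)
{
    int i;

    /* One full scan of the buffer pool for a single TRUNCATE/DROP replay. */
    for (i = 0; i < sketch_nbuffers; i++)
    {
        if (sketch_buffer_tags[i].relNode == relNode)
            sketch_buffer_tags[i].relNode = 0;   /* "invalidate" the buffer */
    }
}

static void
sketch_replay_drops(const unsigned int *dropped, int ndropped)
{
    int r;

    /* 50 separate TRUNCATEs mean 50 sequential full scans: O(N * NBuffers). */
    for (r = 0; r < ndropped; r++)
        sketch_drop_relation_buffers(dropped[r]);
}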
[Idea]
In the current implementation, the replay of each TRUNCATE/DROP TABLE scans the whole shared buffer pool.
One approach (though the idea is not really developed yet) is to improve recovery by delaying the shared buffer scan and invalidation (DropRelFileNodeBuffers) until after the next checkpoint (after failover completion). The replay of TRUNCATE/DROP TABLE would just make the checkpointer process remember which relations should be invalidated in the shared buffer pool during the subsequent checkpoint. The checkpointer would then scan the shared buffer pool only once to invalidate the buffers of all relations that were dropped or truncated.
However, this is still a rough idea, so I am not sure if it's feasible. I would like to know if the community has advice or alternative approaches for working around this.
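To make the idea a little more concrete, here is a minimal sketch of what I have in mind, reusing the simplified types from the sketch above (hypothetical names only, not a working patch):

#define SKETCH_MAX_PENDING_DROPS 64

/* Shared-memory list of relfilenodes whose buffers still need invalidation. */
extern unsigned int sketch_pending_drops[SKETCH_MAX_PENDING_DROPS];
extern int          sketch_npending;

/* Replay of TRUNCATE/DROP TABLE: no buffer scan, just remember the relation. */
static void
sketch_remember_dropped(unsigned int relNode)
{
    sketch_pending_drops[sketch_npending++] = relNode;
}

/* Run once by the checkpointer after replay: a single pass over the buffer
 * pool invalidates the buffers of every remembered relation at once. */
static void
sketch_drain_pending_drops(void)
{
    int i, d;

    for (i = 0; i < sketch_nbuffers; i++)
    {
        for (d = 0; d < sketch_npending; d++)
        {
            if (sketch_buffer_tags[i].relNode == sketch_pending_drops[d])
            {
                sketch_buffer_tags[i].relNode = 0;  /* invalidate, no write-out */
                break;
            }
        }
    }
    sketch_npending = 0;
}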
Any insights, advice, feedback?
Thank you in advance.
Regards,
Kirk Jamison
Hi,
On 2018-07-10 07:05:12 +0000, Jamison, Kirk wrote:
Hello hackers,
Recently, the problem of slow performance when executing multiple DROP/TRUNCATE TABLE commands in a single transaction with large shared_buffers (as shown below) was solved by commit b416691.
BEGIN;
truncate tbl001;
...
truncate tbl050;
COMMIT;
However, we have a customer who needs to execute multiple concurrent TRUNCATEs (40-50) on different large tables (as shown below) in a shorter amount of time. That case is not covered by the previous commit's improvement.
BEGIN;
truncate tbl001;
COMMIT;
...
BEGIN;
truncate tbl050;
COMMIT;
[Problem]
Currently, when the standby replays the WAL for TRUNCATE/DROP TABLE, each record triggers a separate sequential scan of the whole shared buffer pool to check whether the table being deleted is cached there. As a result, if shared_buffers is large (e.g. 300GB) and the primary server fails while such records are being replayed, it takes a long time for the standby to complete recovery.
If you do so sequentially on the primary, I'm not clear as to why you
think the issue is bigger in recovery?
[Idea]
In the current implementation, the replay of each TRUNCATE/DROP TABLE scans the whole shared buffer pool.
One approach (though the idea is not really developed yet) is to improve recovery by delaying the shared buffer scan and invalidation (DropRelFileNodeBuffers) until after the next checkpoint (after failover completion). The replay of TRUNCATE/DROP TABLE would just make the checkpointer process remember which relations should be invalidated in the shared buffer pool during the subsequent checkpoint. The checkpointer would then scan the shared buffer pool only once to invalidate the buffers of all relations that were dropped or truncated.
I think you'd run into a lot of very hairy details with this
approach. Consider what happens if client processes need fresh buffers
and need to write out a victim buffer. You'll need to know that the
relevant buffer is actually invalid. Thus the knowledge about the
"delayed" drops would need to be in shared buffers and scanned on every
dirty buffer writeout.
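In other words, every victim-buffer write-out would need an extra check along these lines (an illustrative sketch only, with hypothetical names in the same simplified style as the sketches earlier in the thread):

#include <stdbool.h>

/* Illustrative only. */
extern unsigned int sketch_pending_drops[];   /* the "delayed" drops, in shmem */
extern int          sketch_npending;

static bool
sketch_relation_was_dropped(unsigned int relNode)
{
    int d;

    for (d = 0; d < sketch_npending; d++)
    {
        if (sketch_pending_drops[d] == relNode)
            return true;
    }
    return false;
}

static void
sketch_evict_victim_buffer(unsigned int victim_relnode)
{
    /* The extra check needed on every dirty-buffer write-out: */
    if (sketch_relation_was_dropped(victim_relnode))
    {
        /* The buffer belongs to a dropped relation: it must be treated as
         * invalid and discarded, not written out. */
        return;
    }
    /* ... otherwise the normal dirty-page write-out happens here ... */
}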
However, this is still a rough idea, so I am not sure if it's feasible. I would like to know if the community has advice or alternative approaches for working around this.
Any insights, advice, feedback?
I personally think we should rather just work towards an ordered buffer
mapping implementation.
Greetings,
Andres Freund
On Tue, Jul 10, 2018 at 10:05 AM Jamison, Kirk <k.jamison@jp.fujitsu.com>
wrote:
In the current implementation, the replay of each TRUNCATE/DROP TABLE
scans the whole shared buffer pool.
One approach (though the idea is not really developed yet) is to improve
recovery by delaying the shared buffer scan and invalidation
(DropRelFileNodeBuffers) until after the next checkpoint (after failover
completion). The replay of TRUNCATE/DROP TABLE would just make the
checkpointer process remember which relations should be invalidated in the
shared buffer pool during the subsequent checkpoint. The checkpointer would
then scan the shared buffer pool only once to invalidate the buffers of all
relations that were dropped or truncated.
How about using the background writer for this? It seems to me that the
main reason to invalidate buffers would be to free them up for buffer
allocation, which is precisely the task of the background writer. When adding
a filenode to be invalidated, take note of the bgwriter position and add it
to a queue. When the bgwriter is advancing, check each buffer tag against a
hash table of filenodes being invalidated. When the background writer has
completed a loop it can remove the invalidated filenode. When the bgwriter
falls behind the clock sweep and there are filenodes to invalidate, it should
run the invalidation scan instead of skipping ahead. If there are already too
many filenodes being invalidated, then whoever is trying to add a new one
gets to run the invalidation scan until something can be evicted.
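Roughly, the flow could look like this (an illustrative sketch with hypothetical structures; a real implementation would use a proper shared hash table rather than a small array):

#include <stdbool.h>

/* Hypothetical structures for the bgwriter-driven invalidation. */
typedef struct SketchInvalEntry
{
    unsigned int relNode;      /* filenode being lazily invalidated */
    int          start_pos;    /* bgwriter sweep position when it was added */
    bool         in_use;
} SketchInvalEntry;

#define SKETCH_MAX_INVAL 64

extern SketchInvalEntry sketch_inval[SKETCH_MAX_INVAL];   /* shared memory */

/* Checked for each buffer the bgwriter visits during its sweep: if the
 * buffer's relation is being invalidated, drop the buffer instead of
 * writing it out. */
static bool
sketch_should_discard(unsigned int buf_relnode)
{
    int i;

    for (i = 0; i < SKETCH_MAX_INVAL; i++)
    {
        if (sketch_inval[i].in_use && sketch_inval[i].relNode == buf_relnode)
            return true;
    }
    return false;
}

/* Called as the bgwriter advances: when the sweep position has come all the
 * way back around to where an entry was added, every buffer has been checked
 * once and the entry can be removed. */
static void
sketch_retire_completed(int sweep_pos)
{
    int i;

    for (i = 0; i < SKETCH_MAX_INVAL; i++)
    {
        if (sketch_inval[i].in_use && sketch_inval[i].start_pos == sweep_pos)
            sketch_inval[i].in_use = false;
    }
}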
--
Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com/
Hi,
Thank you for your replies.
On Tue, July 10, 2018 4:15 PM, Andres Freund wrote:
I think you'd run into a lot of very hairy details with this approach. Consider what happens if client processes need fresh buffers and need to write out a victim buffer. You'll need to know that the relevant buffer is actually invalid. Thus the knowledge about the "delayed" drops would need to be in shared buffers and scanned on every dirty buffer writeout.
Yes, you're right. There are problems associated with the checkpointer, as you pointed out. My thought was to touch the processes involved in writing dirty pages to disk, which is a function of both the checkpointer and the background writer.
How about using the background writer for this? ...
...
And now that I think about it, Ants' suggestion above about the background writer seems to work better, as the bgwriter is more active and continuously writes dirty pages, while the checkpointer only does so at intervals. This seems to be a more viable solution and I'd appreciate advice. I'll explore this idea, and if possible, I'd like to develop a solution for the next commitfest.
I personally think we should rather just work towards an ordered buffer mapping implementation.
I understand that the main problem lies in scanning shared_buffers and that a buffer mapping implementation is underway(?) for PG12. I am not sure if the community has arrived at a consensus on that long-term fix. Would the community also welcome minor/smaller-scale improvements (just for this particular problem of WAL recovery for TRUNCATE TABLE)? If yes, then I'd like to work on a possible solution.
Regards,
Kirk Jamison
On 2018-07-19 00:53:14 +0000, Jamison, Kirk wrote:
Hi,
Thank you for your replies.
On Tue, July 10, 2018 4:15 PM, Andres Freund wrote:
I think you'd run into a lot of very hairy details with this approach. Consider what happens if client processes need fresh buffers and need to write out a victim buffer. You'll need to know that the relevant buffer is actually invalid. Thus the knowledge about the "delayed" drops would need to be in shared buffers and scanned on every dirty buffer writeout.
Yes, you're right. There are problems associated with the checkpointer, as you pointed out. My thought was to touch the processes involved in writing dirty pages to disk, which is a function of both the checkpointer and the background writer.
How about using the background writer for this? ...
...
And now that I think about it, Ants' suggestion above about the background writer seems to work better, as the bgwriter is more active and continuously writes dirty pages, while the checkpointer only does so at intervals. This seems to be a more viable solution and I'd appreciate advice. I'll explore this idea, and if possible, I'd like to develop a solution for the next commitfest.
It'd have very similar problems as I pointed out above.
I personally think we should rather just work towards an ordered buffer mapping implementation.
I understand that the main problem lies in scanning shared_buffers and that a buffer mapping implementation is underway(?) for PG12. I am not sure if the community has arrived at a consensus on that long-term fix. Would the community also welcome minor/smaller-scale improvements (just for this particular problem of WAL recovery for TRUNCATE TABLE)? If yes, then I'd like to work on a possible solution.
I think this is a case where the potential workarounds are complex
enough to use significant resources to get right, and are likely to make
properly fixing the issue harder. I'm willing to comment on proposals
that claim not to be problematic in those regards, but I have *SERIOUS*
doubts they exist.
Greetings,
Andres Freund
Hi Andres,
I think this is a case where the potential workarounds are complex enough to use significant resources to get right, and are likely to make properly fixing the issue harder. I'm willing to comment on proposals that claim not to be problematic in those regards, but I have *SERIOUS* doubts they exist.
Alright. But I'd still like to ask your thoughts about it (below).
The proposed design touches buffer invalidation during the recovery process of the standby server.
The question was about "how" to remember the buffers that belong to truncated/dropped tables, right?
1. Because the multiple scans of the whole shared buffer pool, one per concurrent TRUNCATE/DROP TABLE, are the cause of the time-consuming behavior, DURING the failover process while WAL is being applied, we temporarily delay the scanning and invalidation of shared buffers. At the same time, we remember the relations/relfilenodes (of dropped/truncated tables) by adding them to a hash table called the "skip list".
2. After WAL is applied, the checkpointer (or bgwriter) scans the shared buffer pool only ONCE, compares the pages against the skip list, and invalidates the relevant pages. Pages deleted from shared memory this way are not written back to disk.
Assuming the theory works, this design would only affect the behavior of the checkpointer (or maybe the bgwriter) during the recovery/failover process. Any feedback, thoughts?
BTW, are there any updates on whether the community will move forward anytime soon with the buffer mapping implementation you mentioned?
Regards,
Kirk Jamison
On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
1. Because the multiple scans of the whole shared buffer pool, one per concurrent TRUNCATE/DROP TABLE, are the cause of the time-consuming behavior, DURING the failover process while WAL is being applied, we temporarily delay the scanning and invalidation of shared buffers. At the same time, we remember the relations/relfilenodes (of dropped/truncated tables) by adding them to a hash table called the "skip list".
2. After WAL is applied, the checkpointer (or bgwriter) scans the shared buffer pool only ONCE, compares the pages against the skip list, and invalidates the relevant pages. Pages deleted from shared memory this way are not written back to disk.
Assuming the theory works, this design would only affect the behavior of the checkpointer (or maybe the bgwriter) during the recovery/failover process. Any feedback, thoughts?
How would this work if a relfilenode number that belonged to an old
relation got recycled for a new relation?
I think something like this could be made to work -- both on the
master and the standby, and not just while waiting for a failover --
if we did something like this:
(1) Limit the number of deferred drops to a reasonably small number
(one cache line? 1kB?).
(2) Background writer regularly does scans to clean out everything
from the deferred-drop list.
(3) If a foreground process finds the deferred-drop list is full when
it needs to add to it, it forces a clean-out of the list contents +
whatever new stuff it has in a single pass.
(4) If we are about to generate a relfilenode that's in the list, we
either force a clean-out or generate a different relfilenode instead.
(5) Every buffer cleaning operation checks the deferred-drop list and
invalidates without write-out if the buffer is found.
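In sketch form, the pieces above might fit together like this (purely illustrative, hypothetical names and sizes):

#include <stdbool.h>

/* Illustrative sketch only. */
#define SKETCH_DEFERRED_MAX 16        /* (1) small enough to check cheaply */

extern unsigned int sketch_deferred[SKETCH_DEFERRED_MAX];   /* shared memory */
extern int          sketch_ndeferred;

/* (5) Every buffer-cleaning operation asks this; if true, the buffer is
 * invalidated without being written out. */
static bool
sketch_deferred_contains(unsigned int relNode)
{
    int i;

    for (i = 0; i < sketch_ndeferred; i++)
    {
        if (sketch_deferred[i] == relNode)
            return true;
    }
    return false;
}

/* (2)/(3) One pass over the buffer pool that clears everything currently in
 * the list; run regularly by the background writer, or by a foreground
 * process that finds the list full while trying to add to it. */
static void
sketch_flush_deferred(void)
{
    /* ... single scan of shared_buffers here, discarding any buffer whose
     * relfilenode is in sketch_deferred[] ... */
    sketch_ndeferred = 0;
}

/* (3) Adding a deferred drop, forcing a clean-out when the list is full. */
static void
sketch_defer_drop(unsigned int relNode)
{
    if (sketch_ndeferred == SKETCH_DEFERRED_MAX)
        sketch_flush_deferred();      /* the foreground process pays for it */
    sketch_deferred[sketch_ndeferred++] = relNode;
}

/* (4) Refuse to hand out a relfilenode that is still pending invalidation. */
static bool
sketch_relfilenode_is_usable(unsigned int candidate)
{
    return !sketch_deferred_contains(candidate);
}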
It's not clear to me whether it would be worth the overhead of doing
something like this. Making relation drops faster at the cost of
making buffer cleaning slower could be a loser.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-07-30 16:01:53 -0400, Robert Haas wrote:
(1) Limit the number of deferred drops to a reasonably small number
(one cache line? 1kB?).
Yea, you'd have to, because we'd frequently need to check it, and it'd
need to be in shared memory. But that'd still leave us to regress to
O(n^2) as soon as a transaction goes over that limit - which I don't
think is that infrequent...
Greetings,
Andres Freund
Hi,
On 2018-07-30 05:22:48 +0000, Jamison, Kirk wrote:
BTW, are there any updates on whether the community will move forward
anytime soon with the buffer mapping implementation you
mentioned?
I'm continuing to work on it, but unfortunately there's a couple
projects that have higher priority atm :(. I'm doubtful I can have a
patchset in a committable shape for v12, but I'm pretty sure I'll have
it in a shape good enough to make progress towards v13. Sorry :(
Greetings,
Andres Freund
On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
1. Because the multiple scans of the whole shared buffer pool, one per concurrent TRUNCATE/DROP TABLE, are the cause of the time-consuming behavior, DURING the failover process while WAL is being applied, we temporarily delay the scanning and invalidation of shared buffers. At the same time, we remember the relations/relfilenodes (of dropped/truncated tables) by adding them to a hash table called the "skip list".
2. After WAL is applied, the checkpointer (or bgwriter) scans the shared buffer pool only ONCE, compares the pages against the skip list, and invalidates the relevant pages. Pages deleted from shared memory this way are not written back to disk.
Assuming the theory works, this design would only affect the behavior of the checkpointer (or maybe the bgwriter) during the recovery/failover process. Any feedback, thoughts?
How would this work if a relfilenode number that belonged to an old
relation got recycled for a new relation?
I think something like this could be made to work -- both on the
master and the standby, and not just while waiting for a failover --
if we did something like this:
(1) Limit the number of deferred drops to a reasonably small number
(one cache line? 1kB?).
(2) Background writer regularly does scans to clean out everything
from the deferred-drop list.
(3) If a foreground process finds the deferred-drop list is full when
it needs to add to it, it forces a clean-out of the list contents +
whatever new stuff it has in a single pass.
(4) If we are about to generate a relfilenode that's in the list, we
either force a clean-out or generate a different relfilenode instead.
(5) Every buffer cleaning operation checks the deferred-drop list and
invalidates without write-out if the buffer is found.
It's not clear to me whether it would be worth the overhead of doing
something like this. Making relation drops faster at the cost of
making buffer cleaning slower could be a loser.
Perhaps you could make it a bit more incremental, avoiding the full
clean-out scans (your points 2 and 3) while still allowing the
background writer to bear the brunt in the best case? Something like
this:
The zombie relfilenode tracker could be fixed-size and self-cleaning:
entries could be dropped when the CLOCK hand completes a full rotation
since the entry was created, and if it's full when you're trying to
insert a new entry you can scan the buffer pool until you've caused
one entry (the oldest, by clock hand-at-time-of-insertion) to be
removed to make space (so the worst case is a full pass, but hopefully
we can drop one entry before then). The main argument I can think of
against that idea is that it is tied to the current buffer eviction
strategy, using its buffer index-based clock hand as a horizon.
Extracting a useful horizon for algorithms like ARC or CAR would
probably need more thought, because their 'hands' jump around
following next pointers.
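In sketch form, the self-cleaning tracker could look something like this (illustrative only; hypothetical names, and a monotonically increasing tick in place of the real wrapping clock hand):

#include <stdbool.h>

/* Illustrative only. */
typedef struct SketchZombie
{
    unsigned int relNode;       /* dropped relfilenode awaiting invalidation */
    long         insert_tick;   /* clock-hand progress when it was inserted */
    bool         in_use;
} SketchZombie;

#define SKETCH_NZOMBIES 16

extern SketchZombie sketch_zombies[SKETCH_NZOMBIES];
extern int          sketch_nbuffers;       /* size of the buffer pool */

/* Self-cleaning: once the clock hand has advanced a full rotation past the
 * point where an entry was inserted, every buffer has been visited (and
 * either reused or discarded), so the entry can be dropped. */
static void
sketch_expire_zombies(long clock_tick)
{
    int i;

    for (i = 0; i < SKETCH_NZOMBIES; i++)
    {
        if (sketch_zombies[i].in_use &&
            clock_tick - sketch_zombies[i].insert_tick >= sketch_nbuffers)
            sketch_zombies[i].in_use = false;
    }
}

/* Insertion: if the table is full, the inserting backend would scan buffers
 * itself (advancing the horizon) until the oldest entry can be freed --
 * worst case one full pass, hopefully much less. */
static bool
sketch_add_zombie(unsigned int relNode, long clock_tick)
{
    int i;

    sketch_expire_zombies(clock_tick);
    for (i = 0; i < SKETCH_NZOMBIES; i++)
    {
        if (!sketch_zombies[i].in_use)
        {
            sketch_zombies[i].relNode = relNode;
            sketch_zombies[i].insert_tick = clock_tick;
            sketch_zombies[i].in_use = true;
            return true;
        }
    }
    return false;   /* caller falls back to scanning on behalf of the oldest */
}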
Anyway, it's a lot of complexity, and it falls back to worst cases
like today, and can also transfer work to innocent foreground
processes. I see why Andres says we should just get a better data
structure so we can make the guy doing the dropping pay for it up
front, but more efficiently. I suspect we may also want an ordered
data structure for write clustering purposes one day.
--
Thomas Munro
http://www.enterprisedb.com
From: Robert Haas [mailto:robertmhaas@gmail.com]
It's not clear to me whether it would be worth the overhead of doing
something like this.
Quite frankly, it's not really clear to me either.
Making relation drops faster at the cost of
making buffer cleaning slower could be a loser.
The purpose is not to make relation drops faster (on the primary), but to keep the failover time within 10 seconds. I don't really know how crucial that requirement is, but I feel it would be good for PostgreSQL to be able to guarantee a shorter failover time.
Regards
Takayuki Tsunakawa
From: 'Andres Freund' [mailto:andres@anarazel.de]
I'm continuing to work on it, but unfortunately there's a couple
projects that have higher priority atm :(. I'm doubtful I can have a
patchset in a committable shape for v12, but I'm pretty sure I'll have
it in a shape good enough to make progress towards v13. Sorry :(
We'd like to do it for PG 12, if Kirk's current proposal doesn't find a good landing point. Could you tell us roughly how many lines of code you expect, and which parts would be difficult? Is there any software you're referring to as a model (e.g. Linux's fsync, MySQL, etc.)?
Regards
Takayuki Tsunakawa
Hi,
I appreciate the feedback and suggestions.
On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
How would this work if a relfilenode number that belonged to an old
relation got recycled for a new relation?
..
I think something like this could be made to work -- both on the
master and the standby, and not just while waiting for a failover --
...
It's not clear to me whether it would be worth the overhead of doing
something like this. Making relation drops faster at the cost of
making buffer cleaning slower could be a loser.
The deferred list holds information such as the relfilenode and the head/top
page number of the invalid pages. The standby in promote mode only registers
info in the deferred list when redoing the COMMIT record. It then scans the
shared buffer cache only once before the end of the promotion, and discards
the pages belonging to the tables included in the deferred list. After removing
a page, it increments the head page number of the abovementioned invalid pages.
In this way, if an INSERT progresses while promoting (I'm not sure if that's
possible), the discard progresses too, and the application can refer to it.
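A rough sketch of that deferred list and the single discard pass (illustrative only; hypothetical names and simplified buffer tags, not a patch):

/* Illustrative sketch only. */
typedef struct SketchBufTag
{
    unsigned int relNode;      /* relation the cached page belongs to */
    unsigned int blockNum;     /* block number within that relation */
} SketchBufTag;

typedef struct SketchDeferredEntry
{
    unsigned int relNode;           /* truncated/dropped relation */
    unsigned int firstInvalidBlock; /* head of the invalid page range */
} SketchDeferredEntry;

extern SketchBufTag        sketch_buf_tags[];      /* one per shared buffer */
extern int                 sketch_nbuffers;
extern SketchDeferredEntry sketch_deferred_list[]; /* filled at COMMIT redo */
extern int                 sketch_ndeferred_entries;

/* Run once before promotion completes: a single pass over shared buffers
 * discards every cached page at or beyond each entry's invalid-page head,
 * without writing anything back to disk. */
static void
sketch_discard_deferred_pages(void)
{
    int i, d;

    for (i = 0; i < sketch_nbuffers; i++)
    {
        for (d = 0; d < sketch_ndeferred_entries; d++)
        {
            SketchDeferredEntry *e = &sketch_deferred_list[d];

            if (sketch_buf_tags[i].relNode == e->relNode &&
                sketch_buf_tags[i].blockNum >= e->firstInvalidBlock)
            {
                sketch_buf_tags[i].relNode = 0;   /* discard, no write-back */
                break;
            }
        }
    }
}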
As mentioned by Tsunakawa-san above, the purpose of the design is to make
the failover process as short as possible, not really to make the dropping
of relations faster. My initial thought was not to touch the regular
buffer invalidation process, but I am also not sure of the level of
complexity that the proposed design would introduce, nor how crucial the
requirement (shorter failover) is, so I asked for the community's feedback.
On Tue, July 31, 2018 8:27 AM, Thomas Munro wrote:
Anyway, it's a lot of complexity, and it falls back to worst cases
like today, and can also transfer work to innocent foreground processes.
I see why Andres says we should just get a better data structure so we
can make the guy doing the dropping pay for it up front, but more
efficiently. I suspect we may also want an ordered data structure for
write clustering purposes one day.
I also understand the value of solving the root cause of the problem; that's why
I asked Andres whether we could expect development from the community for v12
regarding the radix tree approach for buffer management, or even just a WIP
patch from anyone, regardless of whether it uses a radix tree or not.
Perhaps we'd also like to get involved, as this is our customer's problem too.
So I am just curious about the design of the radix tree approach. Maybe we should
start discussing what kinds of data structures, processing, etc. are involved?
I also read other design solutions from another thread [1].
a. Fujii-san's proposal
Add $SUBJECT for performance improvement; reloption to prevent vacuum
from truncating empty pages
b. Pavan's comment
What if we remember the buffers as seen by count_nondeletable_pages() and
then just discard those specific buffers instead of scanning the entire
shared_buffers again? Surely we revisit all to-be-truncated blocks before
actual truncation. So we already know which buffers to discard. And we're
holding exclusive lock at that point, so nothing can change underneath. Of
course, we can't really remember a large number of buffers, so we can do
this in small chunks. Scan last K blocks, remember those K buffers, discard
those K buffers, truncate the relation and then try for next K blocks. If
another backend requests lock on the table, we give up or retry after a
while.
[1]: /messages/by-id/20180419203802.hqrb4o2wexjnb2ft@alvherre.pgsql
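If I understand Pavan's idea correctly, it would look roughly like the following (an illustrative sketch; the helper functions are hypothetical placeholders for the real buffer lookup, buffer discard, relation truncation, and lock checks):

#include <stdbool.h>

/* Illustrative sketch only. */
#define SKETCH_CHUNK_BLOCKS 1024    /* K: how many tail blocks per round */

extern unsigned int sketch_rel_nblocks(unsigned int relNode);
extern int  sketch_find_buffer(unsigned int relNode, unsigned int blockNum);
extern void sketch_discard_buffer(int buf_id);
extern void sketch_truncate_rel(unsigned int relNode, unsigned int new_nblocks);
extern bool sketch_lock_requested(void);   /* another backend wants the table? */

static void
sketch_chunked_truncate(unsigned int relNode, unsigned int target_nblocks)
{
    unsigned int nblocks = sketch_rel_nblocks(relNode);

    while (nblocks > target_nblocks)
    {
        unsigned int chunk_start;
        unsigned int blk;

        if (nblocks - target_nblocks > SKETCH_CHUNK_BLOCKS)
            chunk_start = nblocks - SKETCH_CHUNK_BLOCKS;
        else
            chunk_start = target_nblocks;

        /* Discard only the buffers holding the last K blocks, which were
         * just visited, instead of scanning all of shared_buffers. */
        for (blk = chunk_start; blk < nblocks; blk++)
        {
            int buf_id = sketch_find_buffer(relNode, blk);

            if (buf_id >= 0)
                sketch_discard_buffer(buf_id);
        }

        /* Truncate this chunk, then move on to the next K blocks. */
        sketch_truncate_rel(relNode, chunk_start);
        nblocks = chunk_start;

        /* If another backend requests a lock on the table, give up or retry
         * after a while rather than blocking it. */
        if (sketch_lock_requested())
            break;
    }
}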
Now, how do we move forward?
Thank you everyone.
Regards,
Kirk Jamison