Speedup of relation deletes during recovery

Started by Fujii Masaoabout 8 years ago38 messageshackers
Jump to latest
#1Fujii Masao
masao.fujii@gmail.com

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Regards,

--
Fujii Masao

Attachments:

speedup_relation_deletes_during_recovery_v1.patchapplication/octet-stream; name=speedup_relation_deletes_during_recovery_v1.patchDownload+23-6
#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#1)
Re: Speedup of relation deletes during recovery

Hello.

At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

It is obviously a left-over of 279628a0a7. This patch applies the
same change with the patch and looks fine for me. Note that
FinishPreparedTransaction has the same loop over smgrdounlink.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#2)
Re: Speedup of relation deletes during recovery

On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote:

At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

It is obviously a left-over of 279628a0a7. This patch applies the
same change with the patch and looks fine for me. Note that
FinishPreparedTransaction has the same loop over smgrdounlink.

Yeah, I was just going to say the same after looking at Fujii-san's
patch. This would also cause smgrdounlink() to become unused in the
core code. So this could just be... Removed?

/me Preparing shelter against incoming wraith.

That's not v11 material at this point in any case.
--
Michael

#4Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Fujii Masao (#1)
RE: Speedup of relation deletes during recovery

From: Fujii Masao [mailto:masao.fujii@gmail.com]

When multiple relations are deleted at the same transaction, the files of
those relations are deleted by one call to smgrdounlinkall(), which leads
to scan whole shared_buffers only one time. OTOH, during recovery,
smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much especially
when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Nice catch. As Horiguchi-san and Michael already commented, the patch looks good.

As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and possibly TRUNCATE as well.) How should we proceed with this?

/messages/by-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23

Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE scans the shared buffers once for each table, TRUNCATE does that for each fork, resulting in three scans per table. How about changing the following functions

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
BlockNumber firstDelBlock)

to

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
BlockNumber *firstDelBlock, int nforks)

to perform the scan only once per table? If there doesn't seem to be a problem, I think I'll submit the patch next month. I'd welcome if anyone could do that.

Regards
Takayuki Tsunakawa

#5Jerry Sievers
gsievers19@comcast.net
In reply to: Fujii Masao (#1)
Re: Speedup of relation deletes during recovery

Fujii Masao <masao.fujii@gmail.com> writes:

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

Wonder if this is the case for streaming standbys replaying truncates
also?

Just couple days ago I was running a pg_upgrade test scenario but did
not reach the point of upgrade yet.

We made snapshots of our large reporting system putting them into a
master and 1-slave streaming replication configuration.

There are hundreds of unlogged tables that we wish to trunc before the
upgrade to save time during the rsyncing of standbys, since a standard
rsync replicates all data which is timeconsumeing and useless sending to
the standbys.

Indeed the tables are already trunc'd on the test master since it too
was a recovered system upon initial start but the truncates took place
anyway since it's part of our framework. It ran in just a few seconds
on master.

The standby however was $slow replaying these truncates which we noticed
because the upgrade wil not proceed until master and 1 or more standbys
are confirmed all at same checkpoint.

I straced the standby's startup process to find it unlinking lots of
tables, getting -1 on the unlink syscall since the non-init fork files
were already missing.

I can't describe just how slow if was but took minutes and the lines
being output by strace were *not* blowing up my display as happens
generally when any busy process is straced.

And, the test systems were config'd with $large shared buffers of 64G.

Please advise

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Regards,

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

#6Jerry Sievers
gsievers19@comcast.net
In reply to: Fujii Masao (#1)
Re: Speedup of relation deletes during recovery

Fujii Masao <masao.fujii@gmail.com> writes:

Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

Forgot to mention version in my TLDR prev reply :-)

version
------------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu (Ubuntu 9.5.12-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Regards,

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

#7Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Jerry Sievers (#5)
RE: Speedup of relation deletes during recovery

From: Jerry Sievers [mailto:gsievers19@comcast.net]

Wonder if this is the case for streaming standbys replaying truncates
also?

Yes, As I wrote in my previous mail, TRUNCATE is worse than DROP TABLE.

Regards
Takayuki Tsunakawa

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#3)
Re: Speedup of relation deletes during recovery

On Fri, Mar 30, 2018 at 11:46 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote:

At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

It is obviously a left-over of 279628a0a7. This patch applies the
same change with the patch and looks fine for me. Note that
FinishPreparedTransaction has the same loop over smgrdounlink.

Thanks for the review! I also changed FinishPreparedTransaction() so that
it uses smgrdounlinkall(). Patch attached.

Yeah, I was just going to say the same after looking at Fujii-san's
patch. This would also cause smgrdounlink() to become unused in the
core code. So this could just be... Removed?

Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
Per the discussion [1]/messages/by-id/1471.1339106082@sss.pgh.pa.us, this unused function was left intentionally.
But it's still dead code since 2012, so I'd like to remove it. Patch attached.

[1]: /messages/by-id/1471.1339106082@sss.pgh.pa.us
/messages/by-id/1471.1339106082@sss.pgh.pa.us

Regards,

--
Fujii Masao

Attachments:

speedup_relation_deletes_during_recovery_v2.patchapplication/octet-stream; name=speedup_relation_deletes_during_recovery_v2.patchDownload+37-17
remove_smgrdounlink_v1.patchapplication/octet-stream; name=remove_smgrdounlink_v1.patchDownload+0-60
remove_smgrdounlinkfork_v1.patchapplication/octet-stream; name=remove_smgrdounlinkfork_v1.patchDownload+0-53
#9Fujii Masao
masao.fujii@gmail.com
In reply to: Tsunakawa, Takayuki (#4)
Re: Speedup of relation deletes during recovery

On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: Fujii Masao [mailto:masao.fujii@gmail.com]

When multiple relations are deleted at the same transaction, the files of
those relations are deleted by one call to smgrdounlinkall(), which leads
to scan whole shared_buffers only one time. OTOH, during recovery,
smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much especially
when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Nice catch. As Horiguchi-san and Michael already commented, the patch looks good.

As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and possibly TRUNCATE as well.) How should we proceed with this?

/messages/by-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23

Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE scans the shared buffers once for each table, TRUNCATE does that for each fork, resulting in three scans per table. How about changing the following functions

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
BlockNumber firstDelBlock)

to

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
BlockNumber *firstDelBlock, int nforks)

to perform the scan only once per table? If there doesn't seem to be a problem, I think I'll submit the patch next month. I'd welcome if anyone could do that.

Yeah, it's worth working on this problem. To decrease the number of scans of
shared_buffers, you would need to change the order of truncations of files and
WAL logging. In RelationTruncate(), currently WAL is logged after FSM and VM
are truncated. IOW, with the patch, FSM and VM would need to be truncated after
WAL logging. You would need to check whether this reordering is valid.

Regards,

--
Fujii Masao

#10Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#8)
Re: Speedup of relation deletes during recovery

On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote:

Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
Per the discussion [1], this unused function was left intentionally.
But it's still dead code since 2012, so I'd like to remove it. Patch attached.

Indeed, it's close to six years and the status is the same. So let's
drop it. I have been surrounding the area to see if any modules
actually use those, particularly on github, but I could not find
callers.

The patch looks logically fine to me. In your first message, you
mentioned that the replay time increased a lot. Do you have numbers to
share with some large settings of shared_buffers?

It would be better to wait for v12 branch to open before pushing
anything, as the focus is on stabililizing things on v11.
--
Michael

#11Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#10)
Re: Speedup of relation deletes during recovery

On Wed, Apr 18, 2018 at 10:44 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote:

Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
Per the discussion [1], this unused function was left intentionally.
But it's still dead code since 2012, so I'd like to remove it. Patch attached.

Indeed, it's close to six years and the status is the same. So let's
drop it. I have been surrounding the area to see if any modules
actually use those, particularly on github, but I could not find
callers.

The patch looks logically fine to me. In your first message, you
mentioned that the replay time increased a lot. Do you have numbers to
share with some large settings of shared_buffers?

No. But after my colleague truncated more than one hundred tables on
the server with shared_buffers = 300GB, the recovery could not finish
even after 10 minutes since the startup of the recovery. So I had to
shutdown the server immediately, set shared_buffers to very small
temporarily and start the server to cause the recovery to finish soon.

It would be better to wait for v12 branch to open before pushing
anything, as the focus is on stabililizing things on v11.

Yes, so I added this to next CommitFest.

Regards,

--
Fujii Masao

#12Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#11)
Re: Speedup of relation deletes during recovery

On Thu, Apr 19, 2018 at 01:52:26AM +0900, Fujii Masao wrote:

No. But after my colleague truncated more than one hundred tables on
the server with shared_buffers = 300GB, the recovery could not finish
even after 10 minutes since the startup of the recovery. So I had to
shutdown the server immediately, set shared_buffers to very small
temporarily and start the server to cause the recovery to finish soon.

Hm, OK. Here are simple functions to create and drop many relations in
a single transaction:
create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
for i in 1..numtabs loop
query_string := 'create table tab_' || i::text || ' (a int);';
execute query_string;
end loop;
end;
$$ language plpgsql;

create or replace function drop_tables(numtabs int)
returns void as $$
declare query_string text;
begin
for i in 1..numtabs loop
query_string := 'drop table tab_' || i::text;
execute query_string;
end loop;
end;
$$ language plpgsql;

With a server running 8GB of shared buffers (you likely need to increase
max_locks_per_transaction) and 10k relations created and dropped, it
took 50 seconds to finish redo of roughly 4 segments:
2018-04-19 11:17:15 JST [7472]: [14-1] db=,user=,app=,client= LOG: redo
starts at 0/15DF2E8
2018-04-19 11:18:05 JST [7472]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E46160: wanted 24, got 0
2018-04-19 11:18:05 JST [7472]: [16-1] db=,user=,app=,client= LOG: redo
done at 0/4A7C6E

Then with your patch it took... Barely 1 second.
2018-04-19 11:20:33 JST [11690]: [14-1] db=,user=,app=,client= LOG:
redo starts at 0/15DF2E8
2018-04-19 11:20:34 JST [11690]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E299B0: wanted 24, got 0
2018-04-19 11:20:34 JST [11690]: [16-1] db=,user=,app=,client= LOG:
redo done at 0/4E29978

Looking at profiles with perf I can also see that 95% of the time is
spent in DropRelFileNodesAllBuffers which is where the startup process
spends most of its CPU. So HEAD is booh. And your patch is excellent
in this context.
--
Michael

#13Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Fujii Masao (#9)
RE: Speedup of relation deletes during recovery

From: Fujii Masao [mailto:masao.fujii@gmail.com]

Yeah, it's worth working on this problem. To decrease the number of scans
of
shared_buffers, you would need to change the order of truncations of files
and
WAL logging. In RelationTruncate(), currently WAL is logged after FSM and
VM
are truncated. IOW, with the patch, FSM and VM would need to be truncated
after
WAL logging. You would need to check whether this reordering is valid.

Sure, thank you for advice.

Takayuki Tsunakawa

#14Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#8)
Re: Speedup of relation deletes during recovery

Hi,

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very significantly.

--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1445,6 +1445,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
int			ndelrels;
SharedInvalidationMessage *invalmsgs;
int			i;
+	SMgrRelation *srels = NULL;
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
@@ -1534,13 +1535,16 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
delrels = abortrels;
ndelrels = hdr->nabortrels;
}
+
+	srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-	{
-		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+		srels[i] = smgropen(delrels[i], InvalidBackendId);
-		smgrdounlink(srel, false);
-		smgrclose(srel);
-	}
+	smgrdounlinkall(srels, ndelrels, false);
+
+	for (i = 0; i < ndelrels; i++)
+		smgrclose(srels[i]);
+	pfree(srels);

This code is now duplicated three times - shouldn't we just add a
function that encapsulates dropping relations in a commit/abort record?

Greetings,

Andres Freund

#15Teodor Sigaev
teodor@sigaev.ru
In reply to: Andres Freund (#14)
Re: Speedup of relation deletes during recovery

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very significantly.

+1, we faced with that too
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
#16Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#14)
Re: Speedup of relation deletes during recovery

Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:

+
+	srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-	{
-		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+		srels[i] = smgropen(delrels[i], InvalidBackendId);
-		smgrdounlink(srel, false);
-		smgrclose(srel);
-	}
+	smgrdounlinkall(srels, ndelrels, false);
+
+	for (i = 0; i < ndelrels; i++)
+		smgrclose(srels[i]);
+	pfree(srels);

Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to? That's
a bit hacky, but afaict should work?

Greetings,

Andres Freund

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#16)
Re: Speedup of relation deletes during recovery

On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:

+
+   srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-   {
-           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+           srels[i] = smgropen(delrels[i], InvalidBackendId);
-           smgrdounlink(srel, false);
-           smgrclose(srel);
-   }
+   smgrdounlinkall(srels, ndelrels, false);
+
+   for (i = 0; i < ndelrels; i++)
+           smgrclose(srels[i]);
+   pfree(srels);

Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to? That's
a bit hacky, but afaict should work?

The easier (but tricky) fix for that is to call smgrclose() for
each SMgrRelation in the reverse order. That is, we should do

for (i = ndelrels - 1; i >= 0; i--)
smgrclose(srels[i]);

instead of

for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);

Since add_to_unowned_list() always adds the entry into the head of
the list, each SMgrRelation entry is added into the "unowned" list in
descending order of its identifier, i.e., SMgrRelation entry with larger
identifier should be placed ahead of one with smaller identifier.
So if we calls remove_from_unowed_list() in descending order of
SMgrRelation entry's identifier, the performance of
remove_from_unowned_list()'s search should O(1). IOW, we can
immediately find the SMgrRelation entry to remove, at the head of
the list.

Regards,

--
Fujii Masao

#18Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#17)
Re: Speedup of relation deletes during recovery

On 2018-06-19 03:06:54 +0900, Fujii Masao wrote:

On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:

+
+   srels = palloc(sizeof(SMgrRelation) * ndelrels);
for (i = 0; i < ndelrels; i++)
-   {
-           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+           srels[i] = smgropen(delrels[i], InvalidBackendId);
-           smgrdounlink(srel, false);
-           smgrclose(srel);
-   }
+   smgrdounlinkall(srels, ndelrels, false);
+
+   for (i = 0; i < ndelrels; i++)
+           smgrclose(srels[i]);
+   pfree(srels);

Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to? That's
a bit hacky, but afaict should work?

The easier (but tricky) fix for that is to call smgrclose() for
each SMgrRelation in the reverse order. That is, we should do

for (i = ndelrels - 1; i >= 0; i--)
smgrclose(srels[i]);

instead of

for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);

We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned. I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

Greetings,

Andres Freund

#19Fujii Masao
masao.fujii@gmail.com
In reply to: Teodor Sigaev (#15)
Re: Speedup of relation deletes during recovery

On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very
significantly.

+1, we faced with that too

+1 to back-patch. As Horiguchi-san pointed out, this is basically
the fix for oversight of commit 279628a0a7, not new feature.

Regards,

--
Fujii Masao

#20Thomas Munro
thomas.munro@gmail.com
In reply to: Fujii Masao (#19)
Re: Speedup of relation deletes during recovery

On Tue, Jun 19, 2018 at 6:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot. That means it's very
easy to get into situations where the standy starts to lag behind very
significantly.

+1, we faced with that too

+1 to back-patch. As Horiguchi-san pointed out, this is basically
the fix for oversight of commit 279628a0a7, not new feature.

+1

--
Thomas Munro
http://www.enterprisedb.com

#21Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#18)
#22Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#22)
#24Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#23)
#25Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#24)
#26Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#26)
#28Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#28)
#30Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#29)
#31Fujii Masao
masao.fujii@gmail.com
In reply to: Thomas Munro (#26)
#32Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#31)
#33Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#33)
#35Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#34)
#36Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#35)
#37Jamison, Kirk
k.jamison@jp.fujitsu.com
In reply to: Fujii Masao (#9)
#38Fujii Masao
masao.fujii@gmail.com
In reply to: Jamison, Kirk (#37)