Performance degradation on concurrent COPY into a single relation in PG16.
Hi all,
While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
PG15: nclients = 2, execution time = 9.319
PG15: nclients = 4, execution time = 5.872
PG15: nclients = 8, execution time = 3.773
PG15: nclients = 16, execution time = 3.202
PG15: nclients = 32, execution time = 3.023
PG15: nclients = 64, execution time = 3.829
PG15: nclients = 128, execution time = 4.111
PG15: nclients = 256, execution time = 4.158
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
PG16: nclients = 2, execution time = 14.084
PG16: nclients = 4, execution time = 27.997
PG16: nclients = 8, execution time = 10.554
PG16: nclients = 16, execution time = 7.074
PG16: nclients = 32, execution time = 4.607
PG16: nclients = 64, execution time = 2.093
PG16: nclients = 128, execution time = 2.141
PG16: nclients = 256, execution time = 2.202
PG16 has better scalability (at 64 clients and above), but it took much
more time than PG15, especially at 1 - 16 clients.
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though it
didn't have better scalability:
PG16: nclients = 1, execution time = 17.444
PG16: nclients = 2, execution time = 10.690
PG16: nclients = 4, execution time = 7.010
PG16: nclients = 8, execution time = 4.282
PG16: nclients = 16, execution time = 3.373
PG16: nclients = 32, execution time = 3.205
PG16: nclients = 64, execution time = 3.705
PG16: nclients = 128, execution time = 4.196
PG16: nclients = 256, execution time = 4.201
While investigating the cause, I found an interesting fact: if
mdzeroextend uses only one of FileFallocate() and FileZero(), we get
better numbers. For example, if I always use FileZero() with the
following change:
@@ -574,7 +574,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
* that decision should be made though? For now just use a cutoff of
* 8, anything between 4 and 8 worked OK in some local testing.
*/
- if (numblocks > 8)
+ if (false)
{
int ret;
I got:
PG16: nclients = 1, execution time = 16.898
PG16: nclients = 2, execution time = 8.740
PG16: nclients = 4, execution time = 4.656
PG16: nclients = 8, execution time = 2.733
PG16: nclients = 16, execution time = 2.021
PG16: nclients = 32, execution time = 1.693
PG16: nclients = 64, execution time = 1.742
PG16: nclients = 128, execution time = 2.180
PG16: nclients = 256, execution time = 2.296
After further investigation, the performance degradation comes from
alternating between posix_fallocate() (called via FileFallocate()) and
pwritev() (called via FileZero()), depending on how many blocks we
extend by. And it happens only on the xfs filesystem. Does anyone
observe a similar performance issue with the attached benchmark
script?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment: test.sh
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation comes from
calling posix_fallocate() (called via FileFallocate()) and pwritev()
(called via FileZero) alternatively depending on how many blocks we
extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
posix_fallocate() and pwrite() calls makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment: test.c
On 03/07/2023 05:59, Masahiko Sawada wrote:
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation comes from
alternating between posix_fallocate() (called via FileFallocate()) and
pwritev() (called via FileZero()), depending on how many blocks we
extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
posix_fallocate() and pwrite() calls makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
This must be highly dependent on the underlying OS and filesystem. I'm
not seeing that effect on my laptop:
/data$ time /tmp/test test.0 0
total 200000
fallocate 0
filewrite 200000
real 0m1.856s
user 0m0.140s
sys 0m1.688s
/data$ time /tmp/test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.335s
user 0m0.156s
sys 0m1.179s
/data$ time /tmp/test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m2.159s
user 0m0.165s
sys 0m1.880s
/data$ uname -a
Linux heikkilaptop 6.0.0-6-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.12-1
(2022-12-09) x86_64 GNU/Linux
/data is an nvme drive with ext4 filesystem.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Mon, Jul 3, 2023 at 4:36 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/07/2023 05:59, Masahiko Sawada wrote:
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation comes from
alternating between posix_fallocate() (called via FileFallocate()) and
pwritev() (called via FileZero()), depending on how many blocks we
extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
posix_fallocate() and pwrite() calls makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
This must be highly dependent on the underlying OS and filesystem.
Right. The above results were with the file created on an xfs
filesystem. The kernel version and the xfs filesystem version are:
% uname -rms
Linux 4.18.0-372.9.1.el8.x86_64 x86_64
% sudo xfs_db -r /dev/nvme4n1p2
xfs_db> version
versionnum [0xb4b5+0x18a] =
V5,NLINK,DIRV2,ATTR,ALIGN,LOGV2,EXTFLG,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,REFLINK
As far as I tested, it happens only on the xfs filesystem (at least
the above version) and doesn't happen on ext4 and ext3 filesystems.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi Masahiko,
Out of curiosity I've tried it, and it is reproducible as you stated: XFS
@ 4.18.0-425.10.1.el8_7.x86_64:
[root@rockyora ~]# time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m5.868s
user 0m0.035s
sys 0m5.716s
[root@rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m25.858s
user 0m0.108s
sys 0m3.596s
[root@rockyora ~]# time ./test test.3 2
total 200000
fallocate 100000
filewrite 100000
real 0m25.927s
user 0m0.091s
sys 0m3.621s
[root@rockyora ~]# time ./test test.4 1
total 200000
fallocate 200000
filewrite 0
real 0m3.044s
user 0m0.043s
sys 0m2.934s
According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output,
XFS issues sync writes while ext4 does not; xfs looks like a constant
loop of sync writes (D) by kworker/2:1H-kblockd:
[..]
8,0 2 34172 24.115364875 312 D WS 44624928 + 16 [kworker/2:1H]
8,0 2 34173 24.115482679 0 C WS 44624928 + 16 [0]
8,0 2 34174 24.115548251 6501 A WS 42525760 + 16 <- (253,0) 34225216
8,0 2 34175 24.115548660 6501 A WS 44624960 + 16 <- (8,2) 42525760
8,0 2 34176 24.115549111 6501 Q WS 44624960 + 16 [test]
8,0 2 34177 24.115551351 6501 G WS 44624960 + 16 [test]
8,0 2 34178 24.115552111 6501 I WS 44624960 + 16 [test]
8,0 2 34179 24.115559713 312 D WS 44624960 + 16 [kworker/2:1H]
8,0 2 34180 24.115677217 0 C WS 44624960 + 16 [0]
8,0 2 34181 24.115743150 6501 A WS 42525792 + 16 <- (253,0) 34225248
8,0 2 34182 24.115743502 6501 A WS 44624992 + 16 <- (8,2) 42525792
8,0 2 34183 24.115743949 6501 Q WS 44624992 + 16 [test]
8,0 2 34184 24.115746175 6501 G WS 44624992 + 16 [test]
8,0 2 34185 24.115746918 6501 I WS 44624992 + 16 [test]
8,0 2 34186 24.115754492 312 D WS 44624992 + 16 [kworker/2:1H]
So it looks like you are onto something.
Regards,
-J.
Hello,
On 2023-Jul-03, Masahiko Sawada wrote:
While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
This item came up in the RMT meeting. Andres, I think this item belongs
to you, because of commit 00d1e02be2.
The regression seems serious enough at low client counts:
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
PG15: nclients = 2, execution time = 9.319
PG15: nclients = 4, execution time = 5.872
PG15: nclients = 8, execution time = 3.773
PG15: nclients = 16, execution time = 3.202
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
PG16: nclients = 2, execution time = 14.084
PG16: nclients = 4, execution time = 27.997
PG16: nclients = 8, execution time = 10.554
PG16: nclients = 16, execution time = 7.074
So the fact that the speed has clearly gone up at larger client counts
is not an excuse for not getting it fixed, XFS-specificity
notwithstanding.
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though it
didn't have better scalability:
PG16: nclients = 1, execution time = 17.444
PG16: nclients = 2, execution time = 10.690
PG16: nclients = 4, execution time = 7.010
PG16: nclients = 8, execution time = 4.282
PG16: nclients = 16, execution time = 3.373
Well, these numbers are better, but they still look worse than PG15.
I suppose there are other commits that share blame.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)
Hi,
On 2023-07-10 15:25:41 +0200, Alvaro Herrera wrote:
On 2023-Jul-03, Masahiko Sawada wrote:
While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
This item came up in the RMT meeting. Andres, I think this item belongs
to you, because of commit 00d1e02be2.
I'll take a look - I wasn't even aware of this thread until now.
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
Gah, RHEL with its frankenkernels, the bane of my existence.
FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
you tested this on RHEL9 as well by any chance?
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:59:38 +0900, Masahiko Sawada wrote:
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation comes from
alternating between posix_fallocate() (called via FileFallocate()) and
pwritev() (called via FileZero()), depending on how many blocks we
extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
posix_fallocate() and pwrite() calls makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
On an xfs filesystem, with a very recent kernel:
time /tmp/msw_test /srv/dev/fio/msw 0
total 200000
fallocate 0
filewrite 200000
real 0m0.456s
user 0m0.017s
sys 0m0.439s
time /tmp/msw_test /srv/dev/fio/msw 1
total 200000
fallocate 200000
filewrite 0
real 0m0.141s
user 0m0.010s
sys 0m0.131s
time /tmp/msw_test /srv/dev/fio/msw 2
total 200000
fallocate 100000
filewrite 100000
real 0m0.297s
user 0m0.017s
sys 0m0.280s
So I don't think I can reproduce your problem on that system...
I also tried adding a fdatasync() into the loop, but that just made things
uniformly slow.
I guess I'll try to dig up whether this is a problem in older upstream
kernels, or whether it's been introduced in RHEL.
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried and it is reproducible as you have stated : XFS
@ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output,
XFS issues sync writes while ext4 does not; xfs looks like a constant
loop of sync writes (D) by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
One thing that confuses me is that Sawada-san observed a regression at a
single client - yet from what I can tell, there's actually not a single
fallocate() being generated in that case, because the table is so narrow that
we never end up extending by a sufficient number of blocks in
RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.
I don't yet have a RHEL8 system at hand, could either of you send the result
of a
strace -c -p $pid_of_backend_doing_copy
Greetings,
Andres Freund
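Andres's "memorialize" idea above could be sketched as a sticky extension size: once a relation has been extended past the fallocate cutoff, keep extending by at least that much, so the kernel never sees the workload flip-flop between posix_fallocate() and pwritev(). This is only a sketch of the idea under discussion, not a patch; choose_extend_by and prev_extend_by are illustrative names, not PostgreSQL APIs.

```c
#include <assert.h>

/* Cutoff above which mdzeroextend() uses FileFallocate(); mirrors the
 * "numblocks > 8" test quoted earlier in the thread. */
#define FALLOCATE_CUTOFF 8

/* In a real design this would live in per-relation state, not a global. */
static int prev_extend_by = 0;

/*
 * Sketch of the "remember that we bulk extended before" idea: once an
 * extension has exceeded the fallocate cutoff, never drop back below
 * the previous extension size, so the relation is extended either
 * always via fallocate() or always via writes, not alternately.
 */
static int
choose_extend_by(int requested_blocks)
{
    int     extend_by = requested_blocks;

    if (prev_extend_by > FALLOCATE_CUTOFF && extend_by < prev_extend_by)
        extend_by = prev_extend_by;

    prev_extend_by = extend_by;
    return extend_by;
}
```

With this policy, a relation that once extended by 16 blocks keeps extending by at least 16, which avoids the fallocate()/pwritev() ping-pong that the blktrace output shows.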
On Tue, Jul 11, 2023 at 12:34 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
Gah, RHEL with its frankenkernels, the bane of my existence.
FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
you tested this on RHEL9 as well by any chance?
I've tested this on RHEL9 (m5.24xlarge; 96vCPUs, 384GB RAM), and it
seems to be reproducible on RHEL9 too.
$ uname -rms
Linux 5.14.0-284.11.1.el9_2.x86_64 x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
PG15: nclients = 1, execution time = 22.193
PG15: nclients = 2, execution time = 12.430
PG15: nclients = 4, execution time = 8.677
PG15: nclients = 8, execution time = 6.144
PG15: nclients = 16, execution time = 5.938
PG15: nclients = 32, execution time = 5.775
PG15: nclients = 64, execution time = 5.928
PG15: nclients = 128, execution time = 6.346
PG15: nclients = 256, execution time = 6.641
PG16: nclients = 1, execution time = 24.601
PG16: nclients = 2, execution time = 27.845
PG16: nclients = 4, execution time = 40.575
PG16: nclients = 8, execution time = 24.379
PG16: nclients = 16, execution time = 15.835
PG16: nclients = 32, execution time = 8.370
PG16: nclients = 64, execution time = 4.042
PG16: nclients = 128, execution time = 2.956
PG16: nclients = 256, execution time = 2.591
Tests with test.c program:
$ rm -f test.data; time ./test test.data 0
total 200000
fallocate 0
filewrite 200000
real 0m0.709s
user 0m0.057s
sys 0m0.649s
$ rm -f test.data; time ./test test.data 1
total 200000
fallocate 200000
filewrite 0
real 0m0.853s
user 0m0.058s
sys 0m0.791s
$ rm -f test.data; time ./test test.data 2
total 200000
fallocate 100000
filewrite 100000
real 2m10.002s
user 0m0.366s
sys 0m6.649s
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jul 11, 2023 at 1:24 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried and it is reproducible as you have stated : XFS
@ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output,
XFS issues sync writes while ext4 does not; xfs looks like a constant
loop of sync writes (D) by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
One thing that confuses me is that Sawada-san observed a regression at a
single client - yet from what I can tell, there's actually not a single
fallocate() being generated in that case, because the table is so narrow that
we never end up extending by a sufficient number of blocks in
RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.
I don't yet have a RHEL8 system at hand, could either of you send the result
of a
strace -c -p $pid_of_backend_doing_copy
Here are the results:
* PG16: nclients = 1, execution time = 23.758
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
53.18 0.308675 0 358898 pwrite64
33.92 0.196900 2 81202 pwritev
7.78 0.045148 0 81378 lseek
5.06 0.029371 2 11141 read
0.04 0.000250 2 91 openat
0.02 0.000094 1 89 close
0.00 0.000000 0 1 munmap
0.00 0.000000 0 84 brk
0.00 0.000000 0 1 sendto
0.00 0.000000 0 2 1 recvfrom
0.00 0.000000 0 2 kill
0.00 0.000000 0 1 futex
0.00 0.000000 0 1 epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00 0.580438 1 532891 1 total
* PG16: nclients = 2, execution time = 18.045
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.19 0.218721 1 187803 pwrite64
29.24 0.118002 2 40242 pwritev
6.23 0.025128 0 41239 lseek
5.03 0.020285 2 9133 read
2.04 0.008229 9 855 fallocate
1.28 0.005151 0 5598 1000 futex
1.12 0.004516 1 3965 kill
0.78 0.003141 1 3058 1 epoll_wait
0.06 0.000224 2 100 openat
0.03 0.000136 1 98 close
0.01 0.000050 0 84 brk
0.00 0.000013 0 22 setitimer
0.00 0.000006 0 15 1 rt_sigreturn
0.00 0.000002 2 1 munmap
0.00 0.000002 2 1 sendto
0.00 0.000002 0 3 2 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 0.403608 1 292217 1004 total
* PG16: nclients = 4, execution time = 18.807
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
64.61 0.240171 2 93868 pwrite64
15.11 0.056173 4 11337 pwritev
7.29 0.027100 7 3465 fallocate
4.05 0.015048 2 5179 read
3.55 0.013188 0 14941 lseek
2.65 0.009858 1 8675 2527 futex
1.76 0.006536 1 4190 kill
0.88 0.003268 1 2199 epoll_wait
0.06 0.000213 2 101 openat
0.03 0.000130 1 99 close
0.01 0.000031 1 18 rt_sigreturn
0.01 0.000027 1 17 setitimer
0.00 0.000000 0 1 munmap
0.00 0.000000 0 84 brk
0.00 0.000000 0 1 sendto
0.00 0.000000 0 1 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 0.371743 2 144176 2527 total
* PG16: nclients = 8, execution time = 11.982
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
73.16 0.180095 3 47895 pwrite64
8.61 0.021194 5 4199 pwritev
5.93 0.014598 6 2199 fallocate
3.42 0.008425 1 6723 2206 futex
3.18 0.007824 2 3068 read
2.44 0.005995 0 6510 lseek
1.82 0.004475 1 2665 kill
1.27 0.003118 1 1758 2 epoll_wait
0.10 0.000239 2 95 openat
0.06 0.000141 1 93 close
0.01 0.000034 2 16 setitimer
0.01 0.000020 2 10 2 rt_sigreturn
0.00 0.000000 0 1 munmap
0.00 0.000000 0 84 brk
0.00 0.000000 0 1 sendto
0.00 0.000000 0 2 1 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 0.246158 3 75319 2211 total
* PG16: nclients = 16, execution time = 7.507
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
79.45 0.078310 5 14870 pwrite64
5.52 0.005440 5 973 pwritev
4.51 0.004443 6 640 fallocate
3.69 0.003640 1 2884 1065 futex
2.23 0.002200 2 866 read
1.80 0.001775 1 1685 lseek
1.44 0.001421 1 782 kill
1.08 0.001064 2 477 1 epoll_wait
0.13 0.000129 2 57 openat
0.08 0.000078 1 56 close
0.06 0.000055 0 84 brk
0.00 0.000003 3 1 munmap
0.00 0.000003 3 1 sendto
0.00 0.000003 1 2 1 recvfrom
0.00 0.000002 0 5 setitimer
0.00 0.000001 0 3 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 0.098567 4 23386 1068 total
* PG16: nclients = 32, execution time = 4.644
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
88.90 0.147208 12 11571 pwrite64
3.11 0.005151 1 2643 943 futex
2.61 0.004314 4 1039 pwritev
1.74 0.002879 8 327 fallocate
1.21 0.001998 3 624 read
0.90 0.001498 1 1439 lseek
0.66 0.001090 3 358 1 epoll_wait
0.63 0.001049 2 426 kill
0.12 0.000206 3 65 openat
0.07 0.000118 1 64 close
0.03 0.000045 0 84 brk
0.01 0.000011 11 1 munmap
0.00 0.000008 8 1 sendto
0.00 0.000007 3 2 1 recvfrom
0.00 0.000002 0 3 1 rt_sigreturn
0.00 0.000001 0 3 setitimer
------ ----------- ----------- --------- --------- ----------------
100.00 0.165585 8 18650 946 total
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried and it is reproducible as you have stated : XFS
@ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output,
XFS issues sync writes while ext4 does not; xfs looks like a constant
loop of sync writes (D) by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
I haven't seen this thread [1]
(https://www.spinics.net/lists/linux-xfs/msg73035.html, "Question on
slow fallocate", from the XFS mailing list; it was started by
Masahiko) being mentioned here, but I do feel it contains very
important hints, even challenging the whole idea of zeroing out files
(or posix_fallocate()). Please especially see Dave's reply. He also
argues that posix_fallocate() != fallocate(). What's interesting is
that this is by design, and newer kernel versions do not prevent such
behaviour either; see my testing results below.
All I can add is that these kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now, and that I've tested
on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on
that kernel. After installing kernel-ml (6.4.2) the test.c result
seems to be the same (note that it occurs only when first allocating
space; of course it doesn't if the same file is
rewritten/"reallocated"):
[root@rockyora ~]# uname -r
6.4.2-1.el8.elrepo.x86_64
[root@rockyora ~]# time ./test test.0 0
total 200000
fallocate 0
filewrite 200000
real 0m0.405s
user 0m0.006s
sys 0m0.391s
[root@rockyora ~]# time ./test test.0 1
total 200000
fallocate 200000
filewrite 0
real 0m0.137s
user 0m0.005s
sys 0m0.132s
[root@rockyora ~]# time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m0.968s
user 0m0.020s
sys 0m0.928s
[root@rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m6.059s
user 0m0.000s
sys 0m0.788s
[root@rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m0.598s
user 0m0.003s
sys 0m0.225s
[root@rockyora ~]#
iostat -x reports during the first "time ./test test.2 2" (as you can
see, w_await is not that high but it accumulates):
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 0.00 15394.00 0.00 122.02 0.00 13.00 0.00 0.08 0.00 0.05 0.75 0.00 8.12 0.06 100.00
dm-0 0.00 15407.00 0.00 122.02 0.00 0.00 0.00 0.00 0.00 0.06 0.98 0.00 8.11 0.06 100.00
So maybe that's just a hint that you should try on slower storage
instead? (I think that on NVMe this issue would be hardly noticeable
due to low IO latency, not like here)
-J.
Hi,
On 2023-07-11 09:09:43 +0200, Jakub Wartak wrote:
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried and it is reproducible as you have stated : XFS
@ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output,
XFS issues sync writes while ext4 does not; xfs looks like a constant
loop of sync writes (D) by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
I haven't seen this thread [1] ("Question on slow fallocate", from the
XFS mailing list) being mentioned here (it was started by Masahiko),
but I do feel it contains very important hints, even challenging the
whole idea of zeroing out files (or posix_fallocate()). Please
especially see Dave's reply.
I think that's just due to the reproducer being a bit too minimal and the
actual problem being addressed not being explained.
He also argues that posix_fallocate() != fallocate(). What's interesting is
that it's by design and newer kernel versions should not prevent such
behaviour, see my testing result below.
I think the problem there was that I was not targeting a different file
between the different runs, somehow assuming the test program would be
taking care of that.
I don't think the test program actually tests things in a particularly useful
way - it does fallocate()s in 8k chunks - which postgres never does.
All I can add is that these kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now, and that I've tested
on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on
that kernel. After installing kernel-ml (6.4.2) the test.c result
seems to be the same (note that it occurs only when first allocating
space; of course it doesn't if the same file is
rewritten/"reallocated"):
test.c really doesn't reproduce postgres behaviour in any meaningful way,
using it as a benchmark is a bad idea.
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though it
didn't have better scalability:
PG16: nclients = 1, execution time = 17.444
I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.
1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY FROM
This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?
2) pg_strtoint32_safe() got substantially slower, mainly due
to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals
pinned to one cpu, turbo mode disabled, I get the following best-of-three times for
copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)
15:
6281.107 ms
HEAD:
7000.469 ms
backing out 9f8377f7a27:
6433.516 ms
also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms
I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().
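One way regressions of this kind are sometimes mitigated is to keep a tight fast path for plain decimal input and only fall back to a full-featured parser for the new syntax. The following is purely an illustrative sketch, not the committed fix; parse_plain_int32 is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical mitigation sketch: parse plain decimal integers in a
 * tight loop, and return false so the caller can fall back to the full
 * parser for anything fancier (underscores, 0x/0o/0b prefixes,
 * whitespace, detailed error reporting).
 */
static bool
parse_plain_int32(const char *s, int32_t *result)
{
    long long   v = 0;
    bool        neg = false;

    if (*s == '-')
    {
        neg = true;
        s++;
    }
    else if (*s == '+')
        s++;

    if (*s < '0' || *s > '9')
        return false;           /* not a plain digit: fall back */

    for (; *s >= '0' && *s <= '9'; s++)
    {
        v = v * 10 + (*s - '0');
        if (v > (long long) INT32_MAX + 1)
            return false;       /* overflow: let full parser report it */
    }

    if (*s != '\0')
        return false;           /* underscore, hex digit, etc.: fall back */

    if (neg)
        v = -v;
    if (v < INT32_MIN || v > INT32_MAX)
        return false;
    *result = (int32_t) v;
    return true;
}
```

The common case (all-decimal input) then takes only the few branches of the loop above, while the rarely-used literal forms pay the cost of the slower path.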
For higher concurrency numbers, I was now able to reproduce the regression, to
a smaller degree. Much smaller after fixing the above. The reason we run into
the issue here is basically that the rows in the test are very narrow and reach
#define MAX_BUFFERED_TUPLES 1000
at a small number of pages, so we go back and forth between extending with
fallocate() and not.
I'm *not* saying that that is the solution, but after changing that to 5000,
the numbers look a lot better (with the other regressions "worked around"):
(this is again with turboboost disabled, to get more reproducible numbers)
clients 1 2 4 8 16 32
15,buffered=1000 25725 13211 9232 5639 4862 4700
15,buffered=5000 26107 14550 8644 6050 4943 4766
HEAD+fixes,buffered=1000 25875 14505 8200 4900 3565 3433
HEAD+fixes,buffered=5000 25830 12975 6527 3594 2739 2642
Greetings,
Andres Freund
[1]: /messages/by-id/CAD21AoAEwHTLYhuQ6PaBRPXKWN-CgW9iw+4hm=2EOFXbJQ3tOg@mail.gmail.com
On Wed, Jul 12, 2023 at 3:52 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though it
didn't have better scalability:
PG16: nclients = 1, execution time = 17.444
I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.
Right.
1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY FROM
This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?
Yeah, it seems to me that we can palloc the bool array once and use it
for the entire COPY FROM. With the attached small patch, the
performance becomes much better:
15:
14.70500 sec
16:
17.42900 sec
16 w/ patch:
14.85600 sec
2) pg_strtoint32_safe() got substantially slower, mainly due
to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals
Agreed.
pinned to one cpu, turbo mode disabled, I get the following best-of-three times for
copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)
15:
6281.107 ms
HEAD:
7000.469 ms
backing out 9f8377f7a27:
6433.516 ms
also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms
I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().
I'll look at how to fix 2).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
fix_COPY_DEFAULT.patch
On Wed, Jul 12, 2023 at 5:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'll look at how to fix 2).
I have made some progress on dealing with performance regression on
single client COPY. I've attached a patch to fix 2). With the patch I
shared[1] to deal with 1), single client COPY performance seems to be
now as good as (or slightly better than) PG15. Here are the results
(averages of 5 times) of loading 50M rows via COPY:
15:
7.609 sec
16:
8.637 sec
16 w/ two patches:
7.179 sec
Regards,
[1]: /messages/by-id/CAD21AoBb9Sbddh+nQk1BxUFaRHaa+fL8fCzO-Lvxqb6ZcpAHqw@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
fix_numeric.patch
On Wed, 19 Jul 2023 at 09:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
2) pg_strtoint32_safe() got substantially slower, mainly due
to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals
Agreed.
I have made some progress on dealing with performance regression on
single client COPY. I've attached a patch to fix 2). With the patch I
shared[1] to deal with 1), single client COPY performance seems to be
now as good as (or slightly better than) PG15. Here are the results
(averages of 5 times) of loading 50M rows via COPY:
Hmm, I'm somewhat sceptical about this second patch. It's not obvious
why adding such tests would speed it up, and indeed, testing on my
machine with 50M rows, I see a noticeable speed-up from patch 1, and a
slow-down from patch 2:
PG15
====
7390.461 ms
7497.655 ms
7485.850 ms
7406.336 ms
HEAD
====
8388.707 ms
8283.484 ms
8391.638 ms
8363.306 ms
HEAD + P1
=========
7255.128 ms
7185.319 ms
7197.822 ms
7191.176 ms
HEAD + P2
=========
8687.164 ms
8654.907 ms
8641.493 ms
8668.865 ms
HEAD + P1 + P2
==============
7780.126 ms
7786.427 ms
7775.047 ms
7785.938 ms
So for me at least, just applying patch 1 gives the best results, and
makes it slightly faster than PG15 (possibly due to 6b423ec677).
Regards,
Dean
On Wed, 19 Jul 2023 at 23:14, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
Hmm, I'm somewhat sceptical about this second patch. It's not obvious
why adding such tests would speed it up, and indeed, testing on my
machine with 50M rows, I see a noticeable speed-up from patch 1, and a
slow-down from patch 2:
I noticed that 6fcda9aba added quite a lot of conditions that need to
be checked before we process a normal decimal integer string. I think
we could likely do better and code it to assume that most strings will
be decimal and put that case first rather than last.
In the attached, I've changed that for the 32-bit version only. A
more complete patch would need to do the 16 and 64-bit versions too.
-- setup
create table a (a int);
insert into a select x from generate_series(1,10000000)x;
copy a to '~/a.dump';
-- test
truncate a; copy a from '/tmp/a.dump';
master @ ab29a7a9c
Time: 2152.633 ms (00:02.153)
Time: 2121.091 ms (00:02.121)
Time: 2100.995 ms (00:02.101)
Time: 2101.724 ms (00:02.102)
Time: 2103.949 ms (00:02.104)
master + pg_strtoint32_base_10_first.patch
Time: 2061.464 ms (00:02.061)
Time: 2035.513 ms (00:02.036)
Time: 2028.356 ms (00:02.028)
Time: 2043.218 ms (00:02.043)
Time: 2037.035 ms (00:02.037) (~3.6% faster)
Without that, we need to check if the first digit is '0' a total of 3
times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
'b' or 'B'. It seems to be coded with the assumption that hex strings
are the most likely. I think decimals are the most likely by far.
David
Attachments:
pg_strtoint32_base_10_first.patch
On Thu, 20 Jul 2023 at 00:56, David Rowley <dgrowleyml@gmail.com> wrote:
I noticed that 6fcda9aba added quite a lot of conditions that need to
be checked before we process a normal decimal integer string. I think
we could likely do better and code it to assume that most strings will
be decimal and put that case first rather than last.
That sounds sensible, but ...
In the attached, I've changed that for the 32-bit version only. A
more complete patch would need to do the 16 and 64-bit versions too.
Without that, we need to check if the first digit is '0' a total of 3
times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
'b' or 'B'.
That's not what I see. For me, the compiler (gcc 11, using either -O2
or -O3) is smart enough to spot that the first part of each of the 3
checks is the same, and it only tests whether the first digit is '0'
once, before immediately jumping to the decimal parsing code. I didn't
test other compilers though, so I can believe that different compilers
might not be so smart.
OTOH, this test in your patch:
+ /* process decimal digits */
+ if (isdigit((unsigned char) ptr[0]) &&
+     (isdigit((unsigned char) ptr[1]) || ptr[1] == '_' ||
+      ptr[1] == '\0' || isspace(ptr[1])))
is doing more work than it needs to, and actually makes things
noticeably worse for me. It needs to do a minimum of 2 isdigit()
checks before it will parse the input as a decimal, whereas before
(for me at least) it just did one simple comparison of ptr[0] against
'0'.
I agree with the principle though. In the attached updated patch, I
replaced that test with a simpler one:
+ /*
+ * Process the number's digits. We optimize for decimal input (expected to
+ * be the most common case) first. Anything that doesn't start with a base
+ * prefix indicator must be decimal.
+ */
+
+ /* process decimal digits */
+ if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
So hopefully any compiler should only need to do the one comparison
against '0' for any non-zero decimal input.
Doing that didn't give any measurable performance improvement for me,
but it did at least not make it noticeably worse, and seems more
likely to generate better code with not-so-smart compilers. I'd be
interested to know how that performs for you (and if your compiler
really does generate 3 "first digit is '0'" tests for the unpatched
code).
Regards,
Dean
---
Here were my test results (where P1 is the "fix_COPY_DEFAULT.patch"),
and I tested COPY loading 50M rows:
HEAD + P1
=========
7137.966 ms
7193.190 ms
7094.491 ms
7123.520 ms
HEAD + P1 + pg_strtoint32_base_10_first.patch
=============================================
7561.658 ms
7548.282 ms
7551.360 ms
7560.671 ms
HEAD + P1 + pg_strtoint32_base_10_first.v2.patch
================================================
7238.775 ms
7184.937 ms
7123.257 ms
7159.618 ms