Performance degradation on concurrent COPY into a single relation in PG16.
Hi all,
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
PG15: nclients = 2, execution time = 9.319
PG15: nclients = 4, execution time = 5.872
PG15: nclients = 8, execution time = 3.773
PG15: nclients = 16, execution time = 3.202
PG15: nclients = 32, execution time = 3.023
PG15: nclients = 64, execution time = 3.829
PG15: nclients = 128, execution time = 4.111
PG15: nclients = 256, execution time = 4.158
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
PG16: nclients = 2, execution time = 14.084
PG16: nclients = 4, execution time = 27.997
PG16: nclients = 8, execution time = 10.554
PG16: nclients = 16, execution time = 7.074
PG16: nclients = 32, execution time = 4.607
PG16: nclients = 64, execution time = 2.093
PG16: nclients = 128, execution time = 2.141
PG16: nclients = 256, execution time = 2.202
PG16 scales better at high client counts (64 and more), but it takes
much longer than PG15, especially at 1 - 16 clients.
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though without
the improved scalability:
PG16: nclients = 1, execution time = 17.444
PG16: nclients = 2, execution time = 10.690
PG16: nclients = 4, execution time = 7.010
PG16: nclients = 8, execution time = 4.282
PG16: nclients = 16, execution time = 3.373
PG16: nclients = 32, execution time = 3.205
PG16: nclients = 64, execution time = 3.705
PG16: nclients = 128, execution time = 4.196
PG16: nclients = 256, execution time = 4.201
While investigating the cause, I found an interesting fact: if
mdzeroextend uses only one of FileFallocate() and FileZero(), we get
better numbers. For example, if I always use FileZero() with the
following change:
@@ -574,7 +574,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
* that decision should be made though? For now just use a cutoff of
* 8, anything between 4 and 8 worked OK in some local testing.
*/
- if (numblocks > 8)
+ if (false)
{
int ret;
I got:
PG16: nclients = 1, execution time = 16.898
PG16: nclients = 2, execution time = 8.740
PG16: nclients = 4, execution time = 4.656
PG16: nclients = 8, execution time = 2.733
PG16: nclients = 16, execution time = 2.021
PG16: nclients = 32, execution time = 1.693
PG16: nclients = 64, execution time = 1.742
PG16: nclients = 128, execution time = 2.180
PG16: nclients = 256, execution time = 2.296
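For context, the logic that the diff above toggles sits in mdzeroextend()
(src/backend/storage/smgr/md.c); roughly sketched below, omitting the
segment loop and error handling that the real function has:

/*
 * Paraphrased sketch of mdzeroextend() as of PG16, not the exact code.
 * Depending on how many blocks are requested, extension goes through
 * either posix_fallocate() or a zeroing pwritev().
 */
if (numblocks > 8)
{
    /* large extensions: reserve space without writing zeroes */
    ret = FileFallocate(v->mdfd_vfd, seekpos,
                        (off_t) BLCKSZ * numblocks,
                        WAIT_EVENT_DATA_FILE_EXTEND);
}
else
{
    /* small extensions: write out zeroed pages */
    ret = FileZero(v->mdfd_vfd, seekpos,
                   (off_t) BLCKSZ * numblocks,
                   WAIT_EVENT_DATA_FILE_EXTEND);
}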
After further investigation, the performance degradation turns out to
come from alternating between posix_fallocate() (called via
FileFallocate()) and pwritev() (called via FileZero()), depending on how
many blocks we extend by. And it happens only on the xfs filesystem.
Does anyone observe a similar performance issue with the attached
benchmark script?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation comes from
calling posix_fallocate() (called via FileFallocate()) and pwritev()
(called via FileZero) alternatively depending on how many blocks we
extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
between posix_fallocate() and pwrite() makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
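The program is essentially the following (a minimal sketch reconstructed
from the output above; the attached original may differ in details such
as open() flags and error handling). Mode 0 uses pwrite() only, mode 1
posix_fallocate() only, and mode 2 alternates between the two:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NLOOPS  200000
#define BLOCKSZ 8192

int
main(int argc, char **argv)
{
    char    buf[BLOCKSZ] = {0};
    long    nfallocate = 0;
    long    nfilewrite = 0;
    off_t   off = 0;
    int     mode;
    int     fd;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file> <mode>\n", argv[0]);
        return 1;
    }
    mode = atoi(argv[2]);

    fd = open(argv[1], O_RDWR | O_CREAT, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    for (long i = 0; i < NLOOPS; i++)
    {
        if (mode == 1 || (mode == 2 && i % 2 == 0))
        {
            /* extend the file by one 8kB chunk without writing data */
            int     ret = posix_fallocate(fd, off, BLOCKSZ);

            if (ret != 0)
            {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(ret));
                return 1;
            }
            nfallocate++;
        }
        else
        {
            /* extend the file by writing one 8kB chunk of zeroes */
            if (pwrite(fd, buf, BLOCKSZ, off) != BLOCKSZ)
            {
                perror("pwrite");
                return 1;
            }
            nfilewrite++;
        }
        off += BLOCKSZ;
    }

    printf("total %ld\n", nfallocate + nfilewrite);
    printf("fallocate %ld\n", nfallocate);
    printf("filewrite %ld\n", nfilewrite);
    close(fd);
    return 0;
}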
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 03/07/2023 05:59, Masahiko Sawada wrote:
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation turns out to
come from alternating between posix_fallocate() (called via
FileFallocate()) and pwritev() (called via FileZero()), depending on how
many blocks we extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
between posix_fallocate() and pwrite() makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
This must be highly dependent on the underlying OS and filesystem. I'm
not seeing that effect on my laptop:
/data$ time /tmp/test test.0 0
total 200000
fallocate 0
filewrite 200000
real 0m1.856s
user 0m0.140s
sys 0m1.688s
/data$ time /tmp/test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.335s
user 0m0.156s
sys 0m1.179s
/data$ time /tmp/test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m2.159s
user 0m0.165s
sys 0m1.880s
/data$ uname -a
Linux heikkilaptop 6.0.0-6-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.12-1
(2022-12-09) x86_64 GNU/Linux
/data is an nvme drive with ext4 filesystem.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Mon, Jul 3, 2023 at 4:36 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/07/2023 05:59, Masahiko Sawada wrote:
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation turns out to
come from alternating between posix_fallocate() (called via
FileFallocate()) and pwritev() (called via FileZero()), depending on how
many blocks we extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
between posix_fallocate() and pwrite() makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
This must be highly dependent on the underlying OS and filesystem.
Right. The above results are from a file created on the xfs filesystem.
The kernel and xfs filesystem versions are:
% uname -rms
Linux 4.18.0-372.9.1.el8.x86_64 x86_64
% sudo xfs_db -r /dev/nvme4n1p2
xfs_db> version
versionnum [0xb4b5+0x18a] =
V5,NLINK,DIRV2,ATTR,ALIGN,LOGV2,EXTFLG,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,REFLINK
As far as I tested, it happens only on the xfs filesystem (at least
the above version) and doesn't happen on ext4 and ext3 filesystems.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi Masahiko,
Out of curiosity I've tried it, and it is reproducible as you stated, on
XFS @ 4.18.0-425.10.1.el8_7.x86_64:
[root@rockyora ~]# time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m5.868s
user 0m0.035s
sys 0m5.716s
[root@rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m25.858s
user 0m0.108s
sys 0m3.596s
[root@rockyora ~]# time ./test test.3 2
total 200000
fallocate 100000
filewrite 100000
real 0m25.927s
user 0m0.091s
sys 0m3.621s
[root@rockyora ~]# time ./test test.4 1
total 200000
fallocate 200000
filewrite 0
real 0m3.044s
user 0m0.043s
sys 0m2.934s
According to iostat and "blktrace -d /dev/sda -o - | blkparse -i -"
output, XFS issues sync writes while ext4 does not; xfs looks like a
constant loop of synchronous writes (D = dispatched to the device,
WS = synchronous write) issued by kworker/2:1H-kblockd:
[..]
8,0  2  34172  24.115364875   312  D  WS 44624928 + 16 [kworker/2:1H]
8,0  2  34173  24.115482679     0  C  WS 44624928 + 16 [0]
8,0  2  34174  24.115548251  6501  A  WS 42525760 + 16 <- (253,0) 34225216
8,0  2  34175  24.115548660  6501  A  WS 44624960 + 16 <- (8,2) 42525760
8,0  2  34176  24.115549111  6501  Q  WS 44624960 + 16 [test]
8,0  2  34177  24.115551351  6501  G  WS 44624960 + 16 [test]
8,0  2  34178  24.115552111  6501  I  WS 44624960 + 16 [test]
8,0  2  34179  24.115559713   312  D  WS 44624960 + 16 [kworker/2:1H]
8,0  2  34180  24.115677217     0  C  WS 44624960 + 16 [0]
8,0  2  34181  24.115743150  6501  A  WS 42525792 + 16 <- (253,0) 34225248
8,0  2  34182  24.115743502  6501  A  WS 44624992 + 16 <- (8,2) 42525792
8,0  2  34183  24.115743949  6501  Q  WS 44624992 + 16 [test]
8,0  2  34184  24.115746175  6501  G  WS 44624992 + 16 [test]
8,0  2  34185  24.115746918  6501  I  WS 44624992 + 16 [test]
8,0  2  34186  24.115754492   312  D  WS 44624992 + 16 [kworker/2:1H]
So it looks like you are onto something.
Regards,
-J.
Hello,
On 2023-Jul-03, Masahiko Sawada wrote:
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
This item came up in the RMT meeting. Andres, I think this item belongs
to you, because of commit 00d1e02be2.
The regression seems serious enough at low client counts:
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
PG15: nclients = 2, execution time = 9.319
PG15: nclients = 4, execution time = 5.872
PG15: nclients = 8, execution time = 3.773
PG15: nclients = 16, execution time = 3.202
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
PG16: nclients = 2, execution time = 14.084
PG16: nclients = 4, execution time = 27.997
PG16: nclients = 8, execution time = 10.554
PG16: nclients = 16, execution time = 7.074
So the fact that the speed has clearly gone up at larger client counts
is not an excuse for not getting it fixed, XFS-specificity
notwithstanding.
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though without
the improved scalability:
PG16: nclients = 1, execution time = 17.444
PG16: nclients = 2, execution time = 10.690
PG16: nclients = 4, execution time = 7.010
PG16: nclients = 8, execution time = 4.282
PG16: nclients = 16, execution time = 3.373
Well, these numbers are better, but they still look worse than PG15.
I suppose there are other commits that share blame.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)
Hi,
On 2023-07-10 15:25:41 +0200, Alvaro Herrera wrote:
On 2023-Jul-03, Masahiko Sawada wrote:
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
This item came up in the RMT meeting. Andres, I think this item belongs
to you, because of commit 00d1e02be2.
I'll take a look - I wasn't even aware of this thread until now.
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
Gah, RHEL with its frankenkernels, the bane of my existence.
FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
you tested this on RHEL9 as well by any chance?
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:59:38 +0900, Masahiko Sawada wrote:
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After further investigation, the performance degradation turns out to
come from alternating between posix_fallocate() (called via
FileFallocate()) and pwritev() (called via FileZero()), depending on how
many blocks we extend by. And it happens only on the xfs filesystem.
FYI, the attached simple C program demonstrates that alternating
between posix_fallocate() and pwrite() makes posix_fallocate() slow:
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
On an xfs filesystem, with a very recent kernel:
time /tmp/msw_test /srv/dev/fio/msw 0
total 200000
fallocate 0
filewrite 200000
real 0m0.456s
user 0m0.017s
sys 0m0.439s
time /tmp/msw_test /srv/dev/fio/msw 1
total 200000
fallocate 200000
filewrite 0
real 0m0.141s
user 0m0.010s
sys 0m0.131s
time /tmp/msw_test /srv/dev/fio/msw 2
total 200000
fallocate 100000
filewrite 100000
real 0m0.297s
user 0m0.017s
sys 0m0.280s
So I don't think I can reproduce your problem on that system...
I also tried adding a fdatasync() into the loop, but that just made things
uniformly slow.
I guess I'll try to dig up whether this is a problem in older upstream
kernels, or whether it's been introduced in RHEL.
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried it, and it is reproducible as you stated, on
XFS @ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and "blktrace -d /dev/sda -o - | blkparse -i -"
output, XFS issues sync writes while ext4 does not; xfs looks like a
constant loop of synchronous writes (D) issued by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
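Something along these lines, purely as an illustration - the names here
are invented, and the remembered size would really need to live
per-relation (e.g. at the smgr level):

#include <stdint.h>

/*
 * Sketch of the "memorialize" idea: never extend by less than we
 * previously extended by in bulk, so mdzeroextend() stops flipping
 * between FileFallocate() and FileZero().
 */
typedef struct ExtendHistory
{
    uint32_t    last_extend_by; /* blocks used by the previous extension */
} ExtendHistory;

static uint32_t
choose_extend_by(ExtendHistory *hist, uint32_t needed)
{
    uint32_t    extend_by = needed;

    /* keep extending by at least as much as last time */
    if (extend_by < hist->last_extend_by)
        extend_by = hist->last_extend_by;

    hist->last_extend_by = extend_by;
    return extend_by;
}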
One thing that confuses me is that Sawada-san observed a regression at a
single client - yet from what I can tell, there's actually not a single
fallocate() being generated in that case, because the table is so narrow that
we never end up extending by a sufficient number of blocks in
RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.
I don't yet have a RHEL8 system at hand, could either of you send the result
of a
strace -c -p $pid_of_backend_doing_copy
Greetings,
Andres Freund
On Tue, Jul 11, 2023 at 12:34 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
Gah, RHEL with its frankenkernels, the bane of my existence.
FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
you tested this on RHEL9 as well by any chance?
I've tested this on RHEL9 (m5.24xlarge; 96vCPUs, 384GB RAM), and it
seems to be reproducible on RHEL9 too.
$ uname -rms
Linux 5.14.0-284.11.1.el9_2.x86_64 x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
PG15: nclients = 1, execution time = 22.193
PG15: nclients = 2, execution time = 12.430
PG15: nclients = 4, execution time = 8.677
PG15: nclients = 8, execution time = 6.144
PG15: nclients = 16, execution time = 5.938
PG15: nclients = 32, execution time = 5.775
PG15: nclients = 64, execution time = 5.928
PG15: nclients = 128, execution time = 6.346
PG15: nclients = 256, execution time = 6.641
PG16: nclients = 1, execution time = 24.601
PG16: nclients = 2, execution time = 27.845
PG16: nclients = 4, execution time = 40.575
PG16: nclients = 8, execution time = 24.379
PG16: nclients = 16, execution time = 15.835
PG16: nclients = 32, execution time = 8.370
PG16: nclients = 64, execution time = 4.042
PG16: nclients = 128, execution time = 2.956
PG16: nclients = 256, execution time = 2.591
Tests with test.c program:
$ rm -f test.data; time ./test test.data 0
total 200000
fallocate 0
filewrite 200000
real 0m0.709s
user 0m0.057s
sys 0m0.649s
$ rm -f test.data; time ./test test.data 1
total 200000
fallocate 200000
filewrite 0
real 0m0.853s
user 0m0.058s
sys 0m0.791s
$ rm -f test.data; time ./test test.data 2
total 200000
fallocate 100000
filewrite 100000
real 2m10.002s
user 0m0.366s
sys 0m6.649s
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jul 11, 2023 at 1:24 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried it, and it is reproducible as you stated, on
XFS @ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and "blktrace -d /dev/sda -o - | blkparse -i -"
output, XFS issues sync writes while ext4 does not; xfs looks like a
constant loop of synchronous writes (D) issued by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
One thing that confuses me is that Sawada-san observed a regression at a
single client - yet from what I can tell, there's actually not a single
fallocate() being generated in that case, because the table is so narrow that
we never end up extending by a sufficient number of blocks in
RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.
I don't yet have a RHEL8 system at hand, could either of you send the result
of a
strace -c -p $pid_of_backend_doing_copy
Here are the results:
* PG16: nclients = 1, execution time = 23.758
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
53.18 0.308675 0 358898 pwrite64
33.92 0.196900 2 81202 pwritev
7.78 0.045148 0 81378 lseek
5.06 0.029371 2 11141 read
0.04 0.000250 2 91 openat
0.02 0.000094 1 89 close
0.00 0.000000 0 1 munmap
0.00 0.000000 0 84 brk
0.00 0.000000 0 1 sendto
0.00 0.000000 0 2 1 recvfrom
0.00 0.000000 0 2 kill
0.00 0.000000 0 1 futex
0.00 0.000000 0 1 epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00 0.580438 1 532891 1 total
* PG16: nclients = 2, execution time = 18.045
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.19 0.218721 1 187803 pwrite64
29.24 0.118002 2 40242 pwritev
6.23 0.025128 0 41239 lseek
5.03 0.020285 2 9133 read
2.04 0.008229 9 855 fallocate
1.28 0.005151 0 5598 1000 futex
1.12 0.004516 1 3965 kill
0.78 0.003141 1 3058 1 epoll_wait
0.06 0.000224 2 100 openat
0.03 0.000136 1 98 close
0.01 0.000050 0 84 brk
0.00 0.000013 0 22 setitimer
0.00 0.000006 0 15 1 rt_sigreturn
0.00 0.000002 2 1 munmap
0.00 0.000002 2 1 sendto
0.00 0.000002 0 3 2 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 0.403608 1 292217 1004 total
* PG16: nclients = 4, execution time = 18.807
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
64.61 0.240171 2 93868 pwrite64
15.11 0.056173 4 11337 pwritev
7.29 0.027100 7 3465 fallocate
4.05 0.015048 2 5179 read
3.55 0.013188 0 14941 lseek
2.65 0.009858 1 8675 2527 futex
1.76 0.006536 1 4190 kill
0.88 0.003268 1 2199 epoll_wait
0.06 0.000213 2 101 openat
0.03 0.000130 1 99 close
0.01 0.000031 1 18 rt_sigreturn
0.01 0.000027 1 17 setitimer
0.00 0.000000 0 1 munmap
0.00 0.000000 0 84 brk
0.00 0.000000 0 1 sendto
0.00 0.000000 0 1 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 0.371743 2 144176 2527 total
* PG16: nclients = 8, execution time = 11.982
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
73.16 0.180095 3 47895 pwrite64
8.61 0.021194 5 4199 pwritev
5.93 0.014598 6 2199 fallocate
3.42 0.008425 1 6723 2206 futex
3.18 0.007824 2 3068 read
2.44 0.005995 0 6510 lseek
1.82 0.004475 1 2665 kill
1.27 0.003118 1 1758 2 epoll_wait
0.10 0.000239 2 95 openat
0.06 0.000141 1 93 close
0.01 0.000034 2 16 setitimer
0.01 0.000020 2 10 2 rt_sigreturn
0.00 0.000000 0 1 munmap
0.00 0.000000 0 84 brk
0.00 0.000000 0 1 sendto
0.00 0.000000 0 2 1 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 0.246158 3 75319 2211 total
* PG16: nclients = 16, execution time = 7.507
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
79.45 0.078310 5 14870 pwrite64
5.52 0.005440 5 973 pwritev
4.51 0.004443 6 640 fallocate
3.69 0.003640 1 2884 1065 futex
2.23 0.002200 2 866 read
1.80 0.001775 1 1685 lseek
1.44 0.001421 1 782 kill
1.08 0.001064 2 477 1 epoll_wait
0.13 0.000129 2 57 openat
0.08 0.000078 1 56 close
0.06 0.000055 0 84 brk
0.00 0.000003 3 1 munmap
0.00 0.000003 3 1 sendto
0.00 0.000003 1 2 1 recvfrom
0.00 0.000002 0 5 setitimer
0.00 0.000001 0 3 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 0.098567 4 23386 1068 total
* PG16: nclients = 32, execution time = 4.644
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
88.90 0.147208 12 11571 pwrite64
3.11 0.005151 1 2643 943 futex
2.61 0.004314 4 1039 pwritev
1.74 0.002879 8 327 fallocate
1.21 0.001998 3 624 read
0.90 0.001498 1 1439 lseek
0.66 0.001090 3 358 1 epoll_wait
0.63 0.001049 2 426 kill
0.12 0.000206 3 65 openat
0.07 0.000118 1 64 close
0.03 0.000045 0 84 brk
0.01 0.000011 11 1 munmap
0.00 0.000008 8 1 sendto
0.00 0.000007 3 2 1 recvfrom
0.00 0.000002 0 3 1 rt_sigreturn
0.00 0.000001 0 3 setitimer
------ ----------- ----------- --------- --------- ----------------
100.00 0.165585 8 18650 946 total
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried it, and it is reproducible as you stated, on
XFS @ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and "blktrace -d /dev/sda -o - | blkparse -i -"
output, XFS issues sync writes while ext4 does not; xfs looks like a
constant loop of synchronous writes (D) issued by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
I haven't seen this thread [1] ("Question on slow fallocate" on the XFS
mailing list; it was started by Masahiko) being mentioned here, but I do
feel it contains very important hints, even challenging the whole idea
of zeroing out files (or posix_fallocate()). Please especially see
Dave's reply. He also argues that posix_fallocate() != fallocate().
What's interesting is that this is by design, and newer kernel versions
do not prevent such behaviour; see my testing results below.
All I can add is that those kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now, and that I've tested
on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on
that kernel. After installing kernel-ml (6.4.2) the test.c result seems
to be the same (note that it occurs only when first allocating the
space; of course it doesn't if the same file is rewritten/"reallocated"):
[root@rockyora ~]# uname -r
6.4.2-1.el8.elrepo.x86_64
[root@rockyora ~]# time ./test test.0 0
total 200000
fallocate 0
filewrite 200000
real 0m0.405s
user 0m0.006s
sys 0m0.391s
[root@rockyora ~]# time ./test test.0 1
total 200000
fallocate 200000
filewrite 0
real 0m0.137s
user 0m0.005s
sys 0m0.132s
[root@rockyora ~]# time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m0.968s
user 0m0.020s
sys 0m0.928s
[root@rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m6.059s
user 0m0.000s
sys 0m0.788s
[root@rockyora ~]# time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 0m0.598s
user 0m0.003s
sys 0m0.225s
[root@rockyora ~]#
iostat -x reports during the first "time ./test test.2 2" (as you can
see, w_await is not that high, but it accumulates):
Device  r/s       w/s  rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda    0.00  15394.00   0.00  122.02    0.00   13.00   0.00   0.08     0.00     0.05    0.75      0.00      8.12   0.06  100.00
dm-0   0.00  15407.00   0.00  122.02    0.00    0.00   0.00   0.00     0.00     0.06    0.98      0.00      8.11   0.06  100.00
So maybe that's just a hint that you should try on slower storage
instead? (I think that on NVMe this issue would be hardly noticeable
due to low I/O latency, unlike here.)
[1] https://www.spinics.net/lists/linux-xfs/msg73035.html
-J.
Hi,
On 2023-07-11 09:09:43 +0200, Jakub Wartak wrote:
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
Out of curiosity I've tried it, and it is reproducible as you stated, on
XFS @ 4.18.0-425.10.1.el8_7.x86_64:
...
According to iostat and "blktrace -d /dev/sda -o - | blkparse -i -"
output, XFS issues sync writes while ext4 does not; xfs looks like a
constant loop of synchronous writes (D) issued by kworker/2:1H-kblockd:
That clearly won't go well. It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.
I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
I haven't seen this thread [1] ("Question on slow fallocate" on the XFS
mailing list; it was started by Masahiko) being mentioned here, but I do
feel it contains very important hints, even challenging the whole idea
of zeroing out files (or posix_fallocate()). Please especially see
Dave's reply.
I think that's just due to the reproducer being a bit too minimal and the
actual problem being addressed not being explained.
He also argues that posix_fallocate() != fallocate(). What's interesting
is that this is by design, and newer kernel versions do not prevent such
behaviour; see my testing results below.
I think the problem there was that I was not targeting a different file
between the different runs, somehow assuming the test program would be
taking care of that.
I don't think the test program actually tests things in a particularly useful
way - it does fallocate()s in 8k chunks - which postgres never does.
All I can add is that those kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now, and that I've tested
on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on
that kernel. After installing kernel-ml (6.4.2) the test.c result seems
to be the same (note that it occurs only when first allocating the
space; of course it doesn't if the same file is rewritten/"reallocated"):
test.c really doesn't reproduce postgres behaviour in any meaningful
way; using it as a benchmark is a bad idea.
Greetings,
Andres Freund
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though without
the improved scalability:
PG16: nclients = 1, execution time = 17.444
I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.
1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY FROM
This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?
2) pg_strtoint32_safe() got substantially slower, mainly due
to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals
pinned to one cpu, turbo mode disabled, I get the following best-of-three times for
copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)
15:
6281.107 ms
HEAD:
7000.469 ms
backing out 9f8377f7a27:
6433.516 ms
also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms
I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().
For higher concurrency numbers, I was now able to reproduce the regression,
to a smaller degree - much smaller after fixing the above. The reason we run
into the issue here is basically that the rows in the test are very narrow
and reach
#define MAX_BUFFERED_TUPLES 1000
at a small number of pages, so we go back and forth between extending with
fallocate() and not.
I'm *not* saying that that is the solution, but after changing that to 5000,
the numbers look a lot better (with the other regressions "worked around"):
(this is again with turboboost disabled, to get more reproducible numbers)
clients                       1      2      4      8     16     32
15,buffered=1000          25725  13211   9232   5639   4862   4700
15,buffered=5000          26107  14550   8644   6050   4943   4766
HEAD+fixes,buffered=1000  25875  14505   8200   4900   3565   3433
HEAD+fixes,buffered=5000  25830  12975   6527   3594   2739   2642
Greetings,
Andres Freund
[1]: /messages/by-id/CAD21AoAEwHTLYhuQ6PaBRPXKWN-CgW9iw+4hm=2EOFXbJQ3tOg@mail.gmail.com
On Wed, Jul 12, 2023 at 3:52 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though without
the improved scalability:
PG16: nclients = 1, execution time = 17.444
I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.
Right.
1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY FROM
This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?
Yeah, it seems to me that we can palloc the bool array once and use it
for the entire COPY FROM. With the attached small patch, the
performance becomes much better:
15:
14.70500 sec
16:
17.42900 sec
16 w/ patch:
14.85600 sec
2) pg_strtoint32_safe() got substantially slower, mainly due
to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals
Agreed.
pinned to one cpu, turbo mode disabled, I get the following best-of-three times for
copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)
15:
6281.107 ms
HEAD:
7000.469 ms
backing out 9f8377f7a27:
6433.516 ms
also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms
I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().
I'll look at how to fix 2).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment: fix_COPY_DEFAULT.patch
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 80bca79cd0..eb47dea960 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1494,6 +1494,7 @@ BeginCopyFrom(ParseState *pstate,
cstate->cur_attname = NULL;
cstate->cur_attval = NULL;
cstate->relname_only = false;
+ cstate->defaults = (bool *) palloc0(tupDesc->natts * sizeof(bool));
/*
* Allocate buffers for the input pipeline.
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index d2e8c74402..232768a6e1 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -871,7 +871,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
/* Initialize all values for row to NULL */
MemSet(values, 0, num_phys_attrs * sizeof(Datum));
MemSet(nulls, true, num_phys_attrs * sizeof(bool));
- cstate->defaults = (bool *) palloc0(num_phys_attrs * sizeof(bool));
+ MemSet(cstate->defaults, false, num_phys_attrs * sizeof(bool));
if (!cstate->opts.binary)
{
@@ -1040,8 +1040,6 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
&nulls[defmap[i]]);
}
- pfree(cstate->defaults);
-
return true;
}
On Wed, Jul 12, 2023 at 5:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Jul 12, 2023 at 3:52 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time of COPYing 5GB of data in total into a single relation while
varying the number of concurrent clients, on PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):
* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately preceding 00d1e02be2), I got better numbers, though without
the improved scalability:
PG16: nclients = 1, execution time = 17.444
I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.
Right.
1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY FROM
This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?
Yeah, it seems to me that we can palloc the bool array once and use it
for the entire COPY FROM. With the attached small patch, the
performance becomes much better:
15:
14.70500 sec
16:
17.42900 sec
16 w/ patch:
14.85600 sec
2) pg_strtoint32_safe() got substantially slower, mainly due to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals
Agreed.
pinned to one cpu, turbo mode disabled, I get the following best-of-three times for
copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)
15:
6281.107 ms
HEAD:
7000.469 ms
backing out 9f8377f7a27:
6433.516 ms
also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms
I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().
I'll look at how to fix 2).
I have made some progress on dealing with the performance regression on
single client COPY. I've attached a patch to fix 2). With the patch I
shared [1] to deal with 1), single client COPY performance seems to be
now as good as (or slightly better than) PG15. Here are the results
(averages of 5 runs) of loading 50M rows via COPY:
15:
7.609 sec
16:
8.637 sec
16 w/ two patches:
7.179 sec
Regards,
[1]: /messages/by-id/CAD21AoBb9Sbddh+nQk1BxUFaRHaa+fL8fCzO-Lvxqb6ZcpAHqw@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment: fix_numeric.patch
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..54c2b31c66 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -141,6 +141,10 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely(!isxdigit((unsigned char) *ptr) && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (isxdigit((unsigned char) *ptr))
@@ -165,6 +169,10 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely((*ptr < '0' || *ptr > '7') && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (*ptr >= '0' && *ptr <= '7')
@@ -189,6 +197,10 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely((*ptr < '0' || *ptr > '1') && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (*ptr >= '0' && *ptr <= '1')
@@ -213,6 +225,10 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
+ /* require at least one digit */
+ if (unlikely(!isdigit((unsigned char) *ptr)))
+ goto invalid_syntax;
+
while (*ptr)
{
if (isdigit((unsigned char) *ptr))
@@ -319,6 +335,10 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely(!isxdigit((unsigned char) *ptr) && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (isxdigit((unsigned char) *ptr))
@@ -343,6 +363,10 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely((*ptr < '0' || *ptr > '7') && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (*ptr >= '0' && *ptr <= '7')
@@ -367,6 +391,10 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely((*ptr < '0' || *ptr > '1') && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (*ptr >= '0' && *ptr <= '1')
@@ -391,6 +419,10 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
+ /* require at least one digit */
+ if (unlikely(!isdigit((unsigned char) *ptr)))
+ goto invalid_syntax;
+
while (*ptr)
{
if (isdigit((unsigned char) *ptr))
@@ -497,6 +529,10 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely(!isxdigit((unsigned char) *ptr) && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (isxdigit((unsigned char) *ptr))
@@ -521,6 +557,10 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely((*ptr < '0' || *ptr > '7') && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (*ptr >= '0' && *ptr <= '7')
@@ -545,6 +585,10 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
+ /* require at least one digit or underscore */
+ if (unlikely((*ptr < '0' || *ptr > '1') && *ptr != '_'))
+ goto invalid_syntax;
+
while (*ptr)
{
if (*ptr >= '0' && *ptr <= '1')
@@ -569,6 +613,10 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
+ /* require at least one digit */
+ if (unlikely(!isdigit((unsigned char) *ptr)))
+ goto invalid_syntax;
+
while (*ptr)
{
if (isdigit((unsigned char) *ptr))
On Wed, 19 Jul 2023 at 09:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
2) pg_strtoint32_safe() got substantially slower, mainly due to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals
Agreed.
I have made some progress on dealing with the performance regression on
single client COPY. I've attached a patch to fix 2). With the patch I
shared [1] to deal with 1), single client COPY performance seems to be
now as good as (or slightly better than) PG15. Here are the results
(averages of 5 runs) of loading 50M rows via COPY:
Hmm, I'm somewhat sceptical about this second patch. It's not obvious
why adding such tests would speed it up, and indeed, testing on my
machine with 50M rows, I see a noticeable speed-up from patch 1, and a
slow-down from patch 2:
PG15
====
7390.461 ms
7497.655 ms
7485.850 ms
7406.336 ms
HEAD
====
8388.707 ms
8283.484 ms
8391.638 ms
8363.306 ms
HEAD + P1
=========
7255.128 ms
7185.319 ms
7197.822 ms
7191.176 ms
HEAD + P2
=========
8687.164 ms
8654.907 ms
8641.493 ms
8668.865 ms
HEAD + P1 + P2
==============
7780.126 ms
7786.427 ms
7775.047 ms
7785.938 ms
So for me at least, just applying patch 1 gives the best results, and
makes it slightly faster than PG15 (possibly due to 6b423ec677).
Regards,
Dean
On Wed, 19 Jul 2023 at 23:14, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
Hmm, I'm somewhat sceptical about this second patch. It's not obvious
why adding such tests would speed it up, and indeed, testing on my
machine with 50M rows, I see a noticeable speed-up from patch 1, and a
slow-down from patch 2:
I noticed that 6fcda9aba added quite a lot of conditions that need to
be checked before we process a normal decimal integer string. I think
we could likely do better and code it to assume that most strings will
be decimal and put that case first rather than last.
In the attached, I've changed that for the 32-bit version only. A
more complete patch would need to do the 16 and 64-bit versions too.
-- setup
create table a (a int);
insert into a select x from generate_series(1,10000000)x;
copy a to '~/a.dump';
-- test
truncate a; copy a from '/tmp/a.dump';
master @ ab29a7a9c
Time: 2152.633 ms (00:02.153)
Time: 2121.091 ms (00:02.121)
Time: 2100.995 ms (00:02.101)
Time: 2101.724 ms (00:02.102)
Time: 2103.949 ms (00:02.104)
master + pg_strtoint32_base_10_first.patch
Time: 2061.464 ms (00:02.061)
Time: 2035.513 ms (00:02.036)
Time: 2028.356 ms (00:02.028)
Time: 2043.218 ms (00:02.043)
Time: 2037.035 ms (00:02.037) (~3.6% faster)
Without that, we need to check if the first digit is '0' a total of 3
times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
'b' or 'B'. It seems to be coded with the assumption that hex strings
are the most likely. I think decimals are the most likely by far.
David
Attachment: pg_strtoint32_base_10_first.patch
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..60c90f7252 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -314,113 +314,119 @@ pg_strtoint32_safe(const char *s, Node *escontext)
else if (*ptr == '+')
ptr++;
- /* process digits */
- if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
+ /* process decimal digits */
+ if (isdigit((unsigned char) ptr[0]) &&
+ (isdigit((unsigned char) ptr[1]) || ptr[1] == '_' || ptr[1] == '\0' || isspace(ptr[1])))
{
- firstdigit = ptr += 2;
+ firstdigit = ptr;
while (*ptr)
{
- if (isxdigit((unsigned char) *ptr))
+ if (isdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT32_MIN / 16)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 10)))
goto out_of_range;
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ tmp = tmp * 10 + (*ptr++ - '0');
}
else if (*ptr == '_')
{
- /* underscore must be followed by more digits */
+ /* underscore may not be first */
+ if (unlikely(ptr == firstdigit))
+ goto invalid_syntax;
+ /* and it must be followed by more digits */
ptr++;
- if (*ptr == '\0' || !isxdigit((unsigned char) *ptr))
+ if (*ptr == '\0' || !isdigit((unsigned char) *ptr))
goto invalid_syntax;
}
else
break;
}
}
- else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+ /* process hex digits */
+ else if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
{
firstdigit = ptr += 2;
while (*ptr)
{
- if (*ptr >= '0' && *ptr <= '7')
+ if (isxdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT32_MIN / 8)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 16)))
goto out_of_range;
- tmp = tmp * 8 + (*ptr++ - '0');
+ tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
}
else if (*ptr == '_')
{
/* underscore must be followed by more digits */
ptr++;
- if (*ptr == '\0' || *ptr < '0' || *ptr > '7')
+ if (*ptr == '\0' || !isxdigit((unsigned char) *ptr))
goto invalid_syntax;
}
else
break;
}
}
- else if (ptr[0] == '0' && (ptr[1] == 'b' || ptr[1] == 'B'))
+ /* process octal digits */
+ else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
{
firstdigit = ptr += 2;
while (*ptr)
{
- if (*ptr >= '0' && *ptr <= '1')
+ if (*ptr >= '0' && *ptr <= '7')
{
- if (unlikely(tmp > -(PG_INT32_MIN / 2)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 8)))
goto out_of_range;
- tmp = tmp * 2 + (*ptr++ - '0');
+ tmp = tmp * 8 + (*ptr++ - '0');
}
else if (*ptr == '_')
{
/* underscore must be followed by more digits */
ptr++;
- if (*ptr == '\0' || *ptr < '0' || *ptr > '1')
+ if (*ptr == '\0' || *ptr < '0' || *ptr > '7')
goto invalid_syntax;
}
else
break;
}
}
- else
+ /* process binary digits */
+ else if (ptr[0] == '0' && (ptr[1] == 'b' || ptr[1] == 'B'))
{
- firstdigit = ptr;
+ firstdigit = ptr += 2;
while (*ptr)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '1')
{
- if (unlikely(tmp > -(PG_INT32_MIN / 10)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 2)))
goto out_of_range;
- tmp = tmp * 10 + (*ptr++ - '0');
+ tmp = tmp * 2 + (*ptr++ - '0');
}
else if (*ptr == '_')
{
- /* underscore may not be first */
- if (unlikely(ptr == firstdigit))
- goto invalid_syntax;
- /* and it must be followed by more digits */
+ /* underscore must be followed by more digits */
ptr++;
- if (*ptr == '\0' || !isdigit((unsigned char) *ptr))
+ if (*ptr == '\0' || *ptr < '0' || *ptr > '1')
goto invalid_syntax;
}
else
break;
}
}
+ else
+ goto invalid_syntax;
/* require at least one digit */
if (unlikely(ptr == firstdigit))
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
On Thu, 20 Jul 2023 at 00:56, David Rowley <dgrowleyml@gmail.com> wrote:
I noticed that 6fcda9aba added quite a lot of conditions that need to
be checked before we process a normal decimal integer string. I think
we could likely do better and code it to assume that most strings will
be decimal and put that case first rather than last.
That sounds sensible, but ...
In the attached, I've changed that for the 32-bit version only. A
more complete patch would need to do the 16 and 64-bit versions too.
Without that, we need to check if the first digit is '0' a total of 3
times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
'b' or 'B'.
That's not what I see. For me, the compiler (gcc 11, using either -O2
or -O3) is smart enough to spot that the first part of each of the 3
checks is the same, and it only tests whether the first digit is '0'
once, before immediately jumping to the decimal parsing code. I didn't
test other compilers though, so I can believe that different compilers
might not be so smart.
OTOH, this test in your patch:
+ /* process decimal digits */
+ if (isdigit((unsigned char) ptr[0]) &&
+     (isdigit((unsigned char) ptr[1]) || ptr[1] == '_' || ptr[1] == '\0' || isspace(ptr[1])))
is doing more work than it needs to, and actually makes things
noticeably worse for me. It needs to do a minimum of 2 isdigit()
checks before it will parse the input as a decimal, whereas before
(for me at least) it just did one simple comparison of ptr[0] against
'0'.
I agree with the principle though. In the attached updated patch, I
replaced that test with a simpler one:
+ /*
+ * Process the number's digits. We optimize for decimal input (expected to
+ * be the most common case) first. Anything that doesn't start with a base
+ * prefix indicator must be decimal.
+ */
+
+ /* process decimal digits */
+ if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
So hopefully any compiler should only need to do the one comparison
against '0' for any non-zero decimal input.
Doing that didn't give any measurable performance improvement for me,
but it did at least not make it noticeably worse, and seems more
likely to generate better code with not-so-smart compilers. I'd be
interested to know how that performs for you (and if your compiler
really does generate 3 "first digit is '0'" tests for the unpatched
code).
Regards,
Dean
---
Here are my test results (where P1 is the "fix_COPY_DEFAULT.patch"),
testing COPY loading 50M rows:
HEAD + P1
=========
7137.966 ms
7193.190 ms
7094.491 ms
7123.520 ms
HEAD + P1 + pg_strtoint32_base_10_first.patch
=============================================
7561.658 ms
7548.282 ms
7551.360 ms
7560.671 ms
HEAD + P1 + pg_strtoint32_base_10_first.v2.patch
================================================
7238.775 ms
7184.937 ms
7123.257 ms
7159.618 ms
Attachment: pg_strtoint32_base_10_first.v2.patch
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
new file mode 100644
index 471fbb7..e2320e6
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -314,113 +314,124 @@ pg_strtoint32_safe(const char *s, Node *
else if (*ptr == '+')
ptr++;
- /* process digits */
- if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
+ /*
+ * Process the number's digits. We optimize for decimal input (expected to
+ * be the most common case) first. Anything that doesn't start with a base
+ * prefix indicator must be decimal.
+ */
+
+ /* process decimal digits */
+ if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
{
- firstdigit = ptr += 2;
+ firstdigit = ptr;
while (*ptr)
{
- if (isxdigit((unsigned char) *ptr))
+ if (isdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT32_MIN / 16)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 10)))
goto out_of_range;
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ tmp = tmp * 10 + (*ptr++ - '0');
}
else if (*ptr == '_')
{
- /* underscore must be followed by more digits */
+ /* underscore may not be first */
+ if (unlikely(ptr == firstdigit))
+ goto invalid_syntax;
+ /* and it must be followed by more digits */
ptr++;
- if (*ptr == '\0' || !isxdigit((unsigned char) *ptr))
+ if (*ptr == '\0' || !isdigit((unsigned char) *ptr))
goto invalid_syntax;
}
else
break;
}
}
- else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+ /* process hex digits */
+ else if (ptr[1] == 'x' || ptr[1] == 'X')
{
firstdigit = ptr += 2;
while (*ptr)
{
- if (*ptr >= '0' && *ptr <= '7')
+ if (isxdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT32_MIN / 8)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 16)))
goto out_of_range;
- tmp = tmp * 8 + (*ptr++ - '0');
+ tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
}
else if (*ptr == '_')
{
/* underscore must be followed by more digits */
ptr++;
- if (*ptr == '\0' || *ptr < '0' || *ptr > '7')
+ if (*ptr == '\0' || !isxdigit((unsigned char) *ptr))
goto invalid_syntax;
}
else
break;
}
}
- else if (ptr[0] == '0' && (ptr[1] == 'b' || ptr[1] == 'B'))
+ /* process octal digits */
+ else if (ptr[1] == 'o' || ptr[1] == 'O')
{
firstdigit = ptr += 2;
while (*ptr)
{
- if (*ptr >= '0' && *ptr <= '1')
+ if (*ptr >= '0' && *ptr <= '7')
{
- if (unlikely(tmp > -(PG_INT32_MIN / 2)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 8)))
goto out_of_range;
- tmp = tmp * 2 + (*ptr++ - '0');
+ tmp = tmp * 8 + (*ptr++ - '0');
}
else if (*ptr == '_')
{
/* underscore must be followed by more digits */
ptr++;
- if (*ptr == '\0' || *ptr < '0' || *ptr > '1')
+ if (*ptr == '\0' || *ptr < '0' || *ptr > '7')
goto invalid_syntax;
}
else
break;
}
}
- else
+ /* process binary digits */
+ else if (ptr[1] == 'b' || ptr[1] == 'B')
{
- firstdigit = ptr;
+ firstdigit = ptr += 2;
while (*ptr)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '1')
{
- if (unlikely(tmp > -(PG_INT32_MIN / 10)))
+ if (unlikely(tmp > -(PG_INT32_MIN / 2)))
goto out_of_range;
- tmp = tmp * 10 + (*ptr++ - '0');
+ tmp = tmp * 2 + (*ptr++ - '0');
}
else if (*ptr == '_')
{
- /* underscore may not be first */
- if (unlikely(ptr == firstdigit))
- goto invalid_syntax;
- /* and it must be followed by more digits */
+ /* underscore must be followed by more digits */
ptr++;
- if (*ptr == '\0' || !isdigit((unsigned char) *ptr))
+ if (*ptr == '\0' || *ptr < '0' || *ptr > '1')
goto invalid_syntax;
}
else
break;
}
}
+ else
+ goto invalid_syntax;
/* require at least one digit */
if (unlikely(ptr == firstdigit))
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
On Thu, 20 Jul 2023 at 20:37, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
On Thu, 20 Jul 2023 at 00:56, David Rowley <dgrowleyml@gmail.com> wrote:
I agree with the principle though. In the attached updated patch, I
replaced that test with a simpler one:
+ /*
+  * Process the number's digits. We optimize for decimal input (expected to
+  * be the most common case) first. Anything that doesn't start with a base
+  * prefix indicator must be decimal.
+  */
+
+ /* process decimal digits */
+ if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
So hopefully any compiler should only need to do the one comparison
against '0' for any non-zero decimal input.
Doing that didn't give any measurable performance improvement for me,
but it did at least not make it noticeably worse, and seems more
likely to generate better code with not-so-smart compilers. I'd be
interested to know how that performs for you (and if your compiler
really does generate 3 "first digit is '0'" tests for the unpatched
code).
That seems better. I benchmarked it on two machines:
gcc12.2/linux/amd3990x
create table a (a int);
insert into a select x from generate_series(1,10000000)x;
copy a to '/tmp/a.dump';
master @ ab29a7a9c
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2226.912 ms (00:02.227)
Time: 2214.168 ms (00:02.214)
Time: 2206.481 ms (00:02.206)
Time: 2219.640 ms (00:02.220)
Time: 2205.004 ms (00:02.205)
master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2067.108 ms (00:02.067)
Time: 2070.401 ms (00:02.070)
Time: 2073.423 ms (00:02.073)
Time: 2065.407 ms (00:02.065)
Time: 2066.536 ms (00:02.067) (~7% faster)
apple m2 pro/clang
master @ 9089287a
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1286.369 ms (00:01.286)
Time: 1300.534 ms (00:01.301)
Time: 1295.661 ms (00:01.296)
Time: 1296.404 ms (00:01.296)
Time: 1268.361 ms (00:01.268)
Time: 1259.321 ms (00:01.259)
master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1253.519 ms (00:01.254)
Time: 1235.256 ms (00:01.235)
Time: 1269.501 ms (00:01.270)
Time: 1267.801 ms (00:01.268)
Time: 1275.758 ms (00:01.276)
Time: 1261.478 ms (00:01.261) (a bit noisy but avg of ~1.8% faster)
David
Hi,
Hm, in some cases your patch is better, but in others both the old code
(8692f6644e7) and HEAD beat yours on my machine. TBH, not entirely sure why.
prep:
COPY (SELECT generate_series(1, 2000000) a, (random() * 100000 - 50000)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
benchmark:
psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints FROM '/tmp/lotsaints.copy';") -t 15
I disabled turbo mode, pinned the server to a single core of my Xeon Gold 5215:
HEAD: 812.690
your patch: 821.354
strtoint from 8692f6644e7: 824.543
strtoint from 6b423ec677d^: 806.678
(when I say strtoint from, I did not replace the goto labels, so that part is
unchanged and unrelated)
IOW, for me the code from 15 is the fastest by a good bit... There's an imul,
sure, but the fact that it sets a flag makes it faster than having to add more
tests and branches.
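(For reference on the flag-setting point: with compilers that have the
overflow builtins, the pg_mul_sNN_overflow helpers are thin wrappers along
these lines -- a sketch from memory of src/include/common/int.h, assuming
HAVE__BUILTIN_OP_OVERFLOW is defined:

static inline bool
pg_mul_s32_overflow(int32 a, int32 b, int32 *result)
{
	/* the multiply itself sets the CPU overflow flag; no extra compare */
	return __builtin_mul_overflow(a, b, result);
}

so the overflow test costs a single branch on a flag the imul produces
anyway.)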
Looking at a profile reminded me of how silly it is that we are wasting a good
chunk of the time in these isdigit() checks, even though we already rely on
the ASCII values (via *ptr++ - '0'). I think that's done in the headers
for some platforms, but not others (glibc). And we've even done so already for
octal and binary!
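(Concretely, "open coding" means an ASCII-only range test in place of the
ctype call, along the lines of the isdigit.patch attached downthread:

#define isdigit(c) ((c) >= '0' && (c) <= '9')

which avoids whatever lookup the platform's ctype implementation does.)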
Open coding isdigit() gives us:
HEAD: 797.434
your patch: 803.570
strtoint from 8692f6644e7: 778.943
strtoint from 6b423ec677d^: 777.741
It's somewhat odd that HEAD and your patch switch position here...
- else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+ /* process hex digits */
+ else if (ptr[1] == 'x' || ptr[1] == 'X')
{
firstdigit = ptr += 2;
I find this unnecessarily hard to read. I realize it's been added in
6fcda9aba83, but I don't see a reason to use such a construct here.
I find it somewhat grating how much duplication there now is in this
code due to the logic being repeated for all the bases...
Greetings,
Andres Freund
On Tue, 25 Jul 2023 at 17:34, Andres Freund <andres@anarazel.de> wrote:
prep:
COPY (SELECT generate_series(1, 2000000) a, (random() * 100000 - 50000)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
benchmark:
psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints FROM '/tmp/lotsaints.copy';") -t 15
I disabled turbo mode, pinned the server to a single core of my Xeon Gold 5215:
HEAD: 812.690
your patch: 821.354
strtoint from 8692f6644e7: 824.543
strtoint from 6b423ec677d^: 806.678
I'm surprised to see the imul version is faster. It's certainly not
what we found when working on 6b423ec67.
I've been fooling around a bit to try to learn what might be going on.
I wrote 2 patches:
1) pg_strtoint_imul.patch: Effectively reverts 6b423ec67 and puts the
code back the way it likely would have looked had I not done that work at all.
(I've assumed that the hex, octal, binary parsing would have been
added using the overflow multiplication functions just as the base 10
parsing).
2) pg_strtoint_optimize.patch: I've made another pass over the
functions with the current overflow checks and optimized the digit
checking code so that it can be done in a single check like: if (digit
< 10). This can be done by assigning the result of *ptr - '0' to an
unsigned char: anything less than '0' wraps around to a large value, and
anything above '9' stays >= 10. I've also removed a few inefficiencies
with the isspace checking. We didn't need to do "while (*ptr &&
isspace(*ptr))"; it's fine just to do while (isspace(*ptr)) since '\0'
isn't a space char. I also got rid of the isxdigit call. The
hexlookup array contains -1 for non-hex chars, so we may as well check
that the digit value is >= 0.
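(Two side notes on the above. First, the negative accumulation in the imul
patch matters for the boundary value: e.g. the int16 input "-32768" can't be
accumulated as +32768, which doesn't fit in int16, but building it as
0 -> -3 -> -32 -> -327 -> -3276 -> -32768 stays in range at every step, with
the pg_*_overflow helpers catching genuine overflow along the way.

Second, a minimal sketch of the unsigned-compare digit trick, written as a
hypothetical helper -- the patch inlines it rather than calling a function:

static inline bool
is_ascii_digit(char c)
{
	/* chars below '0' wrap to large values; chars above '9' stay >= 10 */
	return (unsigned char) (c - '0') < 10;
}

One unsigned comparison replaces the pair of signed range checks.)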
Here are the results I get using your test as quoted above:
master @ 62e9af4c63f + fix_COPY_DEFAULT.patch
latency average = 568.102 ms
master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
latency average = 531.238 ms
master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
latency average = 559.333 ms (surprisingly also faster on my machine)
The optimized version of the pg_strtoint functions wins over the imul
patch. Could you test to see if this is also the case in your Xeon
machine?
(when I say strtoint from, I did not replace the goto labels, so that part is
unchanged and unrelated)
I didn't quite follow this.
I've not really studied the fix_COPY_DEFAULT.patch patch. Is there a
reason to delay committing that? It would be good to eliminate that
as a variable for the current performance regression.
David
Attachments:
pg_strtoint_imul.patch (text/plain)
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..a6e9d00727 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -104,10 +104,10 @@ static const int8 hexlookup[128] = {
* pg_strtoint16() will throw ereport() upon bad input format or overflow;
* while pg_strtoint16_safe() instead returns such complaints in *escontext,
* if it's an ErrorSaveContext.
-*
- * NB: Accumulate input as an unsigned number, to deal with two's complement
+ *
+ * NB: Accumulate input as a negative number, to deal with two's complement
* representation of the most negative number, which can't be represented as a
- * signed positive number.
+ * positive number.
*/
int16
pg_strtoint16(const char *s)
@@ -120,7 +120,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
const char *ptr = s;
const char *firstdigit;
- uint16 tmp = 0;
+ int16 tmp = 0;
bool neg = false;
/* skip leading spaces */
@@ -145,10 +145,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (isxdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT16_MIN / 16)))
- goto out_of_range;
+ int8 digit = hexlookup[(unsigned char) *ptr++];
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ if (unlikely(pg_mul_s16_overflow(tmp, 16, &tmp)) ||
+ unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -169,10 +170,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '7')
{
- if (unlikely(tmp > -(PG_INT16_MIN / 8)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 8 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s16_overflow(tmp, 8, &tmp)) ||
+ unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -193,10 +195,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '1')
{
- if (unlikely(tmp > -(PG_INT16_MIN / 2)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 2 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s16_overflow(tmp, 2, &tmp)) ||
+ unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -217,10 +220,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (isdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT16_MIN / 10)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 10 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s16_overflow(tmp, 10, &tmp)) ||
+ unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -248,18 +252,15 @@ pg_strtoint16_safe(const char *s, Node *escontext)
if (unlikely(*ptr != '\0'))
goto invalid_syntax;
- if (neg)
+ if (!neg)
{
- /* check the negative equivalent will fit without overflowing */
- if (tmp > (uint16) (-(PG_INT16_MIN + 1)) + 1)
+ /* could fail if input is most negative number */
+ if (unlikely(tmp == PG_INT16_MIN))
goto out_of_range;
- return -((int16) tmp);
+ tmp = -tmp;
}
- if (tmp > PG_INT16_MAX)
- goto out_of_range;
-
- return (int16) tmp;
+ return tmp;
out_of_range:
ereturn(escontext, 0,
@@ -298,7 +299,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
const char *ptr = s;
const char *firstdigit;
- uint32 tmp = 0;
+ int32 tmp = 0;
bool neg = false;
/* skip leading spaces */
@@ -323,10 +324,11 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
if (isxdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT32_MIN / 16)))
- goto out_of_range;
+ int8 digit = hexlookup[(unsigned char) *ptr++];
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ if (unlikely(pg_mul_s32_overflow(tmp, 16, &tmp)) ||
+ unlikely(pg_sub_s32_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -347,10 +349,11 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '7')
{
- if (unlikely(tmp > -(PG_INT32_MIN / 8)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 8 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s32_overflow(tmp, 8, &tmp)) ||
+ unlikely(pg_sub_s32_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -371,10 +374,11 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '1')
{
- if (unlikely(tmp > -(PG_INT32_MIN / 2)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 2 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s32_overflow(tmp, 2, &tmp)) ||
+ unlikely(pg_sub_s32_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -395,10 +399,11 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
if (isdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT32_MIN / 10)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 10 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s32_overflow(tmp, 10, &tmp)) ||
+ unlikely(pg_sub_s32_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -426,18 +431,15 @@ pg_strtoint32_safe(const char *s, Node *escontext)
if (unlikely(*ptr != '\0'))
goto invalid_syntax;
- if (neg)
+ if (!neg)
{
- /* check the negative equivalent will fit without overflowing */
- if (tmp > (uint32) (-(PG_INT32_MIN + 1)) + 1)
+ /* could fail if input is most negative number */
+ if (unlikely(tmp == PG_INT32_MIN))
goto out_of_range;
- return -((int32) tmp);
+ tmp = -tmp;
}
- if (tmp > PG_INT32_MAX)
- goto out_of_range;
-
- return (int32) tmp;
+ return tmp;
out_of_range:
ereturn(escontext, 0,
@@ -461,9 +463,9 @@ invalid_syntax:
* while pg_strtoint64_safe() instead returns such complaints in *escontext,
* if it's an ErrorSaveContext.
*
- * NB: Accumulate input as an unsigned number, to deal with two's complement
+ * NB: Accumulate input as a negative number, to deal with two's complement
* representation of the most negative number, which can't be represented as a
- * signed positive number.
+ * positive number.
*/
int64
pg_strtoint64(const char *s)
@@ -476,7 +478,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
const char *ptr = s;
const char *firstdigit;
- uint64 tmp = 0;
+ int64 tmp = 0;
bool neg = false;
/* skip leading spaces */
@@ -501,10 +503,11 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
if (isxdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT64_MIN / 16)))
- goto out_of_range;
+ int8 digit = hexlookup[(unsigned char) *ptr++];
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ if (unlikely(pg_mul_s64_overflow(tmp, 16, &tmp)) ||
+ unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -525,10 +528,11 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '7')
{
- if (unlikely(tmp > -(PG_INT64_MIN / 8)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 8 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s64_overflow(tmp, 8, &tmp)) ||
+ unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -549,10 +553,11 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '1')
{
- if (unlikely(tmp > -(PG_INT64_MIN / 2)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 2 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s64_overflow(tmp, 2, &tmp)) ||
+ unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -573,10 +578,11 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
if (isdigit((unsigned char) *ptr))
{
- if (unlikely(tmp > -(PG_INT64_MIN / 10)))
- goto out_of_range;
+ int8 digit = (*ptr++ - '0');
- tmp = tmp * 10 + (*ptr++ - '0');
+ if (unlikely(pg_mul_s64_overflow(tmp, 10, &tmp)) ||
+ unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
+ goto out_of_range;
}
else if (*ptr == '_')
{
@@ -604,18 +610,15 @@ pg_strtoint64_safe(const char *s, Node *escontext)
if (unlikely(*ptr != '\0'))
goto invalid_syntax;
- if (neg)
+ if (!neg)
{
- /* check the negative equivalent will fit without overflowing */
- if (tmp > (uint64) (-(PG_INT64_MIN + 1)) + 1)
+ /* could fail if input is most negative number */
+ if (unlikely(tmp == PG_INT64_MIN))
goto out_of_range;
- return -((int64) tmp);
+ tmp = -tmp;
}
- if (tmp > PG_INT64_MAX)
- goto out_of_range;
-
- return (int64) tmp;
+ return tmp;
out_of_range:
ereturn(escontext, 0,
pg_strtoint_optimize.patch (text/plain)
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..12d8b7eb69 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -124,7 +124,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
bool neg = false;
/* skip leading spaces */
- while (likely(*ptr) && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -141,14 +141,17 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (isxdigit((unsigned char) *ptr))
+ int8 digit = hexlookup[(unsigned char) *ptr];
+
+ if (likely(digit >= 0))
{
if (unlikely(tmp > -(PG_INT16_MIN / 16)))
goto out_of_range;
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ tmp = tmp * 16 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -165,14 +168,17 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (*ptr >= '0' && *ptr <= '7')
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 8))
{
if (unlikely(tmp > -(PG_INT16_MIN / 8)))
goto out_of_range;
- tmp = tmp * 8 + (*ptr++ - '0');
+ tmp = tmp * 8 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -189,14 +195,17 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (*ptr >= '0' && *ptr <= '1')
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 2))
{
if (unlikely(tmp > -(PG_INT16_MIN / 2)))
goto out_of_range;
- tmp = tmp * 2 + (*ptr++ - '0');
+ tmp = tmp * 2 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -213,14 +222,17 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ while (1)
{
- if (isdigit((unsigned char) *ptr))
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 10))
{
if (unlikely(tmp > -(PG_INT16_MIN / 10)))
goto out_of_range;
- tmp = tmp * 10 + (*ptr++ - '0');
+ tmp = tmp * 10 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -302,7 +314,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
bool neg = false;
/* skip leading spaces */
- while (likely(*ptr) && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -319,14 +331,17 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (isxdigit((unsigned char) *ptr))
+ int8 digit = hexlookup[(unsigned char) *ptr];
+
+ if (likely(digit >= 0))
{
if (unlikely(tmp > -(PG_INT32_MIN / 16)))
goto out_of_range;
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ tmp = tmp * 16 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -343,14 +358,17 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (*ptr >= '0' && *ptr <= '7')
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 8))
{
if (unlikely(tmp > -(PG_INT32_MIN / 8)))
goto out_of_range;
- tmp = tmp * 8 + (*ptr++ - '0');
+ tmp = tmp * 8 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -367,14 +385,17 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (*ptr >= '0' && *ptr <= '1')
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 2))
{
if (unlikely(tmp > -(PG_INT32_MIN / 2)))
goto out_of_range;
- tmp = tmp * 2 + (*ptr++ - '0');
+ tmp = tmp * 2 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -391,14 +412,17 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ while (1)
{
- if (isdigit((unsigned char) *ptr))
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 10))
{
if (unlikely(tmp > -(PG_INT32_MIN / 10)))
goto out_of_range;
- tmp = tmp * 10 + (*ptr++ - '0');
+ tmp = tmp * 10 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -420,7 +444,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
@@ -480,7 +504,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
bool neg = false;
/* skip leading spaces */
- while (*ptr && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -497,14 +521,17 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (isxdigit((unsigned char) *ptr))
+ int8 digit = hexlookup[(unsigned char) *ptr];
+
+ if (likely(digit >= 0))
{
if (unlikely(tmp > -(PG_INT64_MIN / 16)))
goto out_of_range;
- tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+ tmp = tmp * 16 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -521,14 +548,17 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (*ptr >= '0' && *ptr <= '7')
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 8))
{
if (unlikely(tmp > -(PG_INT64_MIN / 8)))
goto out_of_range;
- tmp = tmp * 8 + (*ptr++ - '0');
+ tmp = tmp * 8 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -545,14 +575,17 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ while (1)
{
- if (*ptr >= '0' && *ptr <= '1')
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 2))
{
if (unlikely(tmp > -(PG_INT64_MIN / 2)))
goto out_of_range;
- tmp = tmp * 2 + (*ptr++ - '0');
+ tmp = tmp * 2 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -569,14 +602,17 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ while (1)
{
- if (isdigit((unsigned char) *ptr))
+ unsigned char digit = *ptr - '0';
+
+ if (likely(digit < 10))
{
if (unlikely(tmp > -(PG_INT64_MIN / 10)))
goto out_of_range;
- tmp = tmp * 10 + (*ptr++ - '0');
+ tmp = tmp * 10 + digit;
+ ptr++;
}
else if (*ptr == '_')
{
@@ -598,7 +634,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
Hi,
On 2023-07-25 23:37:08 +1200, David Rowley wrote:
On Tue, 25 Jul 2023 at 17:34, Andres Freund <andres@anarazel.de> wrote:
I've not really studied the fix_COPY_DEFAULT.patch patch. Is there a
reason to delay committing that? It would be good to eliminate that
as a variable for the current performance regression.
Yea, I don't think there's a reason to hold off on that. Sawada-san, do you
want to commit it? Or Andrew?
prep:
COPY (SELECT generate_series(1, 2000000) a, (random() * 100000 - 50000)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
benchmark:
psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints FROM '/tmp/lotsaints.copy';") -t 15
I disabled turbo mode, pinned the server to a single core of my Xeon Gold 5215:
HEAD: 812.690
your patch: 821.354
strtoint from 8692f6644e7: 824.543
strtoint from 6b423ec677d^: 806.678
I'm surprised to see the imul version is faster. It's certainly not
what we found when working on 6b423ec67.
What CPUs did you test it on? I'd not be surprised if this were heavily
dependent on the microarchitecture.
I've been fooling around a bit to try to learn what might be going on.
I wrote 2 patches:
1) pg_strtoint_imul.patch: Effectively reverts 6b423ec67 and puts the
code how it likely would have looked had I not done that work at all.
(I've assumed that the hex, octal, binary parsing would have been
added using the overflow multiplication functions just as the base 10
parsing).
2) pg_strtoint_optimize.patch: I've made another pass over the
functions with the current overflow checks and optimized the digit
checking code so that it can be done in a single check like: if (digit
< 10). This can be done by assigning the result of *ptr - '0' to an
unsigned char. Anything less than '0' will wrap around and anything
above '9' will remain so. I've also removed a few inefficiencies with
the isspace checking. We didn't need to do "while (*ptr &&
isspace(*ptr)). It's fine just to do while (isspace(*ptr)) since '\0'
isn't a space char. I also got rid of the isxdigit call. The
hexlookup array contains -1 for non-hex chars. We may as well check
the digit value is >= 0.
Here are the results I get using your test as quoted above:
master @ 62e9af4c63f + fix_COPY_DEFAULT.patch
latency average = 568.102 ms
master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
latency average = 531.238 ms
master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
latency average = 559.333 ms (surprisingly also faster on my machine)
The optimized version of the pg_strtoint functions wins over the imul
patch. Could you test to see if this is also the case in your Xeon
machine?
(these are the numbers with turbo disabled; if I enable it, they're all in
the 6xx ms range, but the variance is higher)
fix_COPY_DEFAULT.patch
774.344
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch
777.673
fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
777.545
fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
795.298
fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch + likely(isdigit())
773.477
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch
774.443
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch + likely(isdigit())
774.513
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch + likely(isdigit()) + unlikely(*ptr == '_')
763.879
One idea I had was to add a fastpath that won't parse all strings, but will
parse the strings that we would generate, and fall back to the more general
variant if it fails. See the attached, rough, prototype:
fix_COPY_DEFAULT.patch + fastpath.patch:
746.971
fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
715.570
Now, the precise contents of this fastpath are not yet clear (wrt imul or
not), but I think the idea has promise.
(when I say strtoint from, I did not replace the goto labels, so that part is
unchanged and unrelated)
I didn't quite follow this.
I meant that I didn't undo ereport()->ereturn().
Greetings,
Andres Freund
Attachments:
fastpath.patch (text/x-diff)
diff --git c/src/backend/utils/adt/numutils.c i/src/backend/utils/adt/numutils.c
index 471fbb7ee63..52bbff801d0 100644
--- c/src/backend/utils/adt/numutils.c
+++ i/src/backend/utils/adt/numutils.c
@@ -301,6 +301,53 @@ pg_strtoint32_safe(const char *s, Node *escontext)
uint32 tmp = 0;
bool neg = false;
+ /*
+ * Fast path recognizing the most common output.
+ */
+ if (1)
+ {
+ int32 tmp_s = 0;
+
+ /* leading spaces are uncommon => slow path */
+
+ /* handle - sign, + is uncommon => slow path */
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* require at least one digit */
+ if (unlikely(!isdigit((unsigned char) *ptr)))
+ goto slow;
+
+ /* process digits */
+ while (*ptr && isdigit((unsigned char) *ptr))
+ {
+ int8 digit = (*ptr++ - '0');
+
+ if (unlikely(pg_mul_s32_overflow(tmp_s, 10, &tmp_s)) ||
+ unlikely(pg_sub_s32_overflow(tmp_s, digit, &tmp_s)))
+ goto out_of_range;
+ }
+
+ /* trailing spaces are uncommon => slow path */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (!neg)
+ {
+ /* could fail if input is most negative number */
+ if (unlikely(tmp == PG_INT32_MIN))
+ goto out_of_range;
+ tmp = -tmp;
+ }
+
+ return tmp;
+ }
+
+slow:
+
/* skip leading spaces */
while (likely(*ptr) && isspace((unsigned char) *ptr))
ptr++;
isdigit.patch (text/x-diff)
diff --git i/src/backend/utils/adt/numutils.c w/src/backend/utils/adt/numutils.c
index 52bbff801d0..2df6a50371d 100644
--- i/src/backend/utils/adt/numutils.c
+++ w/src/backend/utils/adt/numutils.c
@@ -293,6 +293,11 @@ pg_strtoint32(const char *s)
return pg_strtoint32_safe(s, NULL);
}
+#if 1
+#undef isdigit
+#define isdigit(c) ((c) >= '0' && c <= '9')
+#endif
+
int32
pg_strtoint32_safe(const char *s, Node *escontext)
{
Hi,
On 2023-07-25 08:50:19 -0700, Andres Freund wrote:
One idea I had was to add a fastpath that won't parse all strings, but will
parse the strings that we would generate, and fall back to the more general
variant if it fails. See the attached, rough, prototype:
fix_COPY_DEFAULT.patch + fastpath.patch:
746.971
fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
715.570
Now, the precise contents of this fastpath are not yet clear (wrt imul or
not), but I think the idea has promise.
Btw, I strongly suspect that fastpath wants to be branchless SSE when it grows
up.
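(To sketch the idea -- hypothetical, not part of any patch here: SSE2 can
classify 16 input bytes as digits at once, producing a bitmask instead of a
per-byte branch. A real version would also have to ensure the 16-byte load
can't run off the end of the input buffer; this assumes gcc/clang for
__builtin_ctz:

#include <emmintrin.h>

/* count leading ASCII digits in the 16 bytes at p; the comparisons flag
 * every byte outside '0'..'9' (including the NUL terminator) */
static inline int
leading_digits16(const char *p)
{
	__m128i chunk = _mm_loadu_si128((const __m128i *) p);
	__m128i too_low = _mm_cmplt_epi8(chunk, _mm_set1_epi8('0'));
	__m128i too_high = _mm_cmpgt_epi8(chunk, _mm_set1_epi8('9'));
	int nondigit = _mm_movemask_epi8(_mm_or_si128(too_low, too_high));

	return nondigit ? __builtin_ctz(nondigit) : 16;
}

The digit bytes could then be combined without branching, e.g. via the SSSE3
multiply-add intrinsics, rather than a scalar accumulation loop.)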
Greetings,
Andres Freund
On 2023-07-25 23:37:08 +1200, David Rowley wrote:
On Tue, 25 Jul 2023 at 17:34, Andres Freund <andres@anarazel.de> wrote:
HEAD: 812.690
your patch: 821.354
strtoint from 8692f6644e7: 824.543
strtoint from 6b423ec677d^: 806.678
I'm surprised to see the imul version is faster. It's certainly not
what we found when working on 6b423ec67.
What CPUs did you test it on? I'd not be surprised if this were heavily
dependent on the microarchitecture.
This was on AMD 3990x.
One idea I had was to add a fastpath that won't parse all strings, but will
parse the strings that we would generate, and fall back to the more general
variant if it fails. See the attached, rough, prototype:
There were a couple of problems with fastpath.patch. You need to
reset the position of ptr at the start of the slow path and also you
were using tmp in the if (neg) part instead of tmp_s in the fast path
section.
I fixed that up and made two versions of the patch, one using the
overflow functions (pg_strtoint_fastpath1.patch) and one testing if
the number is going to overflow (same as current master)
(pg_strtoint_fastpath2.patch)
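(As a cross-check on fastpath2's guard, worked through: -(PG_INT32_MIN / 10)
is 214748364, and the test runs before each multiply, so whenever the
multiply happens, tmp * 10 + digit is at most 214748364 * 10 + 9 =
2147483649. That still fits in uint32, and the final range checks then
reject anything beyond the int32 range, so the uint32 accumulator can never
wrap.)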
AMD 3990x:
master + fix_COPY_DEFAULT.patch:
latency average = 525.226 ms
master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 488.171 ms
master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 481.827 ms
Apple M2 Pro:
master + fix_COPY_DEFAULT.patch:
latency average = 348.433 ms
master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 336.778 ms
master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 335.992 ms
Zen 4 7945HX CPU:
master + fix_COPY_DEFAULT.patch:
latency average = 296.881 ms
master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 287.052 ms
master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 280.742 ms
The M2 chip does not seem to be clearly faster with the fastpath2
method of overflow checking, but both AMD CPUs seem pretty set on
fastpath2 being faster.
It would be really good if someone with a newish Intel CPU
could test this too.
David
Attachments:
pg_strtoint_fastpath1.patch (text/plain)
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..a50c5ccd4a 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -298,9 +298,70 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
const char *ptr = s;
const char *firstdigit;
- uint32 tmp = 0;
+ uint32 tmp;
+ int32 tmp_s = 0;
bool neg = false;
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case and only fallback on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process digits */
+ for (;;)
+ {
+ unsigned char digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper
+ * and lower bounds of the digit
+ */
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(pg_mul_s32_overflow(tmp_s, 10, &tmp_s)) ||
+ unlikely(pg_sub_s32_overflow(tmp_s, digit, &tmp_s)))
+ goto out_of_range;
+ }
+
+ /* we need at least one digit */
+ if (unlikely(ptr == s))
+ goto slow;
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (!neg)
+ {
+ /* could fail if input is most negative number */
+ if (unlikely(tmp_s == PG_INT32_MIN))
+ goto out_of_range;
+ tmp_s = -tmp_s;
+ }
+
+ return tmp_s;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
+
/* skip leading spaces */
while (likely(*ptr) && isspace((unsigned char) *ptr))
ptr++;
pg_strtoint_fastpath2.patch (text/plain)
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..bb48e5157e 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -301,6 +301,70 @@ pg_strtoint32_safe(const char *s, Node *escontext)
uint32 tmp = 0;
bool neg = false;
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case and only fallback on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process digits */
+ for (;;)
+ {
+ unsigned char digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper
+ * and lower bounds of the digit
+ */
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(tmp > -(PG_INT32_MIN / 10)))
+ goto out_of_range;
+
+ tmp = tmp * 10 + digit;
+ }
+
+ /* we need at least one digit */
+ if (unlikely(ptr == s))
+ goto slow;
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (neg)
+ {
+ /* check the negative equivalent will fit without overflowing */
+ if (unlikely(tmp > (uint32) (-(PG_INT32_MIN + 1)) + 1))
+ goto out_of_range;
+ return -((int32) tmp);
+ }
+
+ if (unlikely(tmp > PG_INT32_MAX))
+ goto out_of_range;
+
+ return (int32) tmp;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
+
/* skip leading spaces */
while (likely(*ptr) && isspace((unsigned char) *ptr))
ptr++;
On Wed, 26 Jul 2023 at 03:50, Andres Freund <andres@anarazel.de> wrote:
On 2023-07-25 23:37:08 +1200, David Rowley wrote:
On Tue, 25 Jul 2023 at 17:34, Andres Freund <andres@anarazel.de> wrote:
I've not really studied the fix_COPY_DEFAULT.patch patch. Is there a
reason to delay committing that? It would be good to eliminate that
as a variable for the current performance regression.
Yea, I don't think there's a reason to hold off on that. Sawada-san, do you
want to commit it? Or Andrew?
Just to keep this moving and to make it easier for people to test the
pg_strtoint patches, I've pushed the fix_COPY_DEFAULT.patch patch.
The only thing I changed was to move the line that was allocating the
array to a location more aligned with the order that the fields are
defined in the struct.
David
On Thu, 27 Jul 2023 at 14:51, David Rowley <dgrowleyml@gmail.com> wrote:
Just to keep this moving and to make it easier for people to test the
pg_strtoint patches, I've pushed the fix_COPY_DEFAULT.patch patch.
The only thing I changed was to move the line that was allocating the
array to a location more aligned with the order that the fields are
defined in the struct.
I just did another round of benchmarking to see where we're at after
fix_COPY_DEFAULT.patch has been committed.
Below I've benchmarked REL_15_STABLE up to REL_16_STABLE with some
hand-selected key commits, many of which have been mentioned on this
thread already because they seem to affect the performance of COPY.
To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
AMD 3990x machine. Today's REL_16_STABLE takes 530.344 ms. We're
talking about another patch to speed up the pg_strtoint functions
which gets this down to 483.790 ms. Do we need to do this for v16 or
can we just leave this as it is already? The slowdown does not seem
to be much above what we'd ordinarily classify as noise using this
test on my machine.
Benchmark setup:
COPY (SELECT generate_series(1, 2000000) a, (random() * 100000 -
50000)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
Benchmark:
psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY
lotsaints FROM '/tmp/lotsaints.copy';") -t 15
2864eb977 REL_15_STABLE
latency average = 526.014 ms
8.84% postgres [.] pg_strtoint32
29452de73 Sat Dec 3 10:50:39 2022 -0500 The commit before "Improve
performance of pg_strtointNN functions"
latency average = 508.453 ms
10.21% postgres [.] pg_strtoint32
6b423ec67 Sun Dec 4 16:18:18 2022 +1300 Improve performance of
pg_strtointNN functions
latency average = 492.943 ms
7.73% postgres [.] pg_strtoint32
1939d2628 Fri Dec 9 10:08:44 2022 -0500 The commit before "Convert a
few datatype input functions to use "soft" error reporting."
latency average = 485.016 ms
8.43% postgres [.] pg_strtoint32
ccff2d20e Fri Dec 9 10:14:53 2022 -0500 Convert a few datatype input
functions to use "soft" error reporting.
latency average = 501.325 ms
6.90% postgres [.] pg_strtoint32_safe
60684dd83 Tue Dec 13 17:33:28 2022 -0800 The commit before
"Non-decimal integer literals"
latency average = 500.889 ms
8.27% postgres [.] pg_strtoint32_safe
6fcda9aba Wed Dec 14 05:40:38 2022 +0100 Non-decimal integer literals
latency average = 521.904 ms
9.02% postgres [.] pg_strtoint32_safe
1b6f632a3 Sat Feb 4 07:56:09 2023 +0100 The commit before "Allow
underscores in integer and numeric constants."
latency average = 523.195 ms
9.21% postgres [.] pg_strtoint32_safe
faff8f8e4 Sat Feb 4 09:48:51 2023 +0000 Allow underscores in integer
and numeric constants.
latency average = 493.064 ms
10.25% postgres [.] pg_strtoint32_safe
9f8377f7a Mon Mar 13 10:01:56 2023 -0400 Add a DEFAULT option to COPY FROM
latency average = 597.617 ms
12.91% postgres [.] CopyReadLine
10.62% postgres [.] CopyReadAttributesText
10.51% postgres [.] pg_strtoint32_safe
7.97% postgres [.] NextCopyFrom
REL_16_STABLE @ c1308ce2d Thu Jul 27 14:48:44 2023 +1200 Fix
performance problem with new COPY DEFAULT code
latency average = 530.344 ms
13.51% postgres [.] CopyReadLine
9.62% postgres [.] pg_strtoint32_safe
8.97% postgres [.] CopyReadAttributesText
8.43% postgres [.] NextCopyFrom
REL_16_STABLE + pg_strtoint_fastpath1.patch
latency average = 493.136 ms
13.79% postgres [.] CopyReadLine
11.82% postgres [.] CopyReadAttributesText
7.07% postgres [.] NextCopyFrom
6.81% postgres [.] pg_strtoint32_safe
REL_16_STABLE + pg_strtoint_fastpath2.patch
latency average = 483.790 ms
13.87% postgres [.] CopyReadLine
10.40% postgres [.] CopyReadAttributesText
8.22% postgres [.] NextCopyFrom
5.52% postgres [.] pg_strtoint32_safe
David
On Thu, Jul 27, 2023 at 7:17 AM David Rowley <dgrowleyml@gmail.com> wrote:
It would be really good if someone with a newish Intel CPU
could test this too.
I ran the lotsaints test from the last email on an i7-10750H (~3 years old)
and got these results (gcc 13.1, turbo off):
REL_15_STABLE:
latency average = 956.453 ms
latency stddev = 4.854 ms
REL_16_STABLE @ 695f5deb7902 (28-JUL-2023):
latency average = 999.354 ms
latency stddev = 3.611 ms
master @ 39055cb4cc (31-JUL-2023):
latency average = 995.310 ms
latency stddev = 5.176 ms
master + revert c1308ce2d (the replace-palloc0 fix)
latency average = 1080.909 ms
latency stddev = 8.707 ms
master + pg_strtoint_fastpath1.patch
latency average = 938.146 ms
latency stddev = 9.354 ms
master + pg_strtoint_fastpath2.patch
latency average = 902.808 ms
latency stddev = 3.957 ms
For me, PG16 seems to regress from PG15, and the second patch seems faster
than the first.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2023-07-27 20:53:16 +1200, David Rowley wrote:
To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
AMD 3990x machine. Today's REL_16_STABLE takes 530.344 ms. We're
talking about another patch to speed up the pg_strtoint functions
which gets this down to 483.790 ms. Do we need to do this for v16 or
can we just leave this as it is already? The slowdown does not seem
to be much above what we'd ordinarily classify as noise using this
test on my machine.
I think we need to do something for 16 - it appears on recent-ish AMD the
regression is quite a bit smaller than on Intel. You see something < 1%, I
see more like 4%. I think there are also other cases where the slowdown is
more substantial.
Besides Intel vs. AMD, it also looks like the gcc version might make a
difference. The code generated by gcc 13 is noticeably slower than gcc 12's
for me...
Benchmark setup:
COPY (SELECT generate_series(1, 2000000) a, (random() * 100000 -
50000)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
There are a lot of larger numbers in the file, which likely reduces the
impact some. And there's the overhead of actually inserting the rows into
the table, making the difference appear smaller than it is.
If I avoid the actual insert into the table and use more columns, I see
about a 10% regression here.
COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM generate_series(1, 10000)) TO '/tmp/lotsaints_wide.copy';
psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM '/tmp/lotsaints_wide.copy' WHERE false") -t 5
15: 2992.605
HEAD: 3325.201
fastpath1.patch 2932.606
fastpath2.patch 2783.915
Interestingly fastpath1 is slower now, even though it wasn't with earlier
patches (which still is repeatable). I do not have the foggiest as to why.
Greetings,
Andres Freund
On Mon, 31 Jul 2023 at 21:39, John Naylor <john.naylor@enterprisedb.com> wrote:
master + pg_strtoint_fastpath1.patch
latency average = 938.146 ms
latency stddev = 9.354 ms
master + pg_strtoint_fastpath2.patch
latency average = 902.808 ms
latency stddev = 3.957 ms
Thanks for checking those two on your machine. I'm glad to see
fastpath2 faster on yours too.
David
On Tue, 1 Aug 2023 at 13:25, Andres Freund <andres@anarazel.de> wrote:
There's a lot of larger numbers in the file, which likely reduces the impact
some. And there's the overhead of actually inserting the rows into the table,
making the difference appear smaller than it is.
It might be worth special casing the first digit as we can save doing
the multiplication by 10 and the overflow checks on the first digit.
That should make it slightly faster to parse smaller numbers.
COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM generate_series(1, 10000)) TO '/tmp/lotsaints_wide.copy';
psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM '/tmp/lotsaints_wide.copy' WHERE false") -t 5
15: 2992.605
HEAD: 3325.201
fastpath1.patch 2932.606
fastpath2.patch 2783.915
Interestingly fastpath1 is slower now, even though it wasn't with earlier
patches (which still is repeatable). I do not have the foggiest as to why.
I'm glad to see that.
I've adjusted the patch to add the fast path for the 16 and 64-bit
versions of the function. I also added the special case for
processing the first digit, which looks like:
/* process the first digit */
digit = (*ptr - '0');
if (likely(digit < 10))
{
ptr++;
tmp = digit;
}
/* process remaining digits */
for (;;)
I tried adding the "at least 1 digit check" by adding an else { goto
slow; } in the above code, but it seems to generate slower code than
just checking if (unlikely(ptr == s)) { goto slow; } after the loop.
I also noticed that I wasn't getting the same performance after
adjusting the 16 and 64 bit versions. I assume that's down to code
alignment, but unsure of that. I ended up adjusting all the "while
(*ptr)" loops into "for (;;)" loops
since the NUL char check is handled by the "else break;". I also
removed the needless NUL char check in the isspace loops. It can't be
isspace and '\0'. I also replaced the isdigit() function call and
replaced it for manually checking the digit range. I see my compiler
(gcc12.2) effectively generates the same code as the unsigned char
fast path version checking if (digit < 10). Once I did that, I got the
performance back again.
With your new test with the small-sized ints, I get:
REL_15_STABLE:
latency average = 1696.390 ms
master @ d3a38318a
latency average = 1928.803 ms
master + fastpath1.patch:
latency average = 1634.736 ms
master + fastpath2.patch:
latency average = 1628.704 ms
master + fastpath3.patch
latency average = 1590.417 ms
I see no particular reason not to go ahead with the attached patch and
get this thread closed off. Any objections?
David
Attachments:
pg_strtoint_fastpath3.patch (application/octet-stream)
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..ce259efed9 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -122,9 +122,83 @@ pg_strtoint16_safe(const char *s, Node *escontext)
const char *firstdigit;
uint16 tmp = 0;
bool neg = false;
+ unsigned char digit;
+
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case and only fallback on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores if the fastpath version cannot parse the string.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process the first digit */
+ digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper and
+ * lower bounds of the digit.
+ */
+ if (likely(digit < 10))
+ {
+ ptr++;
+ tmp = digit;
+ }
+
+ /* process remaining digits */
+ for (;;)
+ {
+ digit = (*ptr - '0');
+
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(tmp > -(PG_INT16_MIN / 10)))
+ goto out_of_range;
+
+ tmp = tmp * 10 + digit;
+ }
+
+ /* we need at least one digit */
+ if (unlikely(ptr == s))
+ goto slow;
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (neg)
+ {
+ /* check the negative equivalent will fit without overflowing */
+ if (unlikely(tmp > (uint16) (-(PG_INT16_MIN + 1)) + 1))
+ goto out_of_range;
+ return -((int16) tmp);
+ }
+
+ if (unlikely(tmp > PG_INT16_MAX))
+ goto out_of_range;
+
+ return (int16) tmp;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
/* skip leading spaces */
- while (likely(*ptr) && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -141,7 +215,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (isxdigit((unsigned char) *ptr))
{
@@ -165,7 +239,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '7')
{
@@ -189,7 +263,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '1')
{
@@ -213,9 +287,9 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ for (;;)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '9')
{
if (unlikely(tmp > -(PG_INT16_MIN / 10)))
goto out_of_range;
@@ -242,7 +316,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
@@ -300,9 +374,83 @@ pg_strtoint32_safe(const char *s, Node *escontext)
const char *firstdigit;
uint32 tmp = 0;
bool neg = false;
+ unsigned char digit;
+
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case and only fallback on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores if the fastpath version cannot parse the string.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process the first digit */
+ digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper and
+ * lower bounds of the digit.
+ */
+ if (likely(digit < 10))
+ {
+ ptr++;
+ tmp = digit;
+ }
+
+ /* process remaining digits */
+ for (;;)
+ {
+ digit = (*ptr - '0');
+
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(tmp > -(PG_INT32_MIN / 10)))
+ goto out_of_range;
+
+ tmp = tmp * 10 + digit;
+ }
+
+ /* we need at least one digit */
+ if (unlikely(ptr == s))
+ goto slow;
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (neg)
+ {
+ /* check the negative equivalent will fit without overflowing */
+ if (unlikely(tmp > (uint32) (-(PG_INT32_MIN + 1)) + 1))
+ goto out_of_range;
+ return -((int32) tmp);
+ }
+
+ if (unlikely(tmp > PG_INT32_MAX))
+ goto out_of_range;
+
+ return (int32) tmp;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
/* skip leading spaces */
- while (likely(*ptr) && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -319,7 +467,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (isxdigit((unsigned char) *ptr))
{
@@ -343,7 +491,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '7')
{
@@ -367,7 +515,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '1')
{
@@ -391,9 +539,9 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ for (;;)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '9')
{
if (unlikely(tmp > -(PG_INT32_MIN / 10)))
goto out_of_range;
@@ -420,7 +568,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
@@ -478,9 +626,83 @@ pg_strtoint64_safe(const char *s, Node *escontext)
const char *firstdigit;
uint64 tmp = 0;
bool neg = false;
+ unsigned char digit;
+
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case and only fallback on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores if the fastpath version cannot parse the string.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process the first digit */
+ digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper and
+ * lower bounds of the digit.
+ */
+ if (likely(digit < 10))
+ {
+ ptr++;
+ tmp = digit;
+ }
+
+ /* process remaining digits */
+ for (;;)
+ {
+ digit = (*ptr - '0');
+
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(tmp > -(PG_INT64_MIN / 10)))
+ goto out_of_range;
+
+ tmp = tmp * 10 + digit;
+ }
+
+ /* we need at least one digit */
+ if (unlikely(ptr == s))
+ goto slow;
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (neg)
+ {
+ /* check the negative equivalent will fit without overflowing */
+ if (unlikely(tmp > (uint64) (-(PG_INT64_MIN + 1)) + 1))
+ goto out_of_range;
+ return -((int64) tmp);
+ }
+
+ if (unlikely(tmp > PG_INT64_MAX))
+ goto out_of_range;
+
+ return (int64) tmp;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
/* skip leading spaces */
- while (*ptr && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -497,7 +719,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (isxdigit((unsigned char) *ptr))
{
@@ -521,7 +743,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '7')
{
@@ -545,7 +767,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '1')
{
@@ -569,9 +791,9 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ for (;;)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '9')
{
if (unlikely(tmp > -(PG_INT64_MIN / 10)))
goto out_of_range;
@@ -598,7 +820,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
On Tue, 1 Aug 2023 at 13:55, David Rowley <dgrowleyml@gmail.com> wrote:
I tried adding the "at least 1 digit check" by adding an else { goto
slow; } in the above code, but it seems to generate slower code than
just checking if (unlikely(ptr == s)) { goto slow; } after the loop.
That check isn't quite right, because "ptr" will not equal "s" if
there is a sign character, so it won't detect an input with no digits
in that case (e.g. a bare "-" would take the fast path and be returned as zero).
Regards,
Dean
On Wed, 2 Aug 2023 at 01:26, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
On Tue, 1 Aug 2023 at 13:55, David Rowley <dgrowleyml@gmail.com> wrote:
I tried adding the "at least 1 digit check" by adding an else { goto
slow; } in the above code, but it seems to generate slower code than
just checking if (unlikely(ptr == s)) { goto slow; } after the loop.
That check isn't quite right, because "ptr" will not equal "s" if
there is a sign character, so it won't detect an input with no digits
in that case.
Ah, yeah. Thanks.
Here's a patch with an else condition when the first digit check fails.
master + fastpath4.patch:
latency average = 1579.576 ms
latency average = 1572.716 ms
latency average = 1563.398 ms
(appears slightly faster than fastpath3.patch)
David
Attachments:
pg_strtoint_fastpath4.patch (application/octet-stream)
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..7067bc5d48 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -122,9 +122,84 @@ pg_strtoint16_safe(const char *s, Node *escontext)
const char *firstdigit;
uint16 tmp = 0;
bool neg = false;
+ unsigned char digit;
+
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case and only fallback on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores if the fastpath version cannot parse the string.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process the first digit */
+ digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper and
+ * lower bounds of the digit.
+ */
+ if (likely(digit < 10))
+ {
+ ptr++;
+ tmp = digit;
+ }
+ else
+ {
+ /* we need at least one digit */
+ goto slow;
+ }
+
+ /* process remaining digits */
+ for (;;)
+ {
+ digit = (*ptr - '0');
+
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(tmp > -(PG_INT16_MIN / 10)))
+ goto out_of_range;
+
+ tmp = tmp * 10 + digit;
+ }
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (neg)
+ {
+ /* check the negative equivalent will fit without overflowing */
+ if (unlikely(tmp > (uint16) (-(PG_INT16_MIN + 1)) + 1))
+ goto out_of_range;
+ return -((int16) tmp);
+ }
+
+ if (unlikely(tmp > PG_INT16_MAX))
+ goto out_of_range;
+
+ return (int16) tmp;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
/* skip leading spaces */
- while (likely(*ptr) && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -141,7 +216,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (isxdigit((unsigned char) *ptr))
{
@@ -165,7 +240,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '7')
{
@@ -189,7 +264,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '1')
{
@@ -213,9 +288,9 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ for (;;)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '9')
{
if (unlikely(tmp > -(PG_INT16_MIN / 10)))
goto out_of_range;
@@ -242,7 +317,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
@@ -300,9 +375,84 @@ pg_strtoint32_safe(const char *s, Node *escontext)
const char *firstdigit;
uint32 tmp = 0;
bool neg = false;
+ unsigned char digit;
+
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case and only fallback on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores if the fastpath version cannot parse the string.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process the first digit */
+ digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper and
+ * lower bounds of the digit.
+ */
+ if (likely(digit < 10))
+ {
+ ptr++;
+ tmp = digit;
+ }
+ else
+ {
+ /* we need at least one digit */
+ goto slow;
+ }
+
+ /* process remaining digits */
+ for (;;)
+ {
+ digit = (*ptr - '0');
+
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(tmp > -(PG_INT32_MIN / 10)))
+ goto out_of_range;
+
+ tmp = tmp * 10 + digit;
+ }
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (neg)
+ {
+ /* check the negative equivalent will fit without overflowing */
+ if (unlikely(tmp > (uint32) (-(PG_INT32_MIN + 1)) + 1))
+ goto out_of_range;
+ return -((int32) tmp);
+ }
+
+ if (unlikely(tmp > PG_INT32_MAX))
+ goto out_of_range;
+
+ return (int32) tmp;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
/* skip leading spaces */
- while (likely(*ptr) && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -319,7 +469,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (isxdigit((unsigned char) *ptr))
{
@@ -343,7 +493,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '7')
{
@@ -367,7 +517,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '1')
{
@@ -391,9 +541,9 @@ pg_strtoint32_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ for (;;)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '9')
{
if (unlikely(tmp > -(PG_INT32_MIN / 10)))
goto out_of_range;
@@ -420,7 +570,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
@@ -478,9 +628,84 @@ pg_strtoint64_safe(const char *s, Node *escontext)
const char *firstdigit;
uint64 tmp = 0;
bool neg = false;
+ unsigned char digit;
+
+ /*
+ * The majority of cases are likely to be base-10 digits without any
+ * underscore separator characters. We'll first try to parse the string with
+ * the assumption that's the case, and only fall back on a slower
+ * implementation which handles hex, octal and binary strings and
+ * underscores if the fastpath version cannot parse the string.
+ */
+
+ /* leave it up to the slow path to look for leading spaces */
+
+ if (*ptr == '-')
+ {
+ ptr++;
+ neg = true;
+ }
+
+ /* a leading '+' is uncommon so leave that for the slow path */
+
+ /* process the first digit */
+ digit = (*ptr - '0');
+
+ /*
+ * Exploit unsigned arithmetic to save having to check both the upper and
+ * lower bounds of the digit.
+ */
+ if (likely(digit < 10))
+ {
+ ptr++;
+ tmp = digit;
+ }
+ else
+ {
+ /* we need at least one digit */
+ goto slow;
+ }
+
+ /* process remaining digits */
+ for (;;)
+ {
+ digit = (*ptr - '0');
+
+ if (digit >= 10)
+ break;
+
+ ptr++;
+
+ if (unlikely(tmp > -(PG_INT64_MIN / 10)))
+ goto out_of_range;
+
+ tmp = tmp * 10 + digit;
+ }
+
+ /* when the string does not end in a digit, let the slow path handle it */
+ if (unlikely(*ptr != '\0'))
+ goto slow;
+
+ if (neg)
+ {
+ /* check the negative equivalent will fit without overflowing */
+ if (unlikely(tmp > (uint64) (-(PG_INT64_MIN + 1)) + 1))
+ goto out_of_range;
+ return -((int64) tmp);
+ }
+
+ if (unlikely(tmp > PG_INT64_MAX))
+ goto out_of_range;
+
+ return (int64) tmp;
+
+slow:
+ tmp = 0;
+ ptr = s;
+ /* no need to reset neg */
/* skip leading spaces */
- while (*ptr && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
/* handle sign */
@@ -497,7 +722,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (isxdigit((unsigned char) *ptr))
{
@@ -521,7 +746,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '7')
{
@@ -545,7 +770,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr += 2;
- while (*ptr)
+ for (;;)
{
if (*ptr >= '0' && *ptr <= '1')
{
@@ -569,9 +794,9 @@ pg_strtoint64_safe(const char *s, Node *escontext)
{
firstdigit = ptr;
- while (*ptr)
+ for (;;)
{
- if (isdigit((unsigned char) *ptr))
+ if (*ptr >= '0' && *ptr <= '9')
{
if (unlikely(tmp > -(PG_INT64_MIN / 10)))
goto out_of_range;
@@ -598,7 +823,7 @@ pg_strtoint64_safe(const char *s, Node *escontext)
goto invalid_syntax;
/* allow trailing whitespace, but not other trailing chars */
- while (*ptr != '\0' && isspace((unsigned char) *ptr))
+ while (isspace((unsigned char) *ptr))
ptr++;
if (unlikely(*ptr != '\0'))
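To see the tricks the patch relies on in isolation, here is a minimal standalone C sketch. The function and driver names are made up for illustration (the real code lives in src/backend/utils/adt/numutils.c); it demonstrates the single unsigned comparison that checks both digit bounds at once, the overflow guard applied before each multiply, and the unsigned accumulator that lets the magnitude of INT32_MIN be represented:

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

/*
 * Standalone sketch of the fastpath: base-10 only, no leading '+', no
 * whitespace, no underscores; anything else is left to a "slow path"
 * (here simply reported as unparsed).
 */
static int
parse_int32_fastpath(const char *s, int32_t *result)
{
    const char *ptr = s;
    uint32_t tmp;
    unsigned char digit;
    int neg = 0;

    if (*ptr == '-')
    {
        ptr++;
        neg = 1;
    }

    /*
     * For non-digit characters, (*ptr - '0') wraps around to a large
     * unsigned value, so one comparison checks both bounds.
     */
    digit = (unsigned char) (*ptr - '0');
    if (digit >= 10)
        return 0;               /* no digit: defer to the slow path */
    ptr++;
    tmp = digit;

    for (;;)
    {
        digit = (unsigned char) (*ptr - '0');
        if (digit >= 10)
            break;
        ptr++;

        /* guard before multiplying: keep tmp * 10 + 9 inside uint32 */
        if (tmp > (uint32_t) (-(INT32_MIN / 10)))
            return 0;           /* out of range */
        tmp = tmp * 10 + digit;
    }

    if (*ptr != '\0')
        return 0;               /* trailing chars: defer to the slow path */

    if (neg)
    {
        /* the magnitude of INT32_MIN fits in uint32 but not in int32 */
        if (tmp > (uint32_t) (-(INT32_MIN + 1)) + 1)
            return 0;
        *result = (tmp > (uint32_t) INT32_MAX) ? INT32_MIN : -(int32_t) tmp;
        return 1;
    }

    if (tmp > (uint32_t) INT32_MAX)
        return 0;
    *result = (int32_t) tmp;
    return 1;
}

int
main(void)
{
    const char *tests[] = {"12345", "-2147483648", "2147483648", "12_34"};

    for (int i = 0; i < 4; i++)
    {
        int32_t v = 0;

        if (parse_int32_fastpath(tests[i], &v))
            printf("%-12s -> %d\n", tests[i], v);
        else
            printf("%-12s -> slow path / out of range\n", tests[i]);
    }
    return 0;
}

Compiled and run, it accepts "12345" and "-2147483648", rejects "2147483648" as out of range, and defers "12_34" (underscore separator) to the slow path, mirroring the patch's division of labour.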
On Tue, 1 Aug 2023 at 15:01, David Rowley <dgrowleyml@gmail.com> wrote:
Here's a patch with an else condition when the first digit check fails.
master + fastpath4.patch:
latency average = 1579.576 ms
latency average = 1572.716 ms
latency average = 1563.398 ms
(appears slightly faster than fastpath3.patch)
Running the new test on slightly older Intel hardware (i9-9900K, gcc
11.3), I get the following:
REL_15_STABLE
latency average = 1687.695 ms
latency stddev = 3.068 ms
REL_16_STABLE
latency average = 1931.756 ms
latency stddev = 2.065 ms
REL_16_STABLE + pg_strtoint_fastpath1.patch
latency average = 1635.731 ms
latency stddev = 3.074 ms
REL_16_STABLE + pg_strtoint_fastpath2.patch
latency average = 1687.303 ms
latency stddev = 4.243 ms
REL_16_STABLE + pg_strtoint_fastpath3.patch
latency average = 1610.307 ms
latency stddev = 2.193 ms
REL_16_STABLE + pg_strtoint_fastpath4.patch
latency average = 1577.604 ms
latency stddev = 4.060 ms
HEAD
latency average = 1868.737 ms
latency stddev = 6.114 ms
HEAD + pg_strtoint_fastpath1.patch
latency average = 1683.215 ms
latency stddev = 2.322 ms
HEAD + pg_strtoint_fastpath2.patch
latency average = 1650.014 ms
latency stddev = 3.920 ms
HEAD + pg_strtoint_fastpath3.patch
latency average = 1670.337 ms
latency stddev = 5.074 ms
HEAD + pg_strtoint_fastpath4.patch
latency average = 1653.568 ms
latency stddev = 8.224 ms
I don't know why HEAD and v16 aren't consistent, but it seems to be
quite reproducible, even though the numutils source is the same in
both branches, and using gdb to dump the disassembly for
pg_strtoint32_safe() shows that it's also the same.
Anyway, insofar as these results can be trusted, fastpath4.patch looks good.
Regards,
Dean
On Wed, 2 Aug 2023 at 07:38, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
Running the new test on slightly older Intel hardware (i9-9900K, gcc
11.3), I get the following:
Thanks for running those tests. I've now pushed the fastpath4.patch
after making a few adjustments to the header comments to mention the
new stuff that was added in v16.
I don't know why HEAD and v16 aren't consistent, but it seems to be
quite reproducible, even though the numutils source is the same in
both branches, and using gdb to dump the disassembly for
pg_strtoint32_safe() shows that it's also the same.
I also see it's inconsistent, but the other way around. Here are some
fresh tests with master and REL_16_STABLE with the committed code and
the version directly prior to the commit:
master @ 3845577cb
latency average = 1575.879 ms
6.79% postgres [.] pg_strtoint32_safe
master~1
latency average = 1968.004 ms
14.28% postgres [.] pg_strtoint32_safe
REL_16_STABLE
latency average = 1735.163 ms
6.04% postgres [.] pg_strtoint32_safe
REL_16_STABLE~1
latency average = 2188.186 ms
13.83% postgres [.] pg_strtoint32_safe
David
On Wed, 2 Aug 2023 at 12:25, David Rowley <dgrowleyml@gmail.com> wrote:
master @ 3845577cb
latency average = 1575.879 ms
6.79% postgres [.] pg_strtoint32_safe
master~1
latency average = 1968.004 ms
14.28% postgres [.] pg_strtoint32_safe
REL_16_STABLE
latency average = 1735.163 ms
6.04% postgres [.] pg_strtoint32_safe
REL_16_STABLE~1
latency average = 2188.186 ms
13.83% postgres [.] pg_strtoint32_safe
And just to complete that set, here's the REL_15_STABLE performance
using the same test:
latency average = 1829.108 ms
15.46% postgres [.] pg_strtoint32
So, it looks like this item can be closed off. I'll hold off from
doing that for a few days just in case anyone else wants to give
feedback or test themselves.
David
On Wed, 2 Aug 2023 at 13:35, David Rowley <dgrowleyml@gmail.com> wrote:
So, it looks like this item can be closed off. I'll hold off from
doing that for a few days just in case anyone else wants to give
feedback or test themselves.
Alright, closed.
David
On Mon, Aug 7, 2023 at 3:16 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 2 Aug 2023 at 13:35, David Rowley <dgrowleyml@gmail.com> wrote:
So, it looks like this item can be closed off. I'll hold off from
doing that for a few days just in case anyone else wants to give
feedback or test themselves.
Alright, closed.
IIUC the problem with multiple concurrent COPY is not resolved yet.
I've run the same benchmark that I used for the first report:
* PG15 (cb2ae5741f)
nclients = 1, execution time = 15.213
nclients = 2, execution time = 9.470
nclients = 4, execution time = 6.508
nclients = 8, execution time = 4.526
nclients = 16, execution time = 3.788
nclients = 32, execution time = 3.837
nclients = 64, execution time = 4.286
nclients = 128, execution time = 4.388
nclients = 256, execution time = 4.333
* PG16 (67a007dc0c)
nclients = 1, execution time = 14.494
nclients = 2, execution time = 12.962
nclients = 4, execution time = 17.757
nclients = 8, execution time = 10.865
nclients = 16, execution time = 7.371
nclients = 32, execution time = 4.929
nclients = 64, execution time = 2.212
nclients = 128, execution time = 2.020
nclients = 256, execution time = 2.196
The result of nclients = 1 became better thanks to recent fixes, but
there still seems to be a performance regression at nclients = 2~16
(on RHEL 8 and 9). Andres reported[1] that after changing
MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
would not be the solution, as he mentioned.
Regards,
[1]: /messages/by-id/20230711185159.v2j5vnyrtodnwhgz@awork3.anarazel.de
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
On Mon, Aug 7, 2023 at 3:16 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 2 Aug 2023 at 13:35, David Rowley <dgrowleyml@gmail.com> wrote:
So, it looks like this item can be closed off. I'll hold off from
doing that for a few days just in case anyone else wants to give
feedback or test themselves.
Alright, closed.
IIUC the problem with multiple concurrent COPY is not resolved yet.
Yea - it was just hard to analyze until the other regressions were fixed.
The result of nclients = 1 became better thanks to recent fixes, but
there still seems to be a performance regression at nclients = 2~16
(on RHEL 8 and 9). Andres reported[1] that after changing
MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
would not be the solution, as he mentioned.
I think there could be a quite simple fix: Track by how much we've extended
the relation previously in the same bistate. If we already extended by many
blocks, it's very likely that we'll do so further.
A simple prototype patch attached. The results for me are promising. I copied
a smaller file [1], to have more accurate throughput results at shorter runs
(15s).
HEAD before:
clients tps
1 41
2 76
4 136
8 248
16 360
32 375
64 317
HEAD after:
clients tps
1 43
2 80
4 155
8 280
16 369
32 405
64 344
Any chance you could run your benchmark? I don't see as much of a regression vs 16
as you...
Greetings,
Andres Freund
[1]: COPY (SELECT generate_series(1, 100000)) TO '/tmp/data.copy';
Attachments:
remember_extension_size.difftext/x-diff; charset=us-asciiDownload
diff --git i/src/include/access/hio.h w/src/include/access/hio.h
index 228433ee4a2..5ae39aec7f8 100644
--- i/src/include/access/hio.h
+++ w/src/include/access/hio.h
@@ -41,6 +41,7 @@ typedef struct BulkInsertStateData
*/
BlockNumber next_free;
BlockNumber last_free;
+ uint32 already_extended_by;
} BulkInsertStateData;
diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 7ed72abe597..6a66214a580 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -1776,6 +1776,7 @@ GetBulkInsertState(void)
bistate->current_buf = InvalidBuffer;
bistate->next_free = InvalidBlockNumber;
bistate->last_free = InvalidBlockNumber;
+ bistate->already_extended_by = 0;
return bistate;
}
diff --git i/src/backend/access/heap/hio.c w/src/backend/access/heap/hio.c
index c275b08494d..a2be4273df1 100644
--- i/src/backend/access/heap/hio.c
+++ w/src/backend/access/heap/hio.c
@@ -283,6 +283,13 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
*/
extend_by_pages += extend_by_pages * waitcount;
+ /*
+ * If we previously extended using the same bistate, it's very likely
+ * we'll extend some more. Try to extend by as many pages as before.
+ */
+ if (bistate)
+ extend_by_pages = Max(extend_by_pages, bistate->already_extended_by);
+
/*
* Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
* them all concurrently.
@@ -409,6 +416,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
/* maintain bistate->current_buf */
IncrBufferRefCount(buffer);
bistate->current_buf = buffer;
+ bistate->already_extended_by += extend_by_pages;
}
return buffer;
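To make the behavioural change concrete, here is a toy standalone model of the computation; MAX_BUFFERS_TO_EXTEND_BY matches the cap in hio.c, but everything else is a simplified stand-in for RelationAddBlocks(), not actual PostgreSQL code:

#include <stdio.h>

#define MAX_BUFFERS_TO_EXTEND_BY 64     /* same cap as hio.c */

/* toy stand-in for the extend-by computation in RelationAddBlocks() */
static int
extend_by(int needed_pages, int lock_waiters, int *already_extended_by)
{
    int pages = needed_pages;

    /* pre-existing logic: scale up with extension-lock waiters */
    pages += pages * lock_waiters;

    /* the patch: never extend by less than this bistate already did */
    if (pages < *already_extended_by)
        pages = *already_extended_by;

    if (pages > MAX_BUFFERS_TO_EXTEND_BY)
        pages = MAX_BUFFERS_TO_EXTEND_BY;

    *already_extended_by += pages;
    return pages;
}

int
main(void)
{
    int prior = 0;

    /*
     * COPY of narrow rows keeps requesting few pages; with intermittent
     * waiters the old code flip-flopped between small and large
     * extensions (and thus between pwritev() and fallocate() in
     * mdzeroextend()).  With the patch the extension size ratchets up.
     */
    for (int i = 0; i < 6; i++)
        printf("extension %d: %d pages\n", i, extend_by(1, i % 2, &prior));
    return 0;
}

Run, it prints extensions of 1, 2, 3, 6, 12 and 24 pages: once a bistate has extended by a lot, it keeps doing so rather than flapping back to tiny extensions whenever the waiter count happens to be zero.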
On Tue, Aug 8, 2023 at 3:10 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
On Mon, Aug 7, 2023 at 3:16 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 2 Aug 2023 at 13:35, David Rowley <dgrowleyml@gmail.com> wrote:
So, it looks like this item can be closed off. I'll hold off from
doing that for a few days just in case anyone else wants to give
feedback or test themselves.
Alright, closed.
IIUC the problem with multiple concurrent COPY is not resolved yet.
Yea - it was just hard to analyze until the other regressions were fixed.
The result of nclients = 1 became better thanks to recent fixes, but
there still seems to be a performance regression at nclients = 2~16
(on RHEL 8 and 9). Andres reported[1] that after changing
MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
would not be the solution, as he mentioned.
I think there could be a quite simple fix: Track by how much we've extended
the relation previously in the same bistate. If we already extended by many
blocks, it's very likely that we'll do so further.
A simple prototype patch attached. The results for me are promising. I copied
a smaller file [1], to have more accurate throughput results at shorter runs
(15s).
Thank you for the patch!
HEAD before:
clients tps
1 41
2 76
4 136
8 248
16 360
32 375
64 317
HEAD after:
clients tps
1 43
2 80
4 155
8 280
16 369
32 405
64 344
Any chance you could run your benchmark? I don't see as much of a regression vs 16
as you...
Sure. The results are promising for me too:
nclients = 1, execution time = 13.743
nclients = 2, execution time = 7.552
nclients = 4, execution time = 4.758
nclients = 8, execution time = 3.035
nclients = 16, execution time = 2.172
nclients = 32, execution time = 1.959
nclients = 64, execution time = 1.819
nclients = 128, execution time = 1.583
nclients = 256, execution time = 1.631
Here are results of the same benchmark test you used:
w/o patch:
clients tps
1 66.702
2 59.696
4 73.783
8 168.115
16 400.134
32 574.098
64 565.373
128 526.303
256 591.751
w/ patch:
clients tps
1 67.735
2 122.534
4 240.707
8 398.944
16 541.097
32 643.083
64 614.775
128 616.007
256 577.885
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote:
I think there could be a quite simple fix: Track by how much we've extended
the relation previously in the same bistate. If we already extended by many
blocks, it's very likely that we'll do so further.
A simple prototype patch attached. The results for me are promising. I copied
a smaller file [1], to have more accurate throughput results at shorter runs
(15s).
Thank you for the patch!
Attached is a mildly updated version of that patch. No functional changes,
just polished comments and added a commit message.
Any chance you could run your benchmark? I don't see as much of a regression vs 16
as you...
Sure. The results are promising for me too:
nclients = 1, execution time = 13.743
nclients = 2, execution time = 7.552
nclients = 4, execution time = 4.758
nclients = 8, execution time = 3.035
nclients = 16, execution time = 2.172
nclients = 32, execution time = 1.959
nclients = 64, execution time = 1.819
nclients = 128, execution time = 1.583
nclients = 256, execution time = 1.631
Nice. We are consistently better than both 15 and "post integer parsing 16".
I'm really a bit baffled at myself for not using this approach from the get
go.
This also would make it much more beneficial to use a BulkInsertState in
nodeModifyTable.c, even without converting to table_multi_insert().
I was tempted to optimize RelationAddBlocks() a bit, by not calling
RelationExtensionLockWaiterCount() if we are already extending by
MAX_BUFFERS_TO_EXTEND_BY. Before this patch, it was pretty much impossible to
reach that case, because of the MAX_BUFFERED_* limits in copyfrom.c, but now
it's more common. But that probably ought to be done only HEAD, not 16.
A related oddity: RelationExtensionLockWaiterCount()->LockWaiterCount() uses
an exclusive lock on the lock partition - which seems not at all necessary?
Unless somebody sees a reason not to, I'm planning to push this?
Greetings,
Andres Freund
Attachments:
v2-0001-hio-Take-number-of-prior-relation-extensions-into.patchtext/x-diff; charset=us-asciiDownload
From 26e7faad65273beefeb7b40a169c1b631e204501 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 12 Aug 2023 12:31:43 -0700
Subject: [PATCH v2] hio: Take number of prior relation extensions into account
The new relation extension logic, introduced in 00d1e02be24, could lead to
slowdowns in some scenarios. E.g., when loading narrow rows into a table using
COPY, the caller of RelationGetBufferForTuple() will only request a small
number of pages. Without concurrency, we just extended using pwritev() in that
case. However, if there is *some* concurrency, we switched between extending
by a small number of pages and a larger number of pages, depending on the
number of waiters for the relation extension logic. However, some
filesystems, XFS in particular, do not perform well when switching between
extending files using fallocate() and pwritev().
To avoid that issue, remember the number of prior relation extensions in
BulkInsertState and extend more aggressively if there were prior relation
extensions. That not just avoids the aforementioned slowdown, but also leads
to noticeable performance gains in other situations, primarily due to
extending more aggressively when there is no concurrency. I should have done
it this way from the get go.
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com
Backpatch: 16-, where the new relation extension code was added
---
src/include/access/hio.h | 13 ++++++++++---
src/backend/access/heap/heapam.c | 1 +
src/backend/access/heap/hio.c | 19 +++++++++++++++++++
3 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 228433ee4a2..9bc563b7628 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -32,15 +32,22 @@ typedef struct BulkInsertStateData
Buffer current_buf; /* current insertion target page */
/*
- * State for bulk extensions. Further pages that were unused at the time
- * of the extension. They might be in use by the time we use them though,
- * so rechecks are needed.
+ * State for bulk extensions.
+ *
+ * last_free..next_free are further pages that were unused at the time of
+ * the last extension. They might be in use by the time we use them
+ * though, so rechecks are needed.
*
* XXX: Eventually these should probably live in RelationData instead,
* alongside targetblock.
+ *
+ * already_extended_by is the number of pages that this bulk insert
+ * extended by. If we already extended by a significant number of pages,
+ * we can be more aggressive about extending going forward.
*/
BlockNumber next_free;
BlockNumber last_free;
+ uint32 already_extended_by;
} BulkInsertStateData;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7ed72abe597..6a66214a580 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1776,6 +1776,7 @@ GetBulkInsertState(void)
bistate->current_buf = InvalidBuffer;
bistate->next_free = InvalidBlockNumber;
bistate->last_free = InvalidBlockNumber;
+ bistate->already_extended_by = 0;
return bistate;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index c275b08494d..21f808fecb5 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -283,6 +283,24 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
*/
extend_by_pages += extend_by_pages * waitcount;
+ /* ---
+ * If we previously extended using the same bistate, it's very likely
+ * we'll extend some more. Try to extend by as many pages as
+ * before. This can be important for performance for several reasons,
+ * including:
+ *
+ * - It prevents mdzeroextend() switching between extending the
+ * relation in different ways, which is inefficient for some
+ * filesystems.
+ *
+ * - Contention is often intermittent. Even if we currently don't see
+ * other waiters (see above), extending by larger amounts can
+ * prevent future contention.
+ * ---
+ */
+ if (bistate)
+ extend_by_pages = Max(extend_by_pages, bistate->already_extended_by);
+
/*
* Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
* them all concurrently.
@@ -409,6 +427,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
/* maintain bistate->current_buf */
IncrBufferRefCount(buffer);
bistate->current_buf = buffer;
+ bistate->already_extended_by += extend_by_pages;
}
return buffer;
--
2.38.0
Hello,
On 2023-Aug-12, Andres Freund wrote:
On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote:
Any chance you could run your benchmark? I don't see as much of a regression vs 16
as you...
Sure. The results are promising for me too:
nclients = 1, execution time = 13.743
nclients = 2, execution time = 7.552
nclients = 4, execution time = 4.758
nclients = 8, execution time = 3.035
nclients = 16, execution time = 2.172
nclients = 32, execution time = 1.959
nclients = 64, execution time = 1.819
nclients = 128, execution time = 1.583
nclients = 256, execution time = 1.631
Nice. We are consistently better than both 15 and "post integer parsing 16".
Since the wins from this patch were replicated and it has been pushed, I
understand that this open item can be marked as closed, so I've done
that.
Thanks,
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Hay dos momentos en la vida de un hombre en los que no debería
especular: cuando puede permitírselo y cuando no puede" (Mark Twain)
On 2023-08-16 13:15:46 +0200, Alvaro Herrera wrote:
Since the wins from this patch were replicated and it has been pushed, I
understand that this open item can be marked as closed, so I've done
that.
Thanks!
Andres Freund <andres@anarazel.de> writes:
On 2023-08-16 13:15:46 +0200, Alvaro Herrera wrote:
Since the wins from this patch were replicated and it has been pushed, I
understand that this open item can be marked as closed, so I've done
that.
Thanks!
It turns out that this patch is what's making buildfarm member
chipmunk fail in contrib/pg_visibility [1]. That's easily reproduced
by running the test with shared_buffers = 10MB. I didn't dig further
than the "git bisect" result:
$ git bisect bad
82a4edabd272f70d044faec8cf7fd1eab92d9991 is the first bad commit
commit 82a4edabd272f70d044faec8cf7fd1eab92d9991
Author: Andres Freund <andres@anarazel.de>
Date: Mon Aug 14 09:54:03 2023 -0700
hio: Take number of prior relation extensions into account
but I imagine the problem is that the patch's more aggressive
relation-extension heuristic is causing the table to have more
pages than the test case expects. Is it really a good idea
for such a heuristic to depend on shared_buffers? If you don't
want to change the heuristic then we'll have to find a way to
tweak this test to avoid it.
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-09-06%2014%3A14%3A51
Hi,
On 2023-09-06 18:01:53 -0400, Tom Lane wrote:
It turns out that this patch is what's making buildfarm member
chipmunk fail in contrib/pg_visibility [1]. That's easily reproduced
by running the test with shared_buffers = 10MB. I didn't dig further
than the "git bisect" result:
At first I was a bit confounded by not being able to reproduce this. My test
environment had max_connections=110 for some other reason - and the problem
doesn't reproduce with that setting...
$ git bisect bad
82a4edabd272f70d044faec8cf7fd1eab92d9991 is the first bad commit
commit 82a4edabd272f70d044faec8cf7fd1eab92d9991
Author: Andres Freund <andres@anarazel.de>
Date: Mon Aug 14 09:54:03 2023 -0700
hio: Take number of prior relation extensions into account
but I imagine the problem is that the patch's more aggressive
relation-extension heuristic is causing the table to have more
pages than the test case expects. Is it really a good idea
for such a heuristic to depend on shared_buffers?
The heuristic doesn't directly depend on shared buffers. However, the amount
we extend by is limited by needing to pin shared buffers covering all the
newly extended buffers.
That's what ends up limiting things here - shared_buffers = 10MB and
max_connections = 10 doesn't allow for a lot of buffers to be pinned
concurrently in each backend. Although perhaps LimitAdditionalPins() is a bit
too conservative, due to not checking the private refcount array and just
assuming REFCOUNT_ARRAY_ENTRIES.
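A back-of-the-envelope model shows the effect; the shape follows LimitAdditionalPins() in bufmgr.c, but the auxiliary-process allowance below is a guess and PrivateRefCountOverflowed is ignored:

#include <stdio.h>

#define BLCKSZ                 8192
#define REFCOUNT_ARRAY_ENTRIES 8    /* assumed fully used, as in bufmgr.c */

static int
pin_limit(long shared_buffers_bytes, int max_backends_incl_aux)
{
    int nbuffers = (int) (shared_buffers_bytes / BLCKSZ);
    int limit = nbuffers / max_backends_incl_aux;

    /*
     * bufmgr assumes the private refcount array is full rather than
     * counting the entries actually in use.
     */
    limit -= REFCOUNT_ARRAY_ENTRIES;
    return (limit > 0) ? limit : 1;
}

int
main(void)
{
    long sb = 10L * 1024 * 1024;    /* chipmunk: shared_buffers = 10MB */

    /* "+ 5" is a guessed allowance for auxiliary processes */
    printf("max_connections=10:  ~%d additional pins\n", pin_limit(sb, 10 + 5));
    printf("max_connections=110: ~%d additional pins\n", pin_limit(sb, 110 + 5));
    return 0;
}

That works out to roughly 77 additional pins per backend at max_connections = 10 but only about 3 at max_connections = 110, which would explain why the failure doesn't reproduce with the higher setting: extension is clamped so hard that the resulting table layout matches the test's expectations.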
If you don't want to change the heuristic then we'll have to find a way to
tweak this test to avoid it.
We could tweak LimitAdditionalPins() by checking PrivateRefCountArray instead
of assuming the worst-case REFCOUNT_ARRAY_ENTRIES.
However, it seems that the logic in the test is pretty fragile independent of
this issue? Different alignment, block size or an optimization of the page
layout could also break the test?
Unfortunately a query that doesn't falsely alert in this case is a bit ugly,
due to needing to deal with the corner case of an empty page at the end:
select *
from pg_visibility_map('copyfreeze')
where
(not all_visible or not all_frozen)
-- deal with trailing empty pages due to potentially bulk-extending too aggressively
and exists(SELECT * FROM copyfreeze WHERE ctid >= ('('||blkno||', 0)')::tid)
;
Before 82a4edabd27 this situation was rare - you'd have needed contended
extensions. But after it has become more common. I worry that that might cause
other issues :(. OTOH, I think we'll need to extend way more aggressively at
some point...
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-09-06 18:01:53 -0400, Tom Lane wrote:
It turns out that this patch is what's making buildfarm member
chipmunk fail in contrib/pg_visibility [1]. That's easily reproduced
by running the test with shared_buffers = 10MB. I didn't dig further
than the "git bisect" result:
At first I was a bit confounded by not being able to reproduce this. My test
environment had max_connections=110 for some other reason - and the problem
doesn't reproduce with that setting...
I just did a git bisect run to discover when the failure documented
in bug #18130 [1] started. And the answer is commit 82a4edabd.
Now, it's pretty obvious that that commit didn't in itself cause
problems like this:
postgres=# \copy test from 'bug18130.csv' csv
ERROR: could not read block 5 in file "base/5/17005": read only 0 of 8192 bytes
CONTEXT: COPY test, line 472: "0,185647715,222655,489637,2,2023-07-31,9100.0000000,302110385,2023-07-30 14:16:36.750981+00,14026347..."
IMO there must be some very nasty bug lurking in the new
multiple-block extension logic, that happens to be exposed by this
test case with 82a4edabd's adjustments to the when-to-extend choices
but wasn't before that.
To save other people the trouble of extracting the in-line data
in the bug submission, I've attached the test files I was using.
The DDL is simplified slightly from what was submitted. I'm not
entirely sure why a no-op trigger is needed to provoke the bug...
regards, tom lane
[1]: /messages/by-id/18130-7a86a7356a75209d@postgresql.org
Hi,
On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
I just did a git bisect run to discover when the failure documented
in bug #18130 [1] started. And the answer is commit 82a4edabd.
Now, it's pretty obvious that that commit didn't in itself cause
problems like this:
postgres=# \copy test from 'bug18130.csv' csv
ERROR: could not read block 5 in file "base/5/17005": read only 0 of 8192 bytes
CONTEXT: COPY test, line 472: "0,185647715,222655,489637,2,2023-07-31,9100.0000000,302110385,2023-07-30 14:16:36.750981+00,14026347..."
Ugh.
IMO there must be some very nasty bug lurking in the new
multiple-block extension logic, that happens to be exposed by this
test case with 82a4edabd's adjustments to the when-to-extend choices
but wasn't before that.
To save other people the trouble of extracting the in-line data
in the bug submission, I've attached the test files I was using.
Thanks, looking at this now.
The DDL is simplified slightly from what was submitted. I'm not
entirely sure why a no-op trigger is needed to provoke the bug...
A trigger might prevent using the multi-insert API, which would lead to
different execution paths...
Greetings,
Andres Freund
Hi,
On 2023-09-25 12:48:30 -0700, Andres Freund wrote:
On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
I just did a git bisect run to discover when the failure documented
in bug #18130 [1] started. And the answer is commit 82a4edabd.
Now, it's pretty obvious that that commit didn't in itself cause
problems like this:
postgres=# \copy test from 'bug18130.csv' csv
ERROR: could not read block 5 in file "base/5/17005": read only 0 of 8192 bytes
CONTEXT: COPY test, line 472: "0,185647715,222655,489637,2,2023-07-31,9100.0000000,302110385,2023-07-30 14:16:36.750981+00,14026347..."
Ugh.
IMO there must be some very nasty bug lurking in the new
multiple-block extension logic, that happens to be exposed by this
test case with 82a4edabd's adjustments to the when-to-extend choices
but wasn't before that.To save other people the trouble of extracting the in-line data
in the bug submission, I've attached the test files I was using.
Thanks, looking at this now.
(had to switch locations in between)
Uh, huh. The problem is that COPY uses a single BulkInsertState for multiple
partitions. Which to me seems to run counter to the following comment:
* The caller can also provide a BulkInsertState object to optimize many
* insertions into the same relation. This keeps a pin on the current
* insertion target page (to save pin/unpin cycles) and also passes a
* BULKWRITE buffer selection strategy object to the buffer manager.
* Passing NULL for bistate selects the default behavior.
The reason this doesn't cause straight up corruption due to reusing a pin from
another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a
call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
insertion state, which is what leads to the errors from the bug report.
Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
sure that's the right fix. ISTM that independent of whether we fix this via
ReleaseBulkInsertStatePin() resetting the fields or via not reusing
BulkInsertState, we should add assertions defending against future issues like
this (e.g. by adding a relation field to BulkInsertState in cassert builds,
and asserting that the relation is the same as in prior calls unless
ReleaseBulkInsertStatePin() has been called).
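A standalone mock of that assertion idea (every type and name below is invented for illustration; patch 0002 later in the thread implements a variant using BufferGetTag() instead of a new field):

#include <assert.h>
#include <stdio.h>

typedef unsigned int Oid;       /* illustrative stand-ins */
#define InvalidOid ((Oid) 0)

typedef struct MockBulkInsertState
{
    Oid last_relid;             /* hypothetical cassert-only field */
} MockBulkInsertState;

/* the reset that the actual fix adds to ReleaseBulkInsertStatePin() */
static void
mock_release_pin(MockBulkInsertState *bistate)
{
    bistate->last_relid = InvalidOid;
}

/* stands in for the buffer lookup done on behalf of a relation */
static void
mock_get_buffer(MockBulkInsertState *bistate, Oid relid)
{
    /* catch silent reuse of the bistate for a different relation */
    assert(bistate->last_relid == InvalidOid ||
           bistate->last_relid == relid);
    bistate->last_relid = relid;
}

int
main(void)
{
    MockBulkInsertState bi = {InvalidOid};

    mock_get_buffer(&bi, 16384);    /* partition A */
    mock_release_pin(&bi);          /* switching partitions: reset */
    mock_get_buffer(&bi, 16385);    /* partition B is now fine */
    printf("no assertion tripped\n");
    return 0;
}

Omitting the mock_release_pin() call before switching relations would trip the assertion, which is exactly the class of silent reuse that caused #18130.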
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
I just did a git bisect run to discover when the failure documented
in bug #18130 [1] started. And the answer is commit 82a4edabd.
Uh, huh. The problem is that COPY uses a single BulkInsertState for multiple
partitions. Which to me seems to run counter to the following comment:
* The caller can also provide a BulkInsertState object to optimize many
* insertions into the same relation. This keeps a pin on the current
* insertion target page (to save pin/unpin cycles) and also passes a
* BULKWRITE buffer selection strategy object to the buffer manager.
* Passing NULL for bistate selects the default behavior.
The reason this doesn't cause straight up corruption due to reusing a pin from
another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a
call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
insertion state, which is what leads to the errors from the bug report.
Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
sure that's the right fix. ISTM that independent of whether we fix this via
ReleaseBulkInsertStatePin() resetting the fields or via not reusing
BulkInsertState, we should add assertions defending against future issues like
this (e.g. by adding a relation field to BulkInsertState in cassert builds,
and asserting that the relation is the same as in prior calls unless
ReleaseBulkInsertStatePin() has been called).
Ping? We really ought to have a fix for this committed in time for
16.1.
regards, tom lane
Hi,
On 2023-10-12 11:44:09 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
I just did a git bisect run to discover when the failure documented
in bug #18130 [1] started. And the answer is commit 82a4edabd.
Uh, huh. The problem is that COPY uses a single BulkInsertState for multiple
partitions. Which to me seems to run counter to the following comment:
* The caller can also provide a BulkInsertState object to optimize many
* insertions into the same relation. This keeps a pin on the current
* insertion target page (to save pin/unpin cycles) and also passes a
* BULKWRITE buffer selection strategy object to the buffer manager.
* Passing NULL for bistate selects the default behavior.
The reason this doesn't cause straight up corruption due to reusing a pin from
another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a
call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
insertion state, which is what leads to the errors from the bug report.
Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
sure that's the right fix. ISTM that independent of whether we fix this via
ReleaseBulkInsertStatePin() resetting the fields or via not reusing
BulkInsertState, we should add assertions defending against future issues like
this (e.g. by adding a relation field to BulkInsertState in cassert builds,
and asserting that the relation is the same as in prior calls unless
ReleaseBulkInsertStatePin() has been called).
Ping? We really ought to have a fix for this committed in time for
16.1.
I kind of had hoped somebody would comment on the approach. Given that nobody
has, I'll push the minimal fix of resetting the state in
ReleaseBulkInsertStatePin(), even though I think architecturally that's not
great.
Greetings,
Andres Freund
Hi,
On 2023-10-12 09:24:19 -0700, Andres Freund wrote:
On 2023-10-12 11:44:09 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
I just did a git bisect run to discover when the failure documented
in bug #18130 [1] started. And the answer is commit 82a4edabd.
Uh, huh. The problem is that COPY uses a single BulkInsertState for multiple
partitions. Which to me seems to run counter to the following comment:
* The caller can also provide a BulkInsertState object to optimize many
* insertions into the same relation. This keeps a pin on the current
* insertion target page (to save pin/unpin cycles) and also passes a
* BULKWRITE buffer selection strategy object to the buffer manager.
* Passing NULL for bistate selects the default behavior.
The reason this doesn't cause straight up corruption due to reusing a pin from
another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a
call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
insertion state, which is what leads to the errors from the bug report.
Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
sure that's the right fix. ISTM that independent of whether we fix this via
ReleaseBulkInsertStatePin() resetting the fields or via not reusing
BulkInsertState, we should add assertions defending against future issues like
this (e.g. by adding a relation field to BulkInsertState in cassert builds,
and asserting that the relation is the same as in prior calls unless
ReleaseBulkInsertStatePin() has been called).
Ping? We really ought to have a fix for this committed in time for
16.1.
I kind of had hoped somebody would comment on the approach. Given that nobody
has, I'll push the minimal fix of resetting the state in
ReleaseBulkInsertStatePin(), even though I think architecturally that's not
great.
I spent some time working on a test that shows the problem more cheaply than
the case upthread. I think it'd be desirable to have a test that's likely to
catch an issue like this fairly quickly. We've had other problems in this
realm before - there's only a single test that fails if I remove the
ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all.
I'm a bit on the fence on how large to make the relation. For me the bug
triggers when filling both relations up to the 3rd block, but how many rows
that takes is somewhat dependent on space utilization on the page and stuff.
Right now the test uses data/desc.data and ends up with 328kB and 312kB in two
partitions. Alternatively I could make the test create a new file to load with
copy that has fewer rows than data/desc.data - I didn't see another data file
that works conveniently and has fewer rows. The copy is reasonably fast, even
under something as expensive as rr (~60ms). So I'm inclined to just go with
that?
Greetings,
Andres Freund
Hi,
On 2023-10-13 10:39:10 -0700, Andres Freund wrote:
On 2023-10-12 09:24:19 -0700, Andres Freund wrote:
I kind of had hoped somebody would comment on the approach. Given that nobody
has, I'll push the minimal fix of resetting the state in
ReleaseBulkInsertStatePin(), even though I think architecturally that's not
great.
I spent some time working on a test that shows the problem more cheaply than
the case upthread. I think it'd be desirable to have a test that's likely to
catch an issue like this fairly quickly. We've had other problems in this
realm before - there's only a single test that fails if I remove the
ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all.
I'm a bit on the fence on how large to make the relation. For me the bug
triggers when filling both relations up to the 3rd block, but how many rows
that takes is somewhat dependent on space utilization on the page and stuff.
Right now the test uses data/desc.data and ends up with 328kB and 312kB in two
partitions. Alternatively I could make the test create a new file to load with
copy that has fewer rows than data/desc.data - I didn't see another data file
that works conveniently and has fewer rows. The copy is reasonably fast, even
under something as expensive as rr (~60ms). So I'm inclined to just go with
that?
Patch with fix and test attached (0001).
Given how easy a missing ReleaseBulkInsertStatePin() can cause corruption (not
due to this bug, but the issue fixed in b1ecb9b3fcf), I think we should
consider adding an assertion along the lines of 0002 to HEAD. Perhaps adding a
new bufmgr.c function to avoid having to get the fields in the buffer tag we
don't care about. Or perhaps we should just promote the check to an elog, we
already call BufferGetBlockNumber(), using BufferGetTag() instead doesn't cost
much more.
Greetings,
Andres Freund
Attachments:
v1-0001-Fix-bulk-table-extension-when-copying-into-multip.patchtext/x-diff; charset=us-asciiDownload
From e202cf5c3cae0551ae8fabd2629b4f9c6a0734d5 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 13 Oct 2023 10:54:16 -0700
Subject: [PATCH v1 1/2] Fix bulk table extension when copying into multiple
partitions
When COPYing into a partitioned table that does not permit the use of
table_multi_insert(), we could error out with
ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes
because BulkInsertState->next_free was not reset between partitions. This
problem occurred only when not able to use table_multi_insert(), as a dedicated
BulkInsertState for each partition is used in that case.
The bug was introduced in 00d1e02be24, but it was hard to hit at that point as
commonly bulk relation extension is not used when not using
table_multi_insert(). It became more likely after 82a4edabd27, which expanded
the use of bulk extension.
To fix the bug, reset the bulk relation extension state in BulkInsertState in
ReleaseBulkInsertStatePin(). That was added (in b1ecb9b3fcf) to tackle a very
similar issue. Obviously the name is not quite correct, but there might be
external callers, and bulk insert state needs to be reset precisely in the
situations in which ReleaseBulkInsertStatePin() already needed to be called.
Medium term, the better fix is likely to disallow reusing BulkInsertState
across relations.
Add a test that, without the fix, reproduces #18130 in most
configurations. The test also catches the problem fixed in b1ecb9b3fcf when
run with small shared_buffers.
Reported-by: Ivan Kolombet <enderstd@gmail.com>
Analyzed-by: Tom Lane <tgl@sss.pgh.pa.us>
Analyzed-by: Andres Freund <andres@anarazel.de>
Bug: #18130
Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org
Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us
Backpatch: 16-
---
src/backend/access/heap/heapam.c | 11 +++++++++
src/test/regress/expected/copy.out | 37 ++++++++++++++++++++++++++++++
src/test/regress/sql/copy.sql | 37 ++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 88a123d38a6..51eceaca6cf 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1792,6 +1792,17 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
if (bistate->current_buf != InvalidBuffer)
ReleaseBuffer(bistate->current_buf);
bistate->current_buf = InvalidBuffer;
+
+ /*
+ * Despite the name, we also release information about bulk
+ * inserts. Otherwise we can end up erroring out due to looking for free
+ * space in ->next_free of one partition, even though ->next_free was set
+ * when extending another partition. It obviously also could be bad for
+ * efficiency to look at existing blocks at offsets from another
+ * partition, even if we don't error out.
+ */
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
}
diff --git a/src/test/regress/expected/copy.out b/src/test/regress/expected/copy.out
index a5912c13a8c..b48365ec981 100644
--- a/src/test/regress/expected/copy.out
+++ b/src/test/regress/expected/copy.out
@@ -257,3 +257,40 @@ ERROR: value too long for type character varying(5)
\.
invalid command \.
drop table oversized_column_default;
+--
+-- Create partitioned table that does not allow bulk insertions, to test bugs
+-- related to the reuse of BulkInsertState across partitions (only done when
+-- not using bulk insert). Switching between partitions often makes it more
+-- likely to encounter these bugs, so we just switch on roughly every insert
+-- by having an even/odd number partition and inserting evenly distributed
+-- data.
+--
+CREATE TABLE parted_si (
+ id int not null,
+ data text not null,
+ -- prevent use of bulk insert by having a volatile function
+ rand float8 not null default random()
+)
+PARTITION BY LIST((id % 2));
+CREATE TABLE parted_si_p_even PARTITION OF parted_si FOR VALUES IN (0);
+CREATE TABLE parted_si_p_odd PARTITION OF parted_si FOR VALUES IN (1);
+-- Test that bulk relation extension handles reusing a single BulkInsertState
+-- across partitions. Without the fix applied, this reliably reproduces
+-- #18130 unless shared_buffers is extremely small (preventing any use of
+-- bulk relation extension). See
+-- https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org
+-- https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us
+\set filename :abs_srcdir '/data/desc.data'
+COPY parted_si(id, data) FROM :'filename';
+-- An earlier bug (see commit b1ecb9b3fcf) could end up using a buffer from
+-- the wrong partition. This test is *not* guaranteed to trigger that bug, but
+-- does so when shared_buffers is small enough. To test if we encountered the
+-- bug, check that the partition condition isn't violated.
+SELECT tableoid::regclass, id % 2 = 0 is_even, count(*) from parted_si GROUP BY 1, 2 ORDER BY 1;
+ tableoid | is_even | count
+------------------+---------+-------
+ parted_si_p_even | t | 5000
+ parted_si_p_odd | f | 5000
+(2 rows)
+
+DROP TABLE parted_si;
diff --git a/src/test/regress/sql/copy.sql b/src/test/regress/sql/copy.sql
index 7fdb26d14f3..43d2e906dd9 100644
--- a/src/test/regress/sql/copy.sql
+++ b/src/test/regress/sql/copy.sql
@@ -283,3 +283,40 @@ copy oversized_column_default (col2) from stdin;
copy oversized_column_default from stdin (default '');
\.
drop table oversized_column_default;
+
+
+--
+-- Create partitioned table that does not allow bulk insertions, to test bugs
+-- related to the reuse of BulkInsertState across partitions (only done when
+-- not using bulk insert). Switching between partitions often makes it more
+-- likely to encounter these bugs, so we just switch on roughly every insert
+-- by having an even/odd number partition and inserting evenly distributed
+-- data.
+--
+CREATE TABLE parted_si (
+ id int not null,
+ data text not null,
+ -- prevent use of bulk insert by having a volatile function
+ rand float8 not null default random()
+)
+PARTITION BY LIST((id % 2));
+
+CREATE TABLE parted_si_p_even PARTITION OF parted_si FOR VALUES IN (0);
+CREATE TABLE parted_si_p_odd PARTITION OF parted_si FOR VALUES IN (1);
+
+-- Test that bulk relation extension handles reusing a single BulkInsertState
+-- across partitions. Without the fix applied, this reliably reproduces
+-- #18130 unless shared_buffers is extremely small (preventing any use of
+-- bulk relation extension). See
+-- https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org
+-- https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us
+\set filename :abs_srcdir '/data/desc.data'
+COPY parted_si(id, data) FROM :'filename';
+
+-- An earlier bug (see commit b1ecb9b3fcf) could end up using a buffer from
+-- the wrong partition. This test is *not* guaranteed to trigger that bug, but
+-- does so when shared_buffers is small enough. To test if we encountered the
+-- bug, check that the partition condition isn't violated.
+SELECT tableoid::regclass, id % 2 = 0 is_even, count(*) from parted_si GROUP BY 1, 2 ORDER BY 1;
+
+DROP TABLE parted_si;
--
2.38.0
v1-0002-Assert-buffer-in-ReadBufferBI-is-for-correct-rela.patchtext/x-diff; charset=us-asciiDownload
From 9297c5f00ff3230e5faf1c7dad5de7d2d5868ec9 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 13 Oct 2023 11:23:16 -0700
Subject: [PATCH v1 2/2] Assert buffer in ReadBufferBI() is for correct
relation
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/access/heap/hio.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index caa62708aa5..ed9f72379ba 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -99,6 +99,17 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
/* If we have the desired block already pinned, re-pin and return it */
if (bistate->current_buf != InvalidBuffer)
{
+#ifdef USE_ASSERT_CHECKING
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blknum;
+
+ BufferGetTag(bistate->current_buf, &rlocator, &forknum, &blknum);
+ Assert(RelFileLocatorEquals(rlocator, relation->rd_locator));
+ }
+#endif
+
if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
{
/*
--
2.38.0
Hi,
On 2023-10-13 11:30:35 -0700, Andres Freund wrote:
On 2023-10-13 10:39:10 -0700, Andres Freund wrote:
On 2023-10-12 09:24:19 -0700, Andres Freund wrote:
I kind of had hoped somebody would comment on the approach. Given that nobody
has, I'll push the minimal fix of resetting the state in
ReleaseBulkInsertStatePin(), even though I think architecturally that's not
great.
I spent some time working on a test that shows the problem more cheaply than
the case upthread. I think it'd be desirable to have a test that's likely to
catch an issue like this fairly quickly. We've had other problems in this
realm before - there's only a single test that fails if I remove the
ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all.
I'm a bit on the fence on how large to make the relation. For me the bug
triggers when filling both relations up to the 3rd block, but how many rows
that takes is somewhat dependent on space utilization on the page and stuff.
Right now the test uses data/desc.data and ends up with 328kB and 312kB in two
partitions. Alternatively I could make the test create a new file to load with
copy that has fewer rows than data/desc.data - I didn't see another data file
that works conveniently and has fewer rows. The copy is reasonably fast, even
under something as expensive as rr (~60ms). So I'm inclined to just go with
that?
Patch with fix and test attached (0001).
Pushed that.
Greetings,
Andres