COPY performance on Windows

Started by Ryohei Takahashi (Fujitsu) about 1 year ago · 15 messages
#1 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
1 attachment(s)

Hi,

I noticed that COPY performance on PG 17.0 on Windows is worse than on PG 16.4.

* Environment
OS: Windows Server 2022
CPU: 22 cores * 2 CPUs
Memory: 512GB
Storage: 700GB HDD

* Input data
10GB csv file

* Executed command
psql -c 'copy table from '\''C:\data.csv'\'' WITH csv'
(Only one psql command)

* Performance
PG 16.4: 405.2s
PG 17.0: 417.4s

* Analysis
I noticed that the commit 82a4edabd2 affects the performance.

The logic of mdzeroextend() is as follows.

if (numblocks > 8)
{
...
ret = FileFallocate(); // if HAVE_POSIX_FALLOCATE, call posix_fallocate(), else fall back to pwrite()
...
}
else
{
...
ret = FileZero(); // call pwrite()
...
}

On the XFS filesystem, switching back and forth between fallocate() and pwritev() reduces performance.
So, 82a4edabd2 makes bulk-insert callers remember how much they have already extended (bistate->already_extended_by) and never extend by fewer blocks than that again, so that once fallocate() has been called it keeps being called.

On the other hand, Windows does not have fallocate().
So, pwrite() is always used regardless of numblocks.
As a result, on Windows 82a4edabd2 just increases the number of blocks that have to be zero-written.

* Improvement
I think 82a4edabd2 is only effective on systems with HAVE_POSIX_FALLOCATE.
So, I made the attached patch.

By applying the attached patch to PG 17.0, the copy result is 401.5s.

What do you think about this?

Regards,
Ryohei Takahashi

Attachments:

001-skip-increasing-numblocks-without-fallocate-system.patch
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7c662cdf46e..b83f3211aa2 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -297,8 +297,11 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		 *   prevent future contention.
 		 * ---
 		 */
+
+#ifdef HAVE_POSIX_FALLOCATE
 		if (bistate)
 			extend_by_pages = Max(extend_by_pages, bistate->already_extended_by);
+#endif
 
 		/*
 		 * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
@@ -426,7 +429,9 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		/* maintain bistate->current_buf */
 		IncrBufferRefCount(buffer);
 		bistate->current_buf = buffer;
+#ifdef HAVE_POSIX_FALLOCATE
 		bistate->already_extended_by += extend_by_pages;
+#endif
 	}
 
 	return buffer;
#2 Aleksander Alekseev
aleksander@timescale.com
In reply to: Ryohei Takahashi (Fujitsu) (#1)
Re: COPY performance on Windows

Hi Ryohei,

Thanks for the patch. Here are my two cents.

> I noticed that COPY performance on PG 17.0 on Windows is worse than on PG 16.4.
>
> [...]
>
> By applying the attached patch to PG 17.0, the copy result is 401.5s.

So we are trading a potential 3.8% speedup in certain environments for
the increased code complexity due to a couple of added #ifdef's here.

If we really want to do this, firstly the patch should have detailed
comments in front of #ifdefs so that in 10+ years from now someone who
didn't read this thread would know what they are for.

Secondly, more detailed research should be made on how this patch
affects the performance on Windows depending on the software version
and particular choice of hardware. Perhaps what you found is not the
only and/or the most important bottleneck. Your patch may (or may not)
cause performance degradation in other setups.

Last but not least one should double check that this will not cause
performance degradation on *nix systems.

To be honest, personally I wouldn't bother for a 3.8% speedup at
best (for 10+%, maybe). That being said, perhaps you and other people
on the mailing list (reviewers, committers) feel otherwise.

--
Best regards,
Aleksander Alekseev

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Ryohei Takahashi (Fujitsu) (#1)
Re: COPY performance on Windows

Hello Takahashi-san,

I am reluctant to draw conclusions about the general performance of
this patch from one example. It seems that the performance could
depend on many things: table size, column definitions, row width,
hardware, OS version, shared_buffers, max_wal_size. I don't think we
can say from your test here that performance is always worse on
Windows. If it is, then I agree that we should think of what to do
about it; but it seems possible to me that the behavior will be
different in other circumstances.

What I don't understand is why increasing the number of blocks should
be worse. The code before the FileZero() call has a comment
explaining why a larger extension is thought to be better. If it's
wrong, we should try to figure out why it's wrong. But it seems quite
surprising that doing more work at once would be less efficient.
That's not usually how things work.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
In reply to: Robert Haas (#3)
RE: COPY performance on Windows

Hi,

Thank you for your reply.

I don't want to "speed up" the COPY command.
I just want to "prevent a slowdown" compared with PG16.

But anyway, my current analysis is not convincing.
So, I will do more analysis and get back to you.

Regards,
Ryohei Takahashi

#5 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
In reply to: Ryohei Takahashi (Fujitsu) (#4)
1 attachment(s)
RE: COPY performance on Windows

Hi,

I am continuing to investigate the COPY performance problem on Windows.

I noticed that not only PG17.0 but also PG16.6 has a performance problem compared to PG16.4.
The performance is 2.5%-5.8% worse, especially when the number of clients is 1 or 2.

I modified the performance measurement script from the thread in [1]:
* Enabled it to run in Windows Git Bash
* Enabled it to compare PG16.4, PG16.6 and PG17.0
* Increased the number of rows by 10 times (about 10GB)
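
(For readers without the attachment: a driver loop of roughly the following shape produces the output format shown below. This is only a sketch, not the attached test.sh; the port mapping, file name and data generation are placeholders.)

for ver in PG164 PG166 PG170; do
    for nclients in 1 2 4 8 16 32 64 128 256; do
        start=$(date +%s)
        for i in $(seq 1 "$nclients"); do
            # $port is assumed to select the server built from $ver
            psql -p "$port" -c "COPY test FROM 'C:/data.csv' WITH csv" &
        done
        wait
        echo "$ver: nclients = $nclients, time = $(( $(date +%s) - start ))"
    done
done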

I measured on a Windows Server 2022 machine with a 44-core CPU and 512GB of memory.
The results are as follows.

* PG16.4
PG164: nclients = 1, time = 432
PG164: nclients = 2, time = 238
PG164: nclients = 4, time = 157
PG164: nclients = 8, time = 135
PG164: nclients = 16, time = 163
PG164: nclients = 32, time = 261
PG164: nclients = 64, time = 458
PG164: nclients = 128, time = 611
PG164: nclients = 256, time = 622

* PG16.6
PG166: nclients = 1, time = 444 (2.7% worse than PG16.4)
PG166: nclients = 2, time = 252 (5.8% worse than PG16.4)
PG166: nclients = 4, time = 156
PG166: nclients = 8, time = 135
PG166: nclients = 16, time = 163
PG166: nclients = 32, time = 261
PG166: nclients = 64, time = 458
PG166: nclients = 128, time = 612
PG166: nclients = 256, time = 621

* PG17.0
PG170: nclients = 1, time = 448 (3.7% worse than PG16.4)
PG170: nclients = 2, time = 244 (2.5% worse than PG16.4)
PG170: nclients = 4, time = 159
PG170: nclients = 8, time = 137
PG170: nclients = 16, time = 165
PG170: nclients = 32, time = 262
PG170: nclients = 64, time = 458
PG170: nclients = 128, time = 611
PG170: nclients = 256, time = 621

(1)
I attach the performance measurement script.
If you have a Windows environment, could you please reproduce the same performance problem?

(2)
The performance of PG16.6 and PG17.0 is worse than that of PG16.4.
So, I think some commit between August and September affects the performance.
I will analyze those commits.

[1]: /messages/by-id/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com

Regards,
Ryohei Takahashi

Attachments:

test.sh
#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Ryohei Takahashi (Fujitsu) (#5)
2 attachment(s)
Re: COPY performance on Windows

On Thu, Dec 12, 2024 at 1:18 AM Ryohei Takahashi (Fujitsu)
<r.takahashi_2@fujitsu.com> wrote:

> The performance of PG16.6 and PG17.0 is worse than that of PG16.4.
> So, I think some commit between August and September affects the performance.
> I will analyze those commits.

If it reproduces reliably, maybe git bisect? Do you have a profiler?
Can you show the system call trace for good and bad behaviour? But I
wonder if there might just be some weird code placement variation
causing arbitrary performance changes, because nothing is jumping out
of that version range when I look at it... How do other versions, .0,
.1, .2, .3 perform? What about 15.x?
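
(A bisect run between the two tags could look roughly like this; the build-and-benchmark script is a placeholder you would have to supply, exiting non-zero when a run is "slow":

git bisect start REL_16_6 REL_16_4
git bisect run ./build-and-run-copy-benchmark.sh
)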

Just by the way, in case you are interested in the broader topic of
bulk file extension, here are some ideas that might be worth trying
out on a serious Windows server (maybe later once the unexpected
regression is understood):

1. Those code paths finish up in pg_pwritev(), but it has a loop over
8kb writes on Windows. Does it help if we just make "zbuffer" bigger?
How big?
2. While pondering the goals of posix_fallocate(), I had a
realisation about how we might implement FileFallocate() on Windows.
Does this idea work? Well?

Experiment-grade patches attached.

Attachments:

0001-Use-bigger-writes-in-pg_pwrite_zeros-on-Windows.patch
From cc5c91f1e16ad1335cb2efda67576fa419476d2a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 12 Dec 2024 15:51:21 +1300
Subject: [PATCH] Use bigger writes in pg_pwrite_zeros() on Windows.

XXX Is this helpful?
---
 src/common/file_utils.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334d..0de8b5ffb31 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -687,8 +687,16 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size, off_t offset)
 {
-	static const PGIOAlignedBlock zbuffer = {{0}};	/* worth BLCKSZ */
-	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data;
+	/*
+	 * On Windows, pg_pwritev() isn't a system call, it's a loop.  It might be
+	 * worth wasting more memory on zero buffers to get fewer loops.
+	 */
+#ifdef WIN32
+	static const PGIOAlignedBlock zbuffer[8] = {{{0}}};
+#else
+	static const PGIOAlignedBlock zbuffer[1] = {{{0}}};
+#endif
+	void	   *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer[0])->data;
 	struct iovec iov[PG_IOV_MAX];
 	size_t		remaining_size = size;
 	ssize_t		total_written = 0;
@@ -703,13 +711,8 @@ pg_pwrite_zeros(int fd, size_t size, off_t offset)
 		{
 			size_t		this_iov_size;
 
+			this_iov_size = Min(remaining_size, sizeof(zbuffer));
 			iov[iovcnt].iov_base = zerobuf_addr;
-
-			if (remaining_size < BLCKSZ)
-				this_iov_size = remaining_size;
-			else
-				this_iov_size = BLCKSZ;
-
 			iov[iovcnt].iov_len = this_iov_size;
 			remaining_size -= this_iov_size;
 		}
-- 
2.39.5

0001-Implement-FileFallocate-for-Windows.patch
From aa750fd3ba0eb4ee652bdb13a4564c33ede44e04 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 12 Dec 2024 17:32:31 +1300
Subject: [PATCH] Implement FileFallocate() for Windows.

XXX Does this work, and is it beneficial?
XXX Would the slight non-atomicity break any user in PostgreSQL?  I
doubt it...
---
 src/backend/storage/file/fd.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7c403fb360e..9b2a4a13f39 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2390,7 +2390,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
 int
 FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
 {
-#ifdef HAVE_POSIX_FALLOCATE
+#if defined(HAVE_POSIX_FALLOCATE) || defined(WIN32)
 	int			returnCode;
 
 	Assert(FileIsValid(file));
@@ -2405,7 +2405,31 @@ FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
+#ifdef WIN32
+	{
+		off_t		old_size;
+		off_t		new_size;
+
+		/*
+		 * On Windows, files are not sparse by default, so ftruncate() can
+		 * allocate new disk blocks without writing through the page cache.
+		 */
+		old_size = lseek(VfdCache[file].fd, 0, SEEK_END);
+		if (old_size < 0)
+			return -1;
+		new_size = offset + amount;
+		if (new_size > old_size)
+			if (ftruncate(VfdCache[file].fd, new_size) < 0)
+				return -1;
+	}
+#else
+
+	/*
+	 * On Unix, files are usually sparse by default, so posix_fallocate() is
+	 * needed to allocate disk blocks without writing through the page cache.
+	 */
 	returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+#endif
 	pgstat_report_wait_end();
 
 	if (returnCode == 0)
-- 
2.39.5

#7 Vladlen Popolitov
v.popolitov@postgrespro.ru
In reply to: Ryohei Takahashi (Fujitsu) (#5)
Re: COPY performance on Windows

Ryohei Takahashi (Fujitsu) wrote on 2024-12-11 15:18:

> Hi,
>
> I am continuing to investigate the COPY performance problem on Windows.
>
> I noticed that not only PG17.0 but also PG16.6 has a performance problem
> compared to PG16.4.
> The performance is 2.5%-5.8% worse, especially when the number of
> clients is 1 or 2.
>
> I modified the performance measurement script from the thread in [1]:
> * Enabled it to run in Windows Git Bash
> * Enabled it to compare PG16.4, PG16.6 and PG17.0
> * Increased the number of rows by 10 times (about 10GB)
>
> I measured on a Windows Server 2022 machine with a 44-core CPU and 512GB
> of memory.
> The results are as follows.
>
> (1)
> I attach the performance measurement script.
> If you have a Windows environment, could you please reproduce the same
> performance problem?
>
> (2)
> The performance of PG16.6 and PG17.0 is worse than that of PG16.4.
> So, I think some commit between August and September affects the
> performance.
> I will analyze those commits.
>
> [1]
> /messages/by-id/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com
>
> Regards,
> Ryohei Takahashi

The COPY FROM code had changes to the DEFAULT option and to number conversion,
and you also suspect disk reads/writes.
I propose isolating the read/write case. You could create a RAM disk on the
Windows server
(Microsoft has documentation on its site on how to configure one; searching, I
found this but have not tested it:
https://learn.microsoft.com/ru-ru/archive/blogs/windowsinternals/how-to-create-a-ram-disk-in-windows-server
).
You could create the database on the RAM disk and run the benchmarks.
That would exclude the disk's influence, and probably other changes affecting
performance.
The number conversion patch was tested on many processors, but not on all.
You may be running Windows
on one that has worse performance, or there may be other issues.

P.S. Could you point me to the performance measurement script you
mentioned in your email?
I will try to play with it.

--
Best regards,

Vladlen Popolitov.

#8 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
In reply to: Thomas Munro (#6)
RE: COPY performance on Windows

Hi,

Thank you for your reply.
I tried your patches and report the results in this e-mail.

> 1. Those code paths finish up in pg_pwritev(), but it has a loop over
> 8kb writes on Windows. Does it help if we just make "zbuffer" bigger?
> How big?

This patch improves the performance.

I applied 0001-Use-bigger-writes-in-pg_pwrite_zeros-on-Windows.patch on top of REL_16_6.
I varied the "zbuffer" array size from 2 to 32 blocks.
I measured with nclients = 1.

16.6: 453s
16.6 + patch (zbuffer = 2): 442s
16.6 + patch (zbuffer = 4): 434s
16.6 + patch (zbuffer = 8): 430s
16.6 + patch (zbuffer = 16): 429s
16.6 + patch (zbuffer = 32): 428s

Performance improved up to zbuffer = 8 and remained roughly stable beyond that.
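
(For scale: assuming the default BLCKSZ of 8 kB, zbuffer = N here means a static zero buffer of N * 8 kB, i.e. 16 kB at zbuffer = 2 up to 256 kB at zbuffer = 32. With the patch's this_iov_size = Min(remaining_size, sizeof(zbuffer)), each pwrite() issued by the Windows fallback can then cover up to that much at once.)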

> 2. While pondering the goals of posix_fallocate(), I had a
> realisation about how we might implement FileFallocate() on Windows.
> Does this idea work? Well?

This patch degrades the performance.

16.6: 453s
16.6 + patch: 479s

Regards,
Ryohei Takahashi

#9 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
In reply to: Vladlen Popolitov (#7)
RE: COPY performance on Windows

Hi,

Thank you for your interest in this thread.

> You could create the database on the RAM disk and run the benchmarks.

Following your advice, I created a RAM disk and put the input files and the data directory on it.
But the result changed by only a few seconds.

In this test case, the table is an unlogged table and shared_buffers is large enough.
So, I think disk performance does not have much effect.

> P.S. Could you point me to the performance measurement script you
> mentioned in your email?
> I will try to play with it.

Thank you.
Please use the "test.sh" in the following e-mail.
/messages/by-id/TY3PR01MB11891C0FD066F069B113A2376823E2@TY3PR01MB11891.jpnprd01.prod.outlook.com

Regards,
Ryohei Takahashi

#10 Vladlen Popolitov
v.popolitov@postgrespro.ru
In reply to: Ryohei Takahashi (Fujitsu) (#9)
Re: COPY performance on Windows

Ryohei Takahashi (Fujitsu) wrote on 2024-12-16 15:10:

Hi

> Following your advice, I created a RAM disk and put the input files and
> the data directory on it.
> But the result changed by only a few seconds.
> In this test case, the table is an unlogged table and shared_buffers is
> large enough.
> So, I think disk performance does not have much effect.

If the test on a RAM drive gives the same result, it could mean that other
operations (not the disk) affect performance.
It is only an idea, but the numeric conversion may add some time due
to the new functionality that was added.
I think it could be checked by giving the table text fields instead of
numeric ones - that would exclude the numeric conversion
while keeping the same input-output operations (really more I/O operations, but
we need something to compare).
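
(As a rough sketch of the comparison I have in mind - table and file names are placeholders only, and autovacuum is disabled to keep it out of the measurement:

psql -c "CREATE UNLOGGED TABLE test_int (c INT) WITH (autovacuum_enabled = off)"
psql -c "CREATE UNLOGGED TABLE test_text (c TEXT STORAGE PLAIN) WITH (autovacuum_enabled = off)"
psql -c "COPY test_text FROM 'C:/data.csv' WITH csv"
)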

Please use the "test.sh" in the following e-mail.
/messages/by-id/TY3PR01MB11891C0FD066F069B113A2376823E2@TY3PR01MB11891.jpnprd01.prod.outlook.com

OK, I will use it.

By the way, do you use prebuilt Postgres versions for this test, or do you
build them yourself with the same options? I am going to build them
myself.

--
Best regards,

Vladlen Popolitov.

#11 Vladlen Popolitov
v.popolitov@postgrespro.ru
In reply to: Ryohei Takahashi (Fujitsu) (#9)
Re: COPY performance on Windows

Ryohei Takahashi (Fujitsu) wrote on 2024-12-16 15:10:
Hi

Please use the "test.sh" in the following e-mail.
/messages/by-id/TY3PR01MB11891C0FD066F069B113A2376823E2@TY3PR01MB11891.jpnprd01.prod.outlook.com

I cannot reproduce your results. In all of my runs the final result depends
on the run order -
the benchmark for the first version gets a higher time, then the times get smaller.
For example, my last run (in start-time order, times in seconds):
PG164: nclients = 1, time = 251
PG164: nclients = 2, time = 210
PG164: nclients = 4, time = 126
PG164: nclients = 8, time = 107
PG164: nclients = 16, time = 99
PG164: nclients = 32, time = 109
PG164: nclients = 64, time = 112
PG164: nclients = 128, time = 113
PG164: nclients = 256, time = 120
PG166: nclients = 1, time = 244
PG166: nclients = 2, time = 222
PG166: nclients = 4, time = 131
PG166: nclients = 8, time = 109
PG166: nclients = 16, time = 101
PG166: nclients = 32, time = 110
PG166: nclients = 64, time = 115
PG166: nclients = 128, time = 116
PG166: nclients = 256, time = 123
PG170: nclients = 1, time = 240
PG170: nclients = 2, time = 213
PG170: nclients = 4, time = 129
PG170: nclients = 8, time = 110
PG170: nclients = 16, time = 101
PG170: nclients = 32, time = 112
PG170: nclients = 64, time = 115
PG170: nclients = 128, time = 116
PG170: nclients = 256, time = 122

I slightly modified your script:
1) Moved creation of the input files to a separate step, to decrease the
influence of the system disk cache.
2) Ran the PostgreSQL servers on a separate PC (Windows 10, 11th Gen Intel(R)
Core(TM) i5-1135G7 @ 2.40GHz, RAM 16GB),
and the clients on a separate PC.
3) Added a CHECKPOINT at the end of every COPY FROM to flush the WAL.
4) Used the EDB build for Windows from their site. Unfortunately, they
distribute
files without debug symbols, like other distributions, which does not help
during profiling.
5) I think it would be better to make shared_buffers as small as possible
to measure all the I/O time,
but I used 25% of RAM.

My observations:
1) For 1-2 clients the read time decreases on every run (independent of the
Postgres version) -
it looks like the Windows disk cache (I think NTFS system information such as
the btree of file locations,
not the input file itself) - which contradicts your main point that
the 17.0 version is slower.

2) 1 client - the Postgres backend takes only 12% of CPU; the rest of the time it
waits on kernel operations.

3) 16-256 clients - I have not made any analysis of the multiprocessing
effect on the time increase:
OS process implementation, waiting on PostgreSQL locks or spinlocks,
parallel access to one
input file, or other factors.

Could you confirm that you get your results in all execution orders
(17.0 first and 17.0 last)?

--
Best regards,

Vladlen Popolitov.

#12 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
In reply to: Vladlen Popolitov (#11)
RE: COPY performance on Windows

Hi

Thank you for your advice and testing.

> I think it could be checked by giving the table text fields instead of
> numeric ones - that would exclude the numeric conversion
> while keeping the same input-output operations (really more I/O operations, but
> we need something to compare).

I changed the column from int to text.
The performance became worse in each version,
but the relative difference in duration did not change.

> By the way, do you use prebuilt Postgres versions for this test, or do you
> build them yourself with the same options? I am going to build them
> myself.

In the mail of 2024-12-16 12:09:03, I used modules that I built myself.
In the other mail, I used the modules that the community provides at the following:
https://www.postgresql.org/download/windows/

> Could you confirm that you get your results in all execution orders
> (17.0 first and 17.0 last)?

In my environment, the results do not depend on the order.
(The performance of the order 16.4, 16.6, 17.0 is the same as that of the order 17.0, 16.6, 16.4.)

Regards,
Ryohei Takahashi

#13 Vladlen Popolitov
v.popolitov@postgrespro.ru
In reply to: Ryohei Takahashi (Fujitsu) (#12)
Re: COPY performance on Windows

Ryohei Takahashi (Fujitsu) wrote on 2024-12-19 16:13:

> Hi
>
> Thank you for your advice and testing.
>
>> I think it could be checked by giving the table text fields instead of
>> numeric ones - that would exclude the numeric conversion
>> while keeping the same input-output operations (really more I/O operations,
>> but
>> we need something to compare).
>
> I changed the column from int to text.
> The performance became worse in each version,
> but the relative difference in duration did not change.
>
>> By the way, do you use prebuilt Postgres versions for this test, or do you
>> build them yourself with the same options? I am going to build them
>> myself.
>
> In the mail of 2024-12-16 12:09:03, I used modules that I built
> myself.
> In the other mail, I used the modules that the community provides at
> the following:
> https://www.postgresql.org/download/windows/
>
>> Could you confirm that you get your results in all execution
>> orders
>> (17.0 first and 17.0 last)?
>
> In my environment, the results do not depend on the order.
> (The performance of the order 16.4, 16.6, 17.0 is the same as that of the
> order 17.0, 16.6, 16.4.)
>
> Regards,
> Ryohei Takahashi

Hi!

I tested with text (CREATE UNLOGGED TABLE test (c TEXT STORAGE PLAIN)
WITH (autovacuum_enabled = off)).
The time is ~10% higher, but I also do not see a strict dependence on the version.

I think antivirus can influence the performance; I did not switch it
off.

How can I help you with testing? It is probably better to test and compare
results with only one number of clients.
For 1 client the read time of the input file has a big effect; for
many clients their concurrency overhead has
a big effect. Maybe make a series of runs with 2 or 4 clients only
and compare?

--
Best regards,

Vladlen Popolitov.

#14 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
In reply to: Vladlen Popolitov (#13)
RE: COPY performance on Windows

Hi,

I did more investigation on the COPY performance on Windows.

With the community-distributed binaries, the COPY performance of PG16.6 and PG17.0 is worse than that of PG16.4.
However, with the binaries I built myself, there is no difference.
So, it is not a problem with the source code changes after PG16.4.

On the other hand, I noticed that 82a4edabd2 degrades COPY performance on Windows.

I will write the details in the next e-mail.

Regards,
Ryohei Takahashi

#15 Ryohei Takahashi (Fujitsu)
r.takahashi_2@fujitsu.com
In reply to: Ryohei Takahashi (Fujitsu) (#14)
RE: COPY performance on Windows

Hi,

I compared the performance of 94f9c08 (the commit just before 82a4edabd2) and 82a4edabd2 with nclients = 8.
The performance is degraded by 8%.

* 94f9c08 (previous commit of 82a4edabd2)
75sec

* 82a4edabd2
81sec

I looked at the pg_stat_io view after the COPY completed.

* 94f9c08 (previous commit of 82a4edabd2), row for backend_type = client backend, object = relation, context = bulkwrite:
  reads = 0, read_time = 0
  writes = 4408399, write_time = 57207.667
  writebacks = 0, writeback_time = 0
  extends = 4424783, extend_time = 43897.679000000004
  op_bytes = 8192, hits = 6559989, evictions = 0, reuses = 4408399
  fsyncs = (null), fsync_time = (null)
  stats_reset = 2025-01-14 13:57:29.162001+09

* 82a4edabd2, row for backend_type = client backend, object = relation, context = bulkwrite:
  reads = 0, read_time = 0
  writes = 4408514, write_time = 112722.993
  writebacks = 0, writeback_time = 0
  extends = 4424898, extend_time = 33416.076
  op_bytes = 8192, hits = 5245654, evictions = 0, reuses = 4408514
  fsyncs = (null), fsync_time = (null)
  stats_reset = 2025-01-14 13:54:40.643054+09
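
(For reference, the rows above come from a query along these lines:

psql -c "SELECT * FROM pg_stat_io WHERE backend_type = 'client backend' AND object = 'relation' AND context = 'bulkwrite'"
)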

With 82a4edabd2, extend_time is reduced from 43897 ms to 33416 ms.
That seems to be the intended effect of 82a4edabd2.

On the other hand, write_time increased from 57207 ms to 112722 ms.
I think write_time is just the time spent in pwrite(), so I cannot understand why 82a4edabd2 affects write_time.

What do you think about this?

Regards,
Ryohei Takahashi