why is pg_upgrade's regression run so slow?

Started by Andrew Dunstan over 1 year ago · 10 messages
#1 Andrew Dunstan
andrew@dunslane.net

As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the
normal regression tests. That in itself is sad, but what concerns me
here is why it's so much slower than the regular run? This is apparent
everywhere (e.g. on crake the standard run takes about 30 to 90 s, but
pg_upgrade's run takes 5 minutes or more). On Windows, it's
catastrophic, and has only gone unnoticed because the buildfarm client
wasn't counting a timeout as a failure. That was an error on my part and
I have switched a few of my machines to code that checks more robustly
for failure of meson tests - specifically by looking for the absence of
test.success rather than the presence of test.fail. That means that
drongo and fairywren are getting timeout errors; e.g., on the latest run
on fairywren, the regular regression run took 226s, but pg_upgrade's run
of what should be the same set of tests took 2418s. What the heck is
going on here? Is it because there are concurrent tests running?
That doesn't seem enough to make the tests run more than 10 times as long.

I have a strong suspicion this is exacerbated by "debug_parallel_query =
regress", especially since the tests run much faster on REL_17_STABLE
where I am not setting that, but that can't be the whole explanation,
since that setting should apply to both sets of tests.
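
For context, that setting reaches the tests via a TEMP_CONFIG
extra-config file. A minimal sketch of the equivalent manual setup
(the file path and build directory are placeholders):

    # inject the suspected setting the way the buildfarm client does,
    # via a TEMP_CONFIG file that the test drivers pick up
    cat > /tmp/extra.config <<'EOF'
    debug_parallel_query = regress
    EOF
    export TEMP_CONFIG=/tmp/extra.config
    meson test -C build pg_upgrade/002_pg_upgrade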

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: why is pg_upgrade's regression run so slow?

Andrew Dunstan <andrew@dunslane.net> writes:

As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the
normal regression tests. That in itself is sad, but what concerns me
here is why it's so much slower than the regular run? This is apparent
everywhere (e.g. on crake the standard run takes about 30 to 90 s, but
pg_upgrade's run takes 5 minutes or more).

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals. For instance, sifaka's last run shows
00:09 for install-check-C and the same (within rounding error)
for the regression test step in 002_pgupgrade; on the much slower
mamba, the last run took 07:33 for install-check-C and 07:40 for the
002_pgupgrade regression test step.

I'm still using the makefile-based build, and I'm not using
debug_parallel_query, but it doesn't make a lot of sense to me
that either of those things should affect this.

Looking at crake's last run, there are other oddities: why does
the "check" step take 00:24 while "install-check-en_US.utf8" (which
should be doing strictly less work) takes 01:00? Again, there are
not similar discrepancies on my animals. Are you sure there's not
background load on the machine?

regards, tom lane

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: why is pg_upgrade's regression run so slow?

On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

As part of its 002_pgupgrade.pl, the pg_upgrade tests do a rerun of the
normal regression tests. That in itself is sad, but what concerns me
here is why it's so much slower than the regular run? This is apparent
everywhere (e.g. on crake the standard run takes about 30 to 90 s, but
pg_upgrade's run takes 5 minutes or more).

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals. For instance, sifaka's last run shows
00:09 for install-check-C and the same (within rounding error)
for the regression test step in 002_pgupgrade; on the much slower
mamba, the last run took 07:33 for install-check-C and 07:40 for the
002_pgupgrade regression test step.

I'm still using the makefile-based build, and I'm not using
debug_parallel_query, but it doesn't make a lot of sense to me
that either of those things should affect this.

Looking at crake's last run, there are other oddities: why does
the "check" step take 00:24 while "install-check-en_US.utf8" (which
should be doing strictly less work) takes 01:00? Again, there are
not similar discrepancies on my animals. Are you sure there's not
background load on the machine?

Quite sure. Running crake and koel is all it does. It's Fedora 40 running
on bare metal, a Lenovo Y70 with an Intel Core i7-4720HQ CPU @ 2.60GHz
and a Samsung SSD.

The culprit appears to be meson. When I tested running crake with
"using_meson => 0" I got results in line with yours. The regression test
times were consistent, including the installcheck tests. That's
especially ugly on Windows, as we don't have any alternative way of
running the tests there, and the results are so much more catastrophic.
"debug_parallel_query" didn't seem to matter.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: why is pg_upgrade's regression run so slow?

Andrew Dunstan <andrew@dunslane.net> writes:

On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals.

The culprit appears to be meson. When I tested running crake with
"using_meson => 0" I got results in line with yours.

Interesting. Maybe meson is over-aggressively trying to run these
test suites in parallel?
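
If so, forcing meson to run everything serially should make the
discrepancy disappear. A quick check, assuming a meson build directory
named build:

    # disable inter-test parallelism for the whole suite
    meson test -C build --num-processes 1 --print-errorlogs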

regards, tom lane

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#4)
Re: why is pg_upgrade's regression run so slow?

On Sun, Jul 28, 2024 at 10:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Interesting. Maybe meson is over-aggressively trying to run these
test suites in parallel?

Hypothesis: NTFS might not be as good at linking/unlinking lots of
files concurrently due to forced synchronous I/O, causing queuing?

That's what [1] was about for FreeBSD CI. I couldn't immediately see
how to make a RAM disk on Windows, but at the time I had a hunch that
that OS was struggling in the same way.

Could some tuning help? Disable 8dot3name (a thing that creates a
ye-olde-MSDOS-compatible second directory entry for every file),
adjust disablelastaccess (something like noatime), disable USN journal
(a kind of secondary journal of all file system operations that is
used to drive the change notification API that we don't care about),
disable write cache flush so that any synchronous I/O operations don't
wait for that (at the risk of corruption on power loss, but maybe it's
OK on a device dedicated to temporary workspace)? This is just from
some quick googling, but perhaps someone who actually knows how to
drive Windows locally and use the performance monitoring tools could
tell us what it's actually waiting on...
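
For anyone who wants to experiment, here is a hedged sketch of those
knobs using the stock fsutil tool (elevated prompt, scratch volume
only, and check the documentation before touching a volume you care
about):

    rem disable creation of 8.3 short names on the volume with the build tree
    fsutil 8dot3name set c: 1
    rem stop maintaining last-access timestamps (roughly noatime)
    fsutil behavior set disablelastaccess 1
    rem delete and disable the USN change journal on the scratch volume
    fsutil usn deletejournal /d c:
    rem (write-cache flushing is a per-device policy in Device Manager,
    rem not an fsutil setting)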

I noticed there is a new thing called Dev Drive [2] on Windows 11,
which claims to be faster for developer workloads, and there are
graphs [3] showing various projects' test suites going faster. It's
ReFS, a COW file system. From some quick googling, the CopyFile() API
does a fast clone, and that should affect the robocopy command in
Cluster.pm (note: Unixoid cp in there also uses COW cloning on at
least xfs, zfs, probably apfs too). So I would be interested to know
if that goes faster ... or slower. I'm also interested in how it
reacts to the POSIX-semantics mode [4]; that might affect whether we
can ever pull the trigger on that idea.

[1]: /messages/by-id/CA+hUKG+FXLcEg1dyTqJjDiNQ8pGom4KrJj4wF38C90thti9dVA@mail.gmail.com
[2]: https://learn.microsoft.com/en-us/windows/dev-drive/
[3]: https://devblogs.microsoft.com/visualstudio/devdrive/
[4]: /messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com

#6 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#4)
Re: why is pg_upgrade's regression run so slow?

On 2024-07-27 Sa 6:48 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals.

The culprit appears to be meson. When I tested running crake with
"using_meson => 0" I got results in line with yours.

Interesting. Maybe meson is over-aggressively trying to run these
test suites in parallel?

Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on
HEAD for fairywren and drongo - fairywren has just gone green, and I
expect drongo will when it reports in a few hours.

I'm at a loss for an explanation.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#7 Alexander Lakhin
exclusion@gmail.com
In reply to: Andrew Dunstan (#6)
Re: why is pg_upgrade's regression run so slow?

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:

Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on HEAD for fairywren and drongo - fairywren has
just gone green, and I expect drongo will when it reports in a few hours.

I'm at a loss for an explanation.

I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration:     72.34s

test:         postgresql:regress / regress/regress
duration:     41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test:         postgresql:regress / regress/regress
duration:     50.08s

test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration:     398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r testrun/pg_upgrade | wc -l
17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.
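
A minimal repro, assuming a meson build directory at build and a
POSIX-ish shell (paths are placeholders):

    echo 'bogus_setting = on' > /tmp/temp.config    # deliberately invalid
    export TEMP_CONFIG=/tmp/temp.config
    meson test -C build pg_upgrade/002_pg_upgrade   # fails: the bad config is applied
    meson test -C build regress/regress             # passes: TEMP_CONFIG never reaches it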

Best regards,
Alexander

#8 Andrew Dunstan
andrew@dunslane.net
In reply to: Alexander Lakhin (#7)
Re: why is pg_upgrade's regression run so slow?

On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:

Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on
HEAD for fairywren and drongo - fairywren has just gone green, and I
expect drongo will when it reports in a few hours.

I'm at a loss for an explanation.

I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration:     72.34s

test:         postgresql:regress / regress/regress
duration:     41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test:         postgresql:regress / regress/regress
duration:     50.08s

test:         postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration:     398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r
testrun/pg_upgrade | wc -l
17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.

Well, that last fact explains the discrepancy I originally complained
about. How the heck did that happen? It looks like we just ignored its
use in Makefile.global.in :-(

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#9 Alexander Lakhin
exclusion@gmail.com
In reply to: Andrew Dunstan (#8)
1 attachment(s)
Re: why is pg_upgrade's regression run so slow?

Hello Andrew,

29.07.2024 13:54, Andrew Dunstan wrote:

On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.

Well, that last fact explains the discrepancy I originally complained about. How the heck did that happen? It looks
like we just ignored its use in Makefile.global.in :-(

Please look at the attached patch (partially based on ideas from [1]) for
meson.build, that aligns it with `make` in regard to use of TEMP_CONFIG.
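
With the patch in place, the invalid-settings experiment from earlier
in the thread should behave the same under meson as under make; a
sketch of the expected behavior (same placeholder paths as before):

    export TEMP_CONFIG=/tmp/temp.config   # the deliberately invalid file
    meson test -C build regress/regress   # should now fail too, because
                                          # --temp-config gets passed through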

Maybe it could be implemented via a separate meson option, but that would
also require modifying at least the buildfarm client...

[1]: /messages/by-id/CAN55FZ304Kp+510-iU5-Nx6hh32ny9jgGn+OB5uqPExEMK1gQQ@mail.gmail.com

Best regards,
Alexander

Attachments:

add-support-for-TEMP_CONFIG-to-meson.build.patch (text/x-patch; charset=UTF-8)
diff --git a/meson.build b/meson.build
index cd711c6d01..90f4e6da04 100644
--- a/meson.build
+++ b/meson.build
@@ -3388,6 +3388,12 @@ else
   runningcheck = false
 endif
 
+temp_config_env = run_command([
+  python,
+  '-c',
+  'import os; print(os.getenv("TEMP_CONFIG", ""))'
+], check: true).stdout().strip()
+
 testwrap = files('src/tools/testwrap')
 
 foreach test_dir : tests
@@ -3451,6 +3457,10 @@ foreach test_dir : tests
       env = test_env
       env.prepend('PATH', temp_install_bindir, test_dir['bd'])
 
+      if temp_config_env != ''
+        test_command_base += [ '--temp-config=' + temp_config_env ]
+      endif
+
       test_kwargs = {
         'protocol': 'tap',
         'priority': 10,
#10 Andrew Dunstan
andrew@dunslane.net
In reply to: Alexander Lakhin (#9)
Re: why is pg_upgrade's regression run so slow?

On 2024-08-19 Mo 8:00 AM, Alexander Lakhin wrote:

Hello Andrew,

29.07.2024 13:54, Andrew Dunstan wrote:

On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:

And another interesting fact is that TEMP_CONFIG is apparently
ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.

Well, that last fact explains the discrepancy I originally complained
about. How the heck did that happen? It looks like we just ignored
its use in Makefile.global.in :-(

Please look at the attached patch (partially based on ideas from [1]) for
meson.build, that aligns it with `make` in regard to use of TEMP_CONFIG.

Maybe it could be implemented via a separate meson option, but that would
also require modifying at least the buildfarm client...

[1]
/messages/by-id/CAN55FZ304Kp+510-iU5-Nx6hh32ny9jgGn+OB5uqPExEMK1gQQ@mail.gmail.com

I think this is the way to go. The patch LGTM.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com