Robocopy might be not robust enough for never-ending testing on Windows

Started by Alexander Lakhinover 1 year ago9 messageshackers

exclusion@gmail.com

over 1 year ago

Hello hackers,

While trying to reproduce inexplicable drongo failures (e. g., [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53)
seemingly related to BackgroundPsql, I stumbled upon close, but not
the same issue. After many (6-8 thousands) iterations of the
015_stream.pl TAP test, psql failed to start with a STATUS_DLL_INIT_FAILED
error, and a corresponding Windows popup preventing a test exit (see ss-1
in the archive attached).

Upon reaching that state of the system, following test runs fail with one
or another error related to memory (not only with psql, but also with the
server processes):
testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11 20:01:51.777 PDT [8812] LOG: server process
(PID 11532) was terminated by exception 0xC00000FD
testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11 20:01:19.806 PDT [2036] LOG: server process
(PID 10548) was terminated by exception 0xC00000FD
testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.513 PDT [9128] LOG: server process
(PID 14476) was terminated by exception 0xC0000409
testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11 20:03:27.801 PDT [10156] LOG: server process
(PID 2236) was terminated by exception 0xC0000409
testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.359 PDT [10656] LOG: server process
(PID 14712) was terminated by exception 0xC000012D
testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11 20:02:23.815 PDT [13704] LOG: server process
(PID 13992) was terminated by exception 0xC00000FD
testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.360 PDT [9760] LOG: server process
(PID 11608) was terminated by exception 0xC0000142
...

When tests fail, I see Commit Charge reaching 100% (see ss-2 in the
attachment), while Physical Memory isn't all used up. To get OS to a
working state, I had to reboot it — killing processes, logoff/logon didn't
help.

I reproduced this issue again, investigated it and found out that it is
caused by robocopy (used by PostgreSQL::Test::Cluster->init), which is
leaking kernel objects, namely "Event objects" within Non-Paged pool on
each run.

This can be seen with Kernel Pool Monitor, or just with this simple PS script:
for ($i = 1; $i -le 100; $i++)
{
echo "iteration $i"
rm -r c:\temp\target
robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}

It shows to me:
iteration 1
               Total    Copied   Skipped Mismatch    FAILED Extras
    Dirs :        27        27         0         0         0 0
   Files :       968       968         0         0         0 0
   Bytes :   38.29 m   38.29 m         0         0         0 0
   Times :   0:00:00   0:00:00                       0:00:00 0:00:00
...
1226063872
...
iteration 100
               Total    Copied   Skipped Mismatch    FAILED Extras
    Dirs :        27        27         0         0         0 0
   Files :       968       968         0         0         0 0
   Bytes :   38.29 m   38.29 m         0         0         0 0
   Times :   0:00:00   0:00:00                       0:00:00 0:00:00
...
1245220864

(That is, 0.1-0.2 MB leaks per one robocopy run.)

I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't affect the issue.

Perhaps this information can be helpful for someone who is running
buildfarm/CI tests on Windows animals...

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53

Best regards,
Alexander

Andrew Dunstan

andrew@dunslane.net

over 1 year ago

In reply to: Alexander Lakhin (#1)

Re: Robocopy might be not robust enough for never-ending testing on Windows

On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:

Hello hackers,

While trying to reproduce inexplicable drongo failures (e. g., [1])
seemingly related to BackgroundPsql, I stumbled upon close, but not
the same issue. After many (6-8 thousands) iterations of the
015_stream.pl TAP test, psql failed to start with a
STATUS_DLL_INIT_FAILED
error, and a corresponding Windows popup preventing a test exit (see ss-1
in the archive attached).

Upon reaching that state of the system, following test runs fail with one
or another error related to memory (not only with psql, but also with the
server processes):
testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11
20:01:51.777 PDT [8812] LOG: server process (PID 11532) was
terminated by exception 0xC00000FD
testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11
20:01:19.806 PDT [2036] LOG: server process (PID 10548) was
terminated by exception 0xC00000FD
testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.513 PDT [9128] LOG: server process (PID 14476) was
terminated by exception 0xC0000409
testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11
20:03:27.801 PDT [10156] LOG: server process (PID 2236) was
terminated by exception 0xC0000409
testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.359 PDT [10656] LOG: server process (PID 14712) was
terminated by exception 0xC000012D
testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11
20:02:23.815 PDT [13704] LOG: server process (PID 13992) was
terminated by exception 0xC00000FD
testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11
19:59:41.360 PDT [9760] LOG: server process (PID 11608) was
terminated by exception 0xC0000142
...

When tests fail, I see Commit Charge reaching 100% (see ss-2 in the
attachment), while Physical Memory isn't all used up. To get OS to a
working state, I had to reboot it — killing processes, logoff/logon
didn't
help.

I reproduced this issue again, investigated it and found out that it is
caused by robocopy (used by PostgreSQL::Test::Cluster->init), which is
leaking kernel objects, namely "Event objects" within Non-Paged pool on
each run.

This can be seen with Kernel Pool Monitor, or just with this simple PS
script:
for ($i = 1; $i -le 100; $i++)
{
echo "iteration $i"
rm -r c:\temp\target
robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | %
PoolNonpagedBytes
}

It shows to me:
iteration 1
               Total    Copied   Skipped Mismatch    FAILED Extras
    Dirs :        27        27         0         0         0 0
   Files :       968       968         0         0         0 0
   Bytes :   38.29 m   38.29 m         0         0         0 0
   Times :   0:00:00   0:00:00                       0:00:00 0:00:00
...
1226063872
...
iteration 100
               Total    Copied   Skipped Mismatch    FAILED Extras
    Dirs :        27        27         0         0         0 0
   Files :       968       968         0         0         0 0
   Bytes :   38.29 m   38.29 m         0         0         0 0
   Times :   0:00:00   0:00:00                       0:00:00 0:00:00
...
1245220864

(That is, 0.1-0.2 MB leaks per one robocopy run.)

I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't affect the issue.

Perhaps this information can be helpful for someone who is running
buildfarm/CI tests on Windows animals...

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53

Interesting, good detective work. Still, wouldn't this mean drongo would
fail consistently?

I wonder why we're using robocopy instead of our own RecursiveCopy module?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Alexander Lakhin

exclusion@gmail.com

over 1 year ago

In reply to: Andrew Dunstan (#2)

Re: Robocopy might be not robust enough for never-ending testing on Windows

Hello Andrew,

14.09.2024 17:22, Andrew Dunstan wrote:

On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:

While trying to reproduce inexplicable drongo failures (e. g., [1])
seemingly related to BackgroundPsql, I stumbled upon close, but not
the same issue. ...

Interesting, good detective work. Still, wouldn't this mean drongo would fail consistently?

Yes, I think drongo is suffering from another disease, we're yet to find
which.

I wonder why we're using robocopy instead of our own RecursiveCopy module?

AFAICS, robocopy is also used by regress.exe, so switching to the perl
module would require perl even for regress tests. I know that perl was
a requirement for MSVC builds, but maybe that's not so with meson...

Best regards,
Alexander

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Alexander Lakhin (#1)

Re: Robocopy might be not robust enough for never-ending testing on Windows

On Sun, Sep 15, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

(That is, 0.1-0.2 MB leaks per one robocopy run.)

I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't affect the issue.

I don't understand Windows but that seems pretty weird to me, as it
seems to imply that a driver or something fairly low level inside the
kernel is leaking objects (at least by simple minded analogies to
operating systems I understand better). Either that or robocop.exe
has userspace stuff involving at least one thread still running
somewhere after it's exited, but that seems unlikely as I guess you'd
have noticed that...

Just a thought: I was surveying the block cloning landscape across
OSes and filesystems while looking into clone-based CREATE DATABASE
(CF #4886) and also while thinking about the new TAP test initdb
template copy trick, is that robocopy.exe tries to use Windows' block
cloning magic, just like cp on recent Linux and FreeBSD systems (at
one point I was wondering if that was causing some funky extra flush
stalls on some systems, I need to come back to that...). It probably
doesn't actually work unless you have Windows 11 kernel with DevDrive
enabled (from reading, no Windows here), but I guess it still probably
uses the new system interfaces, probably something like CopyFileEx().
Does it still leak if you use /nooffload or /noclone?

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Thomas Munro (#4)

Re: Robocopy might be not robust enough for never-ending testing on Windows

On Sun, Sep 15, 2024 at 8:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:

template copy trick, is that robocopy.exe tries to use Windows' block

Erm, sorry for my fumbled early morning message editing and garbled
English... I meant to write "one thing I noticed is that..".

Alexander Lakhin

exclusion@gmail.com

over 1 year ago

In reply to: Thomas Munro (#4)

Re: Robocopy might be not robust enough for never-ending testing on Windows

Hello Thomas,

14.09.2024 23:32, Thomas Munro wrote:

On Sun, Sep 15, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

(That is, 0.1-0.2 MB leaks per one robocopy run.)

I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't affect the issue.

I don't understand Windows but that seems pretty weird to me, as it
seems to imply that a driver or something fairly low level inside the
kernel is leaking objects (at least by simple minded analogies to
operating systems I understand better). Either that or robocop.exe
has userspace stuff involving at least one thread still running
somewhere after it's exited, but that seems unlikely as I guess you'd
have noticed that...

Yes, I see no robocopy process left after the test, and I think userspace
threads would not survive logoff.

Just a thought: I was surveying the block cloning landscape across
OSes and filesystems while looking into clone-based CREATE DATABASE
(CF #4886) and also while thinking about the new TAP test initdb
template copy trick, is that robocopy.exe tries to use Windows' block
cloning magic, just like cp on recent Linux and FreeBSD systems (at
one point I was wondering if that was causing some funky extra flush
stalls on some systems, I need to come back to that...). It probably
doesn't actually work unless you have Windows 11 kernel with DevDrive
enabled (from reading, no Windows here), but I guess it still probably
uses the new system interfaces, probably something like CopyFileEx().
Does it still leak if you use /nooffload or /noclone?

I tested the following (with the script above):
Windows 10 (Version 10.0.19045.4780):
robocopy.exe (10.0.19041.4717) /NOOFFLOAD
iteration 1
496611328
...
iteration 1000
609701888

That is, it leaks

/NOCLONE is not supported by that robocopy version:
ERROR : Invalid Parameter #1 : "/NOCLONE"

Then, Windows 11 (Version 10.0.22000.613), robocopy 10.0.22000.469:
iteration 1
141217792
...
iteration 996
151670784
...
iteration 997
152817664
...
iteration 1000
151674880

That is, it doesn't leak.

robocopy.exe /NOOFFLOAD
iteration 1
152666112
...
iteration 1000
153341952

No leak.

/NOCLONE is not supported by that robocopy version:

Then I updated that Windows 11 to Version 10.0.22000.2538 (with KB5031358),
robocopy 10.0.22000.1516:
iteration 1
122753024
...
iteration 1000
244674560

It does leak.

robocopy /NOOFFLOAD
iteration 1
167522304
...
iteration 1000
283484160

It leaks as well.

Finally, I've installed newest Windows 11 Version 10.0.22631.4169, with
robocopy 10.0.22621.3672:
Non-paged pool increased from 133 to 380 MB after 1000 robocopy runs.

robocopy /OFFLOAD leaks too.

/NOCLONE is not supported by that robocopy version:

So this leak looks like a recent and still existing defect.

(Sorry for a delay, fighting with OS updates/installation took me a while.)

Best regards,
Alexander

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Alexander Lakhin (#6)

Re: Robocopy might be not robust enough for never-ending testing on Windows

On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

So this leak looks like a recent and still existing defect.

From my cartoon-like understanding of Windows, I would guess that if
event handles created by a program are leaked after it has exited, it
would normally imply that they've been duplicated somewhere else that
is still running (for example see the way that PostgreSQL's
dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the
postmaster, so that the memory segment continues to exist after the
backend exits), and if it's that, you'd be able to see the handle
count going up in the process monitor for some longer running process
somewhere (as seen in this report from the Chrome hackers[1]https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/). And if
it's not that, then I would guess it would have to be a kernel bug
because something outside userspace must be holding onto/leaking
handles. But I don't really understand Windows beyond trying to debug
PostgreSQL at a distance, so my guesses may be way off. If we wanted
to try to find a Windows expert to look at a standalone repro, does
your PS script work with *any* source directory, or is there something
about the initdb template, in which case could you post it in a .zip
file so that a non-PostgreSQL person could see the failure mode?

[1]: https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/

Alexander Lakhin

exclusion@gmail.com

over 1 year ago

In reply to: Thomas Munro (#7)

Re: Robocopy might be not robust enough for never-ending testing on Windows

Hello Thomas,

17.09.2024 04:01, Thomas Munro wrote:

On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

So this leak looks like a recent and still existing defect.

From my cartoon-like understanding of Windows, I would guess that if
event handles created by a program are leaked after it has exited, it
would normally imply that they've been duplicated somewhere else that
is still running (for example see the way that PostgreSQL's
dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the
postmaster, so that the memory segment continues to exist after the
backend exits), and if it's that, you'd be able to see the handle
count going up in the process monitor for some longer running process
somewhere (as seen in this report from the Chrome hackers[1]). And if
it's not that, then I would guess it would have to be a kernel bug
because something outside userspace must be holding onto/leaking
handles. But I don't really understand Windows beyond trying to debug
PostgreSQL at a distance, so my guesses may be way off. If we wanted
to try to find a Windows expert to look at a standalone repro, does
your PS script work with *any* source directory, or is there something
about the initdb template, in which case could you post it in a .zip
file so that a non-PostgreSQL person could see the failure mode?

[1] https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/

That's very interesting reading. I'll try to research the issue that deep
later (though I guess this case is different — after logging off and
logging in as another user, I can't see any processes belonging to the
first one, while those "Event objects" in non-paged pool still occupy
memory), but finding a Windows expert who perhaps can look at the
robocopy's sources, would be good too (and more productive).

So, the repro we can show is:
rm -r c:\temp\source
mkdir c:\temp\source
for ($i = 1; $i -le 1000; $i++)
{
echo 1 > "c:\temp\source\$i"
}

for ($i = 1; $i -le 1000; $i++)
{
echo "iteration $i"
rm -r c:\temp\target
robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\source c:\temp\target
Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}

It produces for me (on Windows 10 [Version 10.0.19045.4780]):
iteration 1
...
216887296
...
iteration 1000

------------------------------------------------------------------------------

               Total    Copied   Skipped Mismatch    FAILED Extras
    Dirs :         1         1         0         0         0 0
   Files :      1000      1000         0         0         0 0
   Bytes :     7.8 k     7.8 k         0         0         0 0
   Times :   0:00:00   0:00:00                       0:00:00 0:00:00

   Speed :               17660 Bytes/sec.
   Speed :               1.010 MegaBytes/min.
   Ended : Monday, September 16, 2024 8:58:09 PM

365080576

Just "touch c:\temp\source\$i" is not enough, files must be non-empty for
the leak to happen.

Best regards,
Alexander

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Alexander Lakhin (#8)

Re: Robocopy might be not robust enough for never-ending testing on Windows

On Tue, Sep 17, 2024 at 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

but finding a Windows expert who perhaps can look at the
robocopy's sources, would be good too (and more productive).

number_of_kernel_bugs_found_by_pgsql_hackers++;

I am reliably informed that a kernel bug fix is in the pipeline and
should reach Windows 10 in ~3 months.

Robocopy might be not robust enough for never-ending testing on Windows

Attachments: