REL_13_STABLE Windows 10 Regression Failures

Started by Heath Lordover 5 years ago9 messagesbugs
Jump to latest
#1Heath Lord
heath.lord@crunchydata.com

All,

I have been investigating an odd Windows regression failure that I encountered.

When building from source on a Windows 10 VM using MinGW (8.1.0), I
get a random number of regression failures off the REL_13_STABLE
branch. I debugged this a little bit and found out that the "random"
number of failures is fully dependent on the machine and if I disable
the "stats_ext.sql" regression test; all other tests pass without
issue. When the "stats_ext.sql" regression test runs, it causes a
database exception and PostgreSQL crashes. This is why the number of
regression failures is random: it depends upon the speed of the
machine that is running them. However, this behavior is not seen when
building off of master.

I did some digging and determined that on the REL_13_STABLE branch
this instability was introduced with this commit
"b380484a850b6bf7d9fc0d85c555a2366e38451f"[1]. This corresponds to
commit "19f5a37b9fc48a12c77edafb732543875da2f4a3"[1] on master. I
worked backwards from there to determine when the regressions stopped
failing and determined that with commit
"be0a6666656ec3f68eb7d8e7abab5139fcd47012"[2] the regression tests are
no longer failing.

I then checked out REL_13_STABLE and cherry picked the
"be0a6666656ec3f68eb7d8e7abab5139fcd47012" commit and ran a clean full
build of postgres. This build ran through with no issues and passed
all 200 tests.

I have also built using REL_12_STABLE and earlier with no regression failures.

This failure also does not seem to occur on a Server 2016 based system
as the buildfarm machines that run these builds do not see this
regression failure. I have not fully investigated why it appears to
be confined to Windows 10 or what the issue is that causes the PG
database to die during regression.

I was hoping that I could get commit
"be0a6666656ec3f68eb7d8e7abab5139fcd47012" backpatched to the
REL_13_STABLE branch so these regression failures would be resolved.

Thanks,
Heath

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heath Lord (#1)
Re: REL_13_STABLE Windows 10 Regression Failures

Heath Lord <heath.lord@crunchydata.com> writes:

When building from source on a Windows 10 VM using MinGW (8.1.0), I
get a random number of regression failures off the REL_13_STABLE
branch. I debugged this a little bit and found out that the "random"
number of failures is fully dependent on the machine and if I disable
the "stats_ext.sql" regression test; all other tests pass without
issue. When the "stats_ext.sql" regression test runs, it causes a
database exception and PostgreSQL crashes.

Hmph ... it's weird that we have not seen this in the buildfarm.
Have you tried to extract any info from the crash, like a stack trace?

I did some digging and determined that on the REL_13_STABLE branch
this instability was introduced with this commit
"b380484a850b6bf7d9fc0d85c555a2366e38451f"[1]. This corresponds to
commit "19f5a37b9fc48a12c77edafb732543875da2f4a3"[1] on master. I
worked backwards from there to determine when the regressions stopped
failing and determined that with commit
"be0a6666656ec3f68eb7d8e7abab5139fcd47012"[2] the regression tests are
no longer failing.

I'm having a hard time believing that b380484a8 would have introduced
a portability problem, and an even harder time believing that be0a66666
would have resolved it if so. What seems more likely is that there's
some underlying issue such as a memory stomp, that the first commit
accidentally exposed and the second one accidentally hid again.
So, even if back-patching be0a66666 seemed feasible from a stability
standpoint (which I don't think it is), I fear it'd just mask a problem
that would eventually bite us again.

So I think we need to dig down and try to identify the root cause,
without any preconceptions about how to fix it. Again, a stack trace
would be pretty useful. Or at least some info about which step of
stats_ext.sql is crashing.

regards, tom lane

#3Heath Lord
heath.lord@crunchydata.com
In reply to: Tom Lane (#2)
Re: REL_13_STABLE Windows 10 Regression Failures

Tom,
We are working to set up our environment to allow us to get a stack
trace as we do not have any of the Visual Studios stuff installed
right now. However, I thought I would send you a little more
information while we are trying to get that working.
Going through the stats_ext.sql file line by line with a freshly
built REL_13_STABLE database stood up we have determined that running
any of the following commands back to back will cause the database to
crash:

CREATE STATISTICS tst ON relnatts + relpages FROM pg_class;
CREATE STATISTICS tst ON (relpages, reltuples) FROM pg_class;

If you run another command in between them like:

SELECT version();

Then it will not crash when you run either of those commands again.
However if you run any combination of those 2 commands back to back it
will crash the database. The output from the psql instance after
stepping through the stats_ext.sql file is in the
stats_ext_psql_output.txt file attached.

The information from the postgres logfile for the above is attached
in the pg_logfile_output.txt file.

Hopefully, this will at least give you some information while we
are working on getting the backtrace. Thanks.

-Heath

Show quoted text

On Fri, Oct 30, 2020 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heath Lord <heath.lord@crunchydata.com> writes:

When building from source on a Windows 10 VM using MinGW (8.1.0), I
get a random number of regression failures off the REL_13_STABLE
branch. I debugged this a little bit and found out that the "random"
number of failures is fully dependent on the machine and if I disable
the "stats_ext.sql" regression test; all other tests pass without
issue. When the "stats_ext.sql" regression test runs, it causes a
database exception and PostgreSQL crashes.

Hmph ... it's weird that we have not seen this in the buildfarm.
Have you tried to extract any info from the crash, like a stack trace?

I did some digging and determined that on the REL_13_STABLE branch
this instability was introduced with this commit
"b380484a850b6bf7d9fc0d85c555a2366e38451f"[1]. This corresponds to
commit "19f5a37b9fc48a12c77edafb732543875da2f4a3"[1] on master. I
worked backwards from there to determine when the regressions stopped
failing and determined that with commit
"be0a6666656ec3f68eb7d8e7abab5139fcd47012"[2] the regression tests are
no longer failing.

I'm having a hard time believing that b380484a8 would have introduced
a portability problem, and an even harder time believing that be0a66666
would have resolved it if so. What seems more likely is that there's
some underlying issue such as a memory stomp, that the first commit
accidentally exposed and the second one accidentally hid again.
So, even if back-patching be0a66666 seemed feasible from a stability
standpoint (which I don't think it is), I fear it'd just mask a problem
that would eventually bite us again.

So I think we need to dig down and try to identify the root cause,
without any preconceptions about how to fix it. Again, a stack trace
would be pretty useful. Or at least some info about which step of
stats_ext.sql is crashing.

regards, tom lane

Attachments:

pg_logfile_output.txttext/plain; charset=US-ASCII; name=pg_logfile_output.txtDownload
stats_ext_psql_output.txttext/plain; charset=US-ASCII; name=stats_ext_psql_output.txtDownload
#4Heath Lord
heath.lord@crunchydata.com
In reply to: Heath Lord (#3)
Re: REL_13_STABLE Windows 10 Regression Failures

On Fri, Oct 30, 2020 at 3:47 PM Heath Lord <heath.lord@crunchydata.com> wrote:

Tom,
We are working to set up our environment to allow us to get a stack
trace as we do not have any of the Visual Studios stuff installed
right now. However, I thought I would send you a little more
information while we are trying to get that working.
Going through the stats_ext.sql file line by line with a freshly
built REL_13_STABLE database stood up we have determined that running
any of the following commands back to back will cause the database to
crash:

CREATE STATISTICS tst ON relnatts + relpages FROM pg_class;
CREATE STATISTICS tst ON (relpages, reltuples) FROM pg_class;

If you run another command in between them like:

SELECT version();

Then it will not crash when you run either of those commands again.
However if you run any combination of those 2 commands back to back it
will crash the database. The output from the psql instance after
stepping through the stats_ext.sql file is in the
stats_ext_psql_output.txt file attached.

The information from the postgres logfile for the above is attached
in the pg_logfile_output.txt file.

Hopefully, this will at least give you some information while we
are working on getting the backtrace. Thanks.

-Heath

On Fri, Oct 30, 2020 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heath Lord <heath.lord@crunchydata.com> writes:

When building from source on a Windows 10 VM using MinGW (8.1.0), I
get a random number of regression failures off the REL_13_STABLE
branch. I debugged this a little bit and found out that the "random"
number of failures is fully dependent on the machine and if I disable
the "stats_ext.sql" regression test; all other tests pass without
issue. When the "stats_ext.sql" regression test runs, it causes a
database exception and PostgreSQL crashes.

Hmph ... it's weird that we have not seen this in the buildfarm.
Have you tried to extract any info from the crash, like a stack trace?

I did some digging and determined that on the REL_13_STABLE branch
this instability was introduced with this commit
"b380484a850b6bf7d9fc0d85c555a2366e38451f"[1]. This corresponds to
commit "19f5a37b9fc48a12c77edafb732543875da2f4a3"[1] on master. I
worked backwards from there to determine when the regressions stopped
failing and determined that with commit
"be0a6666656ec3f68eb7d8e7abab5139fcd47012"[2] the regression tests are
no longer failing.

I'm having a hard time believing that b380484a8 would have introduced
a portability problem, and an even harder time believing that be0a66666
would have resolved it if so. What seems more likely is that there's
some underlying issue such as a memory stomp, that the first commit
accidentally exposed and the second one accidentally hid again.
So, even if back-patching be0a66666 seemed feasible from a stability
standpoint (which I don't think it is), I fear it'd just mask a problem
that would eventually bite us again.

So I think we need to dig down and try to identify the root cause,
without any preconceptions about how to fix it. Again, a stack trace
would be pretty useful. Or at least some info about which step of
stats_ext.sql is crashing.

regards, tom lane

All,
I was finally able to get a stack trace. I apologize for it taking
so long, but for some reason when I did a configure with
"--enable-cassert --enable-debug" all of the regression tests passed.
However I finally was able to get it to work with only using
"--enable-debug" and then using the mingw version of gdb to get what I
have attached. Please let me know if this contains any useful
information. I also noticed that causing the crash was slightly
different and it definitely behaved differently with debug enabled.
Thank you in advance for any help.

-Heath

Attachments:

back_trace.txttext/plain; charset=US-ASCII; name=back_trace.txtDownload
#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heath Lord (#4)
Re: REL_13_STABLE Windows 10 Regression Failures

On 2020-Nov-09, Heath Lord wrote:

Thread 1 received signal ?, Unknown signal.
[Switching to Thread 4440.0x39c]
0x00007ffbf387e598 in ntdll!RtlRaiseStatus () from C:\Windows\SYSTEM32\ntdll.dll
#0 0x00007ffbf387e598 in ntdll!RtlRaiseStatus () from C:\Windows\SYSTEM32\ntdll.dll
#1 0x00007ffbf37e6673 in ntdll!RtlUnwindEx () from C:\Windows\SYSTEM32\ntdll.dll
#2 0x00007ffbf0a734bd in msvcrt!_setjmpex () from C:\Windows\System32\msvcrt.dll
#3 0x000000000082535a in pg_re_throw () at elog.c:1717
#4 0x000000000082517a in errfinish (filename=<optimized out>, filename@entry=0x9595a7 <__func__.114045+79> "statscmds.c", lineno=lineno@entry=204, funcname=funcname@entry=0x959990 <__func__.111351> "CreateStatistics") at elog.c:502
#5 0x000000000056552c in CreateStatistics (stmt=0x4c05858, stmt@entry=0x4b6ee30) at statscmds.c:202

Hmm, line 202 is the ereport in this test:

if (!IsA(expr, ColumnRef))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only simple column references are allowed in CREATE STATISTICS")));

Not sure why that gives rise to the upper parts of the stack there.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: REL_13_STABLE Windows 10 Regression Failures

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Hmm, line 202 is the ereport in this test:

if (!IsA(expr, ColumnRef))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only simple column references are allowed in CREATE STATISTICS")));

Not sure why that gives rise to the upper parts of the stack there.

Yeah, it seems like the error-recovery longjmp has suddenly broken;
but why here? There's nothing unusual about this specific error case.

regards, tom lane

#7Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: REL_13_STABLE Windows 10 Regression Failures

High,

On 2020-11-09 12:18:34 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Hmm, line 202 is the ereport in this test:

if (!IsA(expr, ColumnRef))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only simple column references are allowed in CREATE STATISTICS")));

Not sure why that gives rise to the upper parts of the stack there.

Yeah, it seems like the error-recovery longjmp has suddenly broken;
but why here? There's nothing unusual about this specific error case.

Perhaps PG_exception_stack got corrupted somehow? An oversized write to
a neighboring var?

Not sure if that works on mingw, but building with address sanitizer /
asan might be informative.

Greetings,

Andres Freund

#8Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Andres Freund (#7)
Re: REL_13_STABLE Windows 10 Regression Failures

On Wed, Nov 11, 2020 at 2:32 AM Andres Freund <andres@anarazel.de> wrote:

High,

On 2020-11-09 12:18:34 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Hmm, line 202 is the ereport in this test:

if (!IsA(expr, ColumnRef))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only simple column references are allowed

in CREATE STATISTICS")));

Not sure why that gives rise to the upper parts of the stack there.

Yeah, it seems like the error-recovery longjmp has suddenly broken;
but why here? There's nothing unusual about this specific error case.

Perhaps PG_exception_stack got corrupted somehow? An oversized write to
a neighboring var?

Not sure if that works on mingw, but building with address sanitizer /
asan might be informative.

This looks like a known issue in MinGW64 builds [1]https://sourceforge.net/p/mingw-w64/bugs/406/, which is derived from
an also known issue in MSVC's handling of setjmp/longjmp [2]https://docs.microsoft.com/en-us/cpp/cpp/using-setjmp-longjmp.

[1]: https://sourceforge.net/p/mingw-w64/bugs/406/
[2]: https://docs.microsoft.com/en-us/cpp/cpp/using-setjmp-longjmp

Regards,

Juan José Santamaría Flecha

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Juan José Santamaría Flecha (#8)
Re: REL_13_STABLE Windows 10 Regression Failures

=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes:

On Wed, Nov 11, 2020 at 2:32 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-11-09 12:18:34 -0500, Tom Lane wrote:

Yeah, it seems like the error-recovery longjmp has suddenly broken;
but why here? There's nothing unusual about this specific error case.

This looks like a known issue in MinGW64 builds [1], which is derived from
an also known issue in MSVC's handling of setjmp/longjmp [2].
[1] https://sourceforge.net/p/mingw-w64/bugs/406/
[2] https://docs.microsoft.com/en-us/cpp/cpp/using-setjmp-longjmp

Not sure I believe that. The trouble with diagnosing this as a generic
"setjmp is broken" situation is that then it's not very credible that the
regression tests got this far. As for [2], that's talking specifically
about longjmp not executing C++ destructors, which isn't an issue for us.

regards, tom lane