No core file generated after PostgresNode->start

Started by Andy Fan over 5 years ago (10 messages)
#1 Andy Fan
zhihui.fan1213@gmail.com

Hi:

When I run make -C subscription check, I see the following messages
in ./tmp_check/log/013_partition_publisher.log:

2020-05-11 09:37:40.778 CST [69541] sub_viaroot WARNING: terminating connection because of crash of another server process
2020-05-11 09:37:40.778 CST [69541] sub_viaroot DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

However, no core file is generated. In other cases (such as starting
PostgreSQL manually with bin/postgres xxx), a core file is generated
successfully on the same machine. What might be the problem in the
PostgresNode case?

I tried this modification, but it doesn't help.

--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -766,7 +766,7 @@ sub start
                # Note: We set the cluster_name here, not in postgresql.conf (in
                # sub init) so that it does not get copied to standbys.
-               $ret = TestLib::system_log('pg_ctl', '-D', $self->data_dir, '-l',
+               $ret = TestLib::system_log('pg_ctl', "-c", '-D', $self->data_dir, '-l',
                        $self->logfile, '-o', "--cluster-name=$name", 'start');
        }
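
For context, pg_ctl's -c (--core-files) option tries to allow core files by lifting the soft resource limit on core size. The following is only a rough Python sketch of that idea, not pg_ctl's actual code; the function name lift_core_soft_limit is made up for illustration. It raises the soft RLIMIT_CORE of the current process to the hard limit, which child processes then inherit.

```python
import resource


def lift_core_soft_limit():
    """Raise the soft core-size limit to the hard limit.

    Returns (old_soft, new_soft). Hypothetical helper for illustration;
    it only mimics the idea behind "pg_ctl -c".
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft != hard:
        # Raising the soft limit up to the hard limit needs no privileges.
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    new_soft, _ = resource.getrlimit(resource.RLIMIT_CORE)
    return soft, new_soft


if __name__ == "__main__":
    old, new = lift_core_soft_limit()
    # resource.RLIM_INFINITY means "unlimited"; 0 means core files are disabled.
    print("soft core limit:", old, "->", new)
```

If the hard limit itself is 0, lifting the soft limit changes nothing, and a process that never actually crashes writes no core file regardless of the limit.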

Best Regards
Andy Fan

#2 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Andy Fan (#1)
Re: No core file generated after PostgresNode->start

On Mon, May 11, 2020 at 9:48 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

Hi:

2020-05-11 09:37:40.778 CST [69541] sub_viaroot WARNING: terminating
connection because of crash of another server process

It looks like this doesn't mean a crash. If the test case
(subscription/t/013_partition.pl) fails, the test framework kills some
processes, which leads to the above message. So you can ignore this
issue for now. Thanks.

Best Regards
Andy Fan

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Andy Fan (#2)
Re: No core file generated after PostgresNode->start

On Sun, May 10, 2020 at 11:21 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:

Looks this doesn't mean a crash. If the test case(subscription/t/013_partition.pl)
failed, test framework kill some process, which leads the above message. So you can
ignore this issue now. Thanks

I think there might be a real issue here someplace, though, because I
couldn't get a core dump last week when I did have a crash happening
locally. I didn't poke into it very hard though so I never figured out
exactly why not, but ulimit -c unlimited didn't help.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Antonin Houska
ah@cybertec.at
In reply to: Robert Haas (#3)
Re: No core file generated after PostgresNode->start

Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, May 10, 2020 at 11:21 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:

Looks this doesn't mean a crash. If the test case(subscription/t/013_partition.pl)
failed, test framework kill some process, which leads the above message. So you can
ignore this issue now. Thanks

I think there might be a real issue here someplace, though, because I
couldn't get a core dump last week when I did have a crash happening
locally. I didn't poke into it very hard though so I never figured out
exactly why not, but ulimit -c unlimited didn't help.

Could "sysctl kernel.core_pattern" be the problem? I discovered this setting
some time ago when I also couldn't find a core dump on Linux.
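
As a hedged illustration of that check: on Linux the pattern can be read from /proc/sys/kernel/core_pattern. A leading "|" means the kernel pipes the dump to a helper program instead of writing a file; a pattern that does not start with "/" is resolved relative to the crashing process's working directory. The helper names below are invented for this sketch.

```python
from pathlib import Path

CORE_PATTERN_FILE = Path("/proc/sys/kernel/core_pattern")  # Linux only


def describe_core_pattern(pattern):
    """Classify a kernel.core_pattern value as 'piped', 'absolute' or 'relative'."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # The dump is handed to a program, so no core file may appear on disk.
        return "piped"
    if pattern.startswith("/"):
        return "absolute"
    # Written relative to the working directory of the crashing process.
    return "relative"


def read_core_pattern():
    """Return the current pattern, or None where the file does not exist (e.g. macOS)."""
    if CORE_PATTERN_FILE.exists():
        return CORE_PATTERN_FILE.read_text().strip()
    return None


if __name__ == "__main__":
    current = read_core_pattern()
    if current is None:
        print("no kernel.core_pattern on this platform")
    else:
        print(current, "->", describe_core_pattern(current))
```

A "relative" result would mean looking for the core file in the server's working directory, which for a PostgreSQL server is normally its data directory.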

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Antonin Houska (#4)
Re: No core file generated after PostgresNode->start

On Mon, May 11, 2020 at 4:24 PM Antonin Houska <ah@cybertec.at> wrote:

Could "sysctl kernel.core_pattern" be the problem? I discovered this setting
sometime when I also couldn't find the core dump on linux.

Well, I'm running on macOS and the core files normally show up in
/cores, but in this case they didn't.

--
Robert Haas

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: No core file generated after PostgresNode->start

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, May 11, 2020 at 4:24 PM Antonin Houska <ah@cybertec.at> wrote:

Could "sysctl kernel.core_pattern" be the problem? I discovered this setting
sometime when I also couldn't find the core dump on linux.

Well, I'm running on macOS and the core files normally show up in
/cores, but in this case they didn't.

I have a standing note to check the permissions on /cores after any macOS
upgrade, because every so often Apple decides that that directory ought to
be read-only.
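
A minimal sketch of such a permissions check, assuming the core directory is /cores as mentioned above; the function name core_dir_writable is hypothetical. Note that os.access tests the permissions of the user running the script, which may differ from the user whose process crashes.

```python
import os


def core_dir_writable(path="/cores"):
    """Return True if path is a directory the current user can create files in."""
    # Creating a file needs both write and search (execute) permission
    # on the directory.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


if __name__ == "__main__":
    if core_dir_writable():
        print("/cores is writable")
    else:
        print("/cores is missing or not writable; check its permissions")
```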

regards, tom lane

#7 Andy Fan
zhihui.fan1213@gmail.com
In reply to: Robert Haas (#3)
Re: No core file generated after PostgresNode->start

On Tue, May 12, 2020 at 3:36 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, May 10, 2020 at 11:21 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:

Looks this doesn't mean a crash. If the test case(subscription/t/013_partition.pl)
failed, test framework kill some process, which leads the above message. So you can
ignore this issue now. Thanks

I think there might be a real issue here someplace, though, because I
couldn't get a core dump last week when I did have a crash happening
locally.

I forgot to say that the failure happens on my modified version. I guess
this is what happened in my case (subscription/t/013_partition.pl):

1. The test needs to read data from the slave, but it gets an ERROR
(elog(ERROR, ..)) rather than a crash.
2. The test framework knows the case failed, so it kills the primary in
some way.
3. The primary emits the messages below.

2020-05-11 09:37:40.778 CST [69541] sub_viaroot WARNING: terminating connection because of crash of another server process
2020-05-11 09:37:40.778 CST [69541] sub_viaroot DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Finally I found the root cause by looking into the error log on the
slave. After I fixed my bug, the issue was gone.
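
One rough way to tell the two situations apart from a server log, sketched in Python under the assumption that a real crash is reported by the postmaster with a line of the form "server process (PID n) was terminated by signal m: ...". The WARNING quoted above comes from a surviving backend and does not by itself show that any process died from a signal. The function name find_signal_terminations is invented for this sketch.

```python
import re

# Matches postmaster reports such as:
#   LOG:  server process (PID 12345) was terminated by signal 11: Segmentation fault
# The PID and signal number in that sample are made up for illustration.
CHILD_SIGNAL_RE = re.compile(
    r"\(PID (?P<pid>\d+)\) was terminated by signal (?P<signal>\d+)"
)


def find_signal_terminations(log_text):
    """Return a list of (pid, signal) pairs for children killed by a signal."""
    return [
        (int(m.group("pid")), int(m.group("signal")))
        for m in CHILD_SIGNAL_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_crash.py tmp_check/log/013_partition_publisher.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        hits = find_signal_terminations(fh.read())
    print(hits if hits else "no child process was reported as killed by a signal")
```

An empty result suggests looking at the other nodes' logs for an ordinary ERROR, as was the case here, before chasing a missing core file.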

Best Regards
Andy Fan

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: No core file generated after PostgresNode->start

On Mon, May 11, 2020 at 10:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I have a standing note to check the permissions on /cores after any macOS
upgrade, because every so often Apple decides that that directory ought to
be read-only.

Thanks, that was my problem.

--
Robert Haas

#9 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#8)
Re: No core file generated after PostgresNode->start

On Tue, May 12, 2020 at 04:15:26PM -0400, Robert Haas wrote:

On Mon, May 11, 2020 at 10:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I have a standing note to check the permissions on /cores after any macOS
upgrade, because every so often Apple decides that that directory ought to
be read-only.

Thanks, that was my problem.

Was that a recent problem with Catalina and/or Mojave? I have never
seen an actual problem up to 10.13.
--
Michael

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#9)
Re: No core file generated after PostgresNode->start

Michael Paquier <michael@paquier.xyz> writes:

On Tue, May 12, 2020 at 04:15:26PM -0400, Robert Haas wrote:

On Mon, May 11, 2020 at 10:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I have a standing note to check the permissions on /cores after any macOS
upgrade, because every so often Apple decides that that directory ought to
be read-only.

Thanks, that was my problem.

Was that a recent problem with Catalina and/or Mojave? I have never
seen an actual problem up to 10.13.

I don't recall exactly when I started seeing this, but it was at least
a couple years back, so maybe Mojave. I think it's related to Apple's
efforts to make the root filesystem read-only. (It's not apparent to
me how come I can write in /cores when "mount" clearly reports

/dev/disk1s1 on / (apfs, local, read-only, journaled)

but nonetheless it works, as long as the directory permissions permit.)

regards, tom lane