postgresql-server exiting abnormally after an OS upgrade

Started by Hugo Osvaldo Barreraabout 11 years ago9 messagesbugs

hugo@barrera.io

about 11 years ago

Hi,

I've been running postgresql-server for some time on an OpenBSD box, and the
latest OS upgrade (which also upgraded from 9.3 to 9.4) seems to have broken it
in some way. I tried downgrading back to 9.3.6, but the issue persists, so it
looks like something else is broken, and not postgresql per se.

Of course, I did the proper upgrade procedure, pg_dumpall, move everything out
of the way, and reinit the db.

I also tried *not* reimporting the db on my last wipeout, and keeping the
completely empty database. It crashed too upon connection attempts (which fail
to authenticate because I haven't even created the users), so I guess the data
is no problem. The database is pretty small, and very few concurrent users (the
*whole* backup is ~400kB).

Here's the error which repeats ad infinitum:

LOG: database system was shut down at 2015-02-12 14:41:29 GMT
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: server process (PID 21205) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2015-02-12 14:42:04 GMT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/1690A90
LOG: redo is not required
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

I've searchd all over, and found little or nothing of help. Nothing seemed to
change the situation.

I believe that these settings *are* relevant, so:

kern.seminfo.semmni=256
kern.seminfo.semmns=512
kern.seminfo.semmnu=256
kern.seminfo.semmap=256

kern.shminfo.shmmax=134217728
kern.shminfo.shmseg=256

I also tried setting shared_buffers = 16MB, which, while small, discards
several potential configuration issues.

Note that before upgrading to the latest OS version, pg worked fine. Also
relevant, other users of the same release have confirmed not having this issue
and pg working fine for them.

Any pointer?

Thanks,

--
Hugo Osvaldo Barrera
A: Because we read from top to bottom, left to right.
Q: Why should I start my reply below the quoted text?

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 11 years ago

In reply to: Hugo Osvaldo Barrera (#1)

Re: postgresql-server exiting abnormally after an OS upgrade

On 02/13/2015 12:42 AM, Hugo Osvaldo Barrera wrote:

Here's the error which repeats ad infinitum:

LOG: database system was shut down at 2015-02-12 14:41:29 GMT
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: server process (PID 21205) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.

Can you get a core dump and a stack trace from it?

- Heikki

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Hugo Osvaldo Barrera

hugo@barrera.io

about 11 years ago

In reply to: Heikki Linnakangas (#2)

Re: server exiting abnormally after an OS upgrade

On 2015-02-13 15:29, Heikki Linnakangas wrote:

On 02/13/2015 12:42 AM, Hugo Osvaldo Barrera wrote:

Here's the error which repeats ad infinitum:

LOG: database system was shut down at 2015-02-12 14:41:29 GMT
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: server process (PID 21205) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.

Can you get a core dump and a stack trace from it?

- Heikki

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

I'm slightly unsure how to do this - the docs give some references on how to
select a single process and attempt to attach gdb to it, but they seem to be
too short lived for me to do anything (<1s between errors), so I can't really
work with that.

Do you have any suggestions for this?

I'll try to enable core dumps in the meantime and get back to you on this,
thank,

--
Hugo Osvaldo Barrera
A: Because we read from top to bottom, left to right.
Q: Why should I start my reply below the quoted text?

Hugo Osvaldo Barrera

hugo@barrera.io

about 11 years ago

In reply to: Heikki Linnakangas (#2)

Re: postgresql-server exiting abnormally after an OS upgrade

On 2015-02-13 15:29, Heikki Linnakangas wrote:

On 02/13/2015 12:42 AM, Hugo Osvaldo Barrera wrote:

Here's the error which repeats ad infinitum:

LOG: database system was shut down at 2015-02-12 14:41:29 GMT
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: server process (PID 21205) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.

Can you get a core dump and a stack trace from it?

- Heikki

It certainly took me a bit, but, with some help, I managed to get one:

(gdb) bt
#0 0x0000110a2815b92a in kill () at <stdin>:2
#1 0x0000110a28195119 in abort () at /usr/src/lib/libc/stdlib/abort.c:53
#2 0x0000110a2816a238 in memcpy (dst0=0xfb8d4, src0=0x6, length=0) at /usr/src/lib/libc/string/memcpy.c:65
#3 0x000011080cf8d1b1 in check_ip (raddr=0x110a899f7918, addr=0x110a899f9058, mask=0x110a899f9158) at hba.c:704
#4 0x000011080cf90a04 in check_hba (port=0x110a899f7800) at hba.c:1718
#5 0x000011080cf91d34 in hba_getauthmethod (port=0x110a899f7800) at hba.c:2256
#6 0x000011080cf88eb3 in ClientAuthentication (port=0x110a899f7800) at auth.c:307
#7 0x000011080d1edf5d in PerformAuthentication (port=0x110a899f7800) at postinit.c:223
#8 0x000011080d1eeae7 in InitPostgres (in_dbname=0x110af4508c00 "virtstart-dev", dboid=0,
username=0x110af4508be0 "virtstart-dev", out_dbname=0x0) at postinit.c:688
#9 0x000011080d0a3eb1 in PostgresMain (argc=1, argv=0x110af4508c20, dbname=0x110af4508c00 "virtstart-dev",
username=0x110af4508be0 "virtstart-dev") at postgres.c:3749
#10 0x000011080d033537 in BackendRun (port=Could not find the frame base for "BackendRun".
) at postmaster.c:4155
#11 0x000011080d032be8 in BackendStartup (port=0x110a899f7800) at postmaster.c:3829
#12 0x000011080d02f2d0 in ServerLoop () at postmaster.c:1597
#13 0x000011080d02e968 in PostmasterMain (argc=3, argv=0x7f7ffffd9658) at postmaster.c:1244
#14 0x000011080cf96dc8 in main (argc=Could not find the frame base for "main".
) at main.c:228
Current language: auto; currently asm

I stopped pg once it crashed and then bt'd the core. Does this help at all? Is
there something else I can provide?

Thanks,

--
Hugo Osvaldo Barrera
A: Because we read from top to bottom, left to right.
Q: Why should I start my reply below the quoted text?

Tom Lane

tgl@sss.pgh.pa.us

about 11 years ago

In reply to: Hugo Osvaldo Barrera (#4)

Re: postgresql-server exiting abnormally after an OS upgrade

Hugo Osvaldo Barrera <hugo@barrera.io> writes:

On 2015-02-13 15:29, Heikki Linnakangas wrote:

Can you get a core dump and a stack trace from it?

It certainly took me a bit, but, with some help, I managed to get one:

(gdb) bt
#0 0x0000110a2815b92a in kill () at <stdin>:2
#1 0x0000110a28195119 in abort () at /usr/src/lib/libc/stdlib/abort.c:53
#2 0x0000110a2816a238 in memcpy (dst0=0xfb8d4, src0=0x6, length=0) at /usr/src/lib/libc/string/memcpy.c:65
#3 0x000011080cf8d1b1 in check_ip (raddr=0x110a899f7918, addr=0x110a899f9058, mask=0x110a899f9158) at hba.c:704
#4 0x000011080cf90a04 in check_hba (port=0x110a899f7800) at hba.c:1718

Huh. You didn't say whether this is 9.3 or 9.4, but it doesn't matter
because that code is the same in both; the crash is happening here:

static bool
check_ip(SockAddr *raddr, struct sockaddr * addr, struct sockaddr * mask)
{
...
/*
* If we're connected on IPv6 but the file specifies an IPv4 address
* to match against, promote the latter to an IPv6 address before
* trying to match the client's address.
*/
struct sockaddr_storage addrcopy,
maskcopy;

memcpy(&addrcopy, &addr, sizeof(addrcopy));
---> memcpy(&maskcopy, &mask, sizeof(maskcopy));

Looking at this, the code seems wrong on its face: the memcpy sources
should be "addr" and "mask", not "&addr" and "&mask". Apparently, no one
has ever used or tested this case since it was put in :-(. I find that
a bit astonishing given the steady creep of IPv6 support, but it's hard
to see how the code could possibly work correctly as-is.

Probably the reason why this was triggered by an OS upgrade is that your
system is now choosing to make the DB connection over IPv6 not IPv4.
Or maybe it was doing that right along but you now have a memcpy that is
doing more checking of its arguments than before (this is evidently an
argument sanity check and not a straight SIGSEGV).

Anyway you could work around it for now by being sure to list IPv6
equivalent addresses before IPv4 addresses in your pg_hba.conf.
We'll fix it properly in the next releases.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Hugo Osvaldo Barrera

hugo@barrera.io

about 11 years ago

In reply to: Tom Lane (#5)

Re: postgresql-server exiting abnormally after an OS upgrade

On 2015-02-16 15:55, Tom Lane wrote:

Hugo Osvaldo Barrera <hugo@barrera.io> writes:

On 2015-02-13 15:29, Heikki Linnakangas wrote:

Can you get a core dump and a stack trace from it?

It certainly took me a bit, but, with some help, I managed to get one:

(gdb) bt
#0 0x0000110a2815b92a in kill () at <stdin>:2
#1 0x0000110a28195119 in abort () at /usr/src/lib/libc/stdlib/abort.c:53
#2 0x0000110a2816a238 in memcpy (dst0=0xfb8d4, src0=0x6, length=0) at /usr/src/lib/libc/string/memcpy.c:65
#3 0x000011080cf8d1b1 in check_ip (raddr=0x110a899f7918, addr=0x110a899f9058, mask=0x110a899f9158) at hba.c:704
#4 0x000011080cf90a04 in check_hba (port=0x110a899f7800) at hba.c:1718

Huh. You didn't say whether this is 9.3 or 9.4, but it doesn't matter
because that code is the same in both; the crash is happening here:

It's 9.4, the issue appeared after an OS upgrade which was accompanied by an
upgrade to 9.3 I *did* try to downgrade back to 9.3 at that point, but got the
same issue.

static bool
check_ip(SockAddr *raddr, struct sockaddr * addr, struct sockaddr * mask)
{
...
/*
* If we're connected on IPv6 but the file specifies an IPv4 address
* to match against, promote the latter to an IPv6 address before
* trying to match the client's address.
*/
struct sockaddr_storage addrcopy,
maskcopy;

memcpy(&addrcopy, &addr, sizeof(addrcopy));
---> memcpy(&maskcopy, &mask, sizeof(maskcopy));

Looking at this, the code seems wrong on its face: the memcpy sources
should be "addr" and "mask", not "&addr" and "&mask". Apparently, no one
has ever used or tested this case since it was put in :-(. I find that
a bit astonishing given the steady creep of IPv6 support, but it's hard
to see how the code could possibly work correctly as-is.

Probably the reason why this was triggered by an OS upgrade is that your
system is now choosing to make the DB connection over IPv6 not IPv4.
Or maybe it was doing that right along but you now have a memcpy that is
doing more checking of its arguments than before (this is evidently an
argument sanity check and not a straight SIGSEGV).

My host has been configured to use IPv6 by default for some time now, but it
*is* possible that there has been some OS-level change during this upgrade that
affected this in some way.

Anyway you could work around it for now by being sure to list IPv6
equivalent addresses before IPv4 addresses in your pg_hba.conf.
We'll fix it properly in the next releases.

Do you mean IPv6 *before* IPv4? Wouldn't that add a preference to IPv6?
I currently have the stock out-of-the-box pg_hba:

# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5

(I deleted commented out lines for readability).

regards, tom lane

Thanks,

--
Hugo Osvaldo Barrera
A: Because we read from top to bottom, left to right.
Q: Why should I start my reply below the quoted text?

Tom Lane

tgl@sss.pgh.pa.us

about 11 years ago

In reply to: Hugo Osvaldo Barrera (#6)

Re: postgresql-server exiting abnormally after an OS upgrade

Hugo Osvaldo Barrera <hugo@barrera.io> writes:

On 2015-02-16 15:55, Tom Lane wrote:

Probably the reason why this was triggered by an OS upgrade is that your
system is now choosing to make the DB connection over IPv6 not IPv4.
Or maybe it was doing that right along but you now have a memcpy that is
doing more checking of its arguments than before (this is evidently an
argument sanity check and not a straight SIGSEGV).

My host has been configured to use IPv6 by default for some time now, but it
*is* possible that there has been some OS-level change during this upgrade that
affected this in some way.

On further thought the answer almost certainly is that you now have a libc
version in which memcpy() complains strenuously (ie, aborts) if given
overlapping source and target areas. Such "helpfulness" is the new fad
among libc maintainers :-(

Anyway you could work around it for now by being sure to list IPv6
equivalent addresses before IPv4 addresses in your pg_hba.conf.
We'll fix it properly in the next releases.

Do you mean IPv6 *before* IPv4? Wouldn't that add a preference to IPv6?

No, not particularly, considering the cases are disjoint. An incoming
connection is one or the other.

I currently have the stock out-of-the-box pg_hba:

# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5

The problem comes when an incoming IPv6 connection is compared to that
127.0.0.1 line. If you're using only IPv6 anyway, you could just remove
that line. Or you could do something like this:

host all all ::1/128 md5
host all all ::0/0 reject
host all all 127.0.0.1/32 md5

so that all IPv6 connections are either accepted or rejected before
reaching the 127.0.0.1 line.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Hugo Osvaldo Barrera

hugo@barrera.io

about 11 years ago

In reply to: Tom Lane (#7)

Fwd: Re: postgresql-server exiting abnormally after upgrade to -snapshot

Hi,

This comes as a follow-up on a recent issue I mentioned and was discussed on
this list. (See thread [1]/messages/by-id/20150216202138.GB20474@athena.barrera.io).

The included patch fixes the issue and postgres has been working fine for me
with it. See this thread[2]https://marc.info/?t=142367714200001&r=1&w=2 for details.

Cheers,

----- Forwarded message from Jérémie Courrèges-Anglas <jca@wxcvbn.org> -----

Date: Mon, 16 Feb 2015 23:21:05 +0100
From: Jérémie Courrèges-Anglas <jca@wxcvbn.org>
Subject: Re: postgresql-server exiting abnormally after upgrade to -snapshot
To: Hugo Osvaldo Barrera <hugo@barrera.io>
Cc: misc@openbsd.org
Mail-Followup-To: Hugo Osvaldo Barrera <hugo@barrera.io>, misc@openbsd.org

jca@wxcvbn.org (Jérémie Courrèges-Anglas) writes:

Please try the diff below. It fixes the "backwards memcpy" problem
easily noticeable with psql -h ::1.

Updated diff. Thanks to Stuart for reminding me that netmasks sa_len
values can be much surprising.
$OpenBSD$
--- src/backend/libpq/hba.c.orig	Mon Feb 16 21:53:21 2015
+++ src/backend/libpq/hba.c	Mon Feb 16 23:08:38 2015
@@ -700,8 +700,13 @@ check_ip(SockAddr *raddr, struct sockaddr * addr, stru
struct sockaddr_storage addrcopy,
maskcopy;
-		memcpy(&addrcopy, &addr, sizeof(addrcopy));
-		memcpy(&maskcopy, &mask, sizeof(maskcopy));
+		memcpy(&addrcopy, addr, sizeof(struct sockaddr_in));
+		/*
+		 * On some OSes, if mask is obtained from eg. getifaddrs(3), sa_len
+		 * can vary wildly. We already know that addr->sa_family == AF_INET,
+		 * so just use sizeof(struct sockaddr_in).
+		 */
+		memcpy(&maskcopy, mask, sizeof(struct sockaddr_in));
pg_promote_v4_to_v6_addr(&addrcopy);
pg_promote_v4_to_v6_mask(&maskcopy);
--
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE

----- End forwarded message -----

[1]: /messages/by-id/20150216202138.GB20474@athena.barrera.io
[2]: https://marc.info/?t=142367714200001&r=1&w=2

--
Hugo Osvaldo Barrera
A: Because we read from top to bottom, left to right.
Q: Why should I start my reply below the quoted text?

Tom Lane

tgl@sss.pgh.pa.us

about 11 years ago

In reply to: Hugo Osvaldo Barrera (#8)

Re: Fwd: Re: postgresql-server exiting abnormally after upgrade to -snapshot

Hugo Osvaldo Barrera <hugo@barrera.io> writes:

The included patch fixes the issue and postgres has been working fine for me
with it. See this thread[2] for details.

Hm, I was wondering whether copying sizeof(struct sockaddr_storage) might
not be a risky choice. However, the conclusion on the -hackers thread

/messages/by-id/23855.1424136475@sss.pgh.pa.us

seems to be that we're going to eliminate the entire issue by just ripping
out this code section, lock stock n barrel.

Thanks for confirming that this fixes it for you, though --- at least
we now have a certain diagnosis of the problem.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs