PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos
Hi All,
We build Postgres on Power and x86. With the latest Postgres 11 release (11.2) we get an error on
power8 ppc64le (Redhat and CentOS). No error on SUSE on power8.
No error on x86_64 (RH, CentOS and SUSE).
from the log file
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv6 address "::", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-09 12:30:10 UTC pid:204 xid:0 ip: LOG: database system was shut down at 2019-04-09 12:27:09 UTC
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: database system is ready to accept connections
2019-04-09 12:31:46 UTC pid:203 xid:0 ip: LOG: received SIGHUP, reloading configuration files
2019-04-09 12:35:10 UTC pid:205 xid:0 ip: PANIC: could not flush dirty data: Operation not permitted
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: checkpointer process (PID 205) was terminated by signal 6: Aborted
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: terminating any other active server processes
2019-04-09 12:35:10 UTC pid:208 xid:0 ip: WARNING: terminating connection because of crash of another server process
2019-04-09 12:35:10 UTC pid:208 xid:0 ip: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-09 12:35:10 UTC pid:208 xid:0 ip: HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: all server processes terminated; reinitializing
2019-04-09 12:35:10 UTC pid:224 xid:0 ip: LOG: database system was interrupted; last known up at 2019-04-09 12:30:10 UTC
2019-04-09 12:35:10 UTC pid:224 xid:0 ip: PANIC: could not flush dirty data: Operation not permitted
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: startup process (PID 224) was terminated by signal 6: Aborted
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: aborting startup due to startup process failure
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: database system is shut down
pg_config output:
BINDIR = /usr/local/postgres/11/bin
DOCDIR = /usr/local/postgres/11/share/doc
HTMLDIR = /usr/local/postgres/11/share/doc
INCLUDEDIR = /usr/local/postgres/11/include
PKGINCLUDEDIR = /usr/local/postgres/11/include
INCLUDEDIR-SERVER = /usr/local/postgres/11/include/server
LIBDIR = /usr/local/postgres/11/lib
PKGLIBDIR = /usr/local/postgres/11/lib
LOCALEDIR = /usr/local/postgres/11/share/locale
MANDIR = /usr/local/postgres/11/share/man
SHAREDIR = /usr/local/postgres/11/share
SYSCONFDIR = /usr/local/postgres/etc
PGXS = /usr/local/postgres/11/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-tclconfig=/usr/lib64' '--with-perl' '--with-python' '--with-tcl' '--with-openssl' '--with-pam' '--with-gssapi' '--enable-nls' '--with-libxml' '--with-libxslt' '--with-ldap' '--prefix=/usr/local/postgres/11' 'CFLAGS=-O3 -g -pipe -Wall -D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -m64 -mcpu=power8 -mtune=power8 -DLINUX_OOM_SCORE_ADJ=0' '--with-libs=/usr/lib' '--with-includes=/usr/include' '--with-uuid=e2fs' '--sysconfdir=/usr/local/postgres/etc' '--with-llvm' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O3 -g -pipe -Wall -D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -m64 -mcpu=power8 -mtune=power8 -DLINUX_OOM_SCORE_ADJ=0
CFLAGS_SL = -fPIC
LDFLAGS = -L/usr/local/lib -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/usr/local/postgres/11/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lxslt -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 11.2
I get the feeling this is related to the fsync() issue.
Why is it happening on Power RH and CentOS, but not on the other platforms?
Let me know if I need to provide any more information.
Reiner
Hi,
On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
power8 ppc64le (Redhat and CentOS). No error on SUSE on power8
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv6 address "::", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-09 12:30:10 UTC pid:204 xid:0 ip: LOG: database system was shut down at 2019-04-09 12:27:09 UTC
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: database system is ready to accept connections
2019-04-09 12:31:46 UTC pid:203 xid:0 ip: LOG: received SIGHUP, reloading configuration files
2019-04-09 12:35:10 UTC pid:205 xid:0 ip: PANIC: could not flush dirty data: Operation not permitted
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: checkpointer process (PID 205) was terminated by signal 6: Aborted
Any chance you can strace this? Because I don't understand how you'd get
a permission error here.
I get the feeling this is related to the fsync() issue.
why is it happening on Power RH and CentOS, but not on the other platforms?
Yea, the PANIC is due to various OSs, including Linux, basically feeling
free to discard any dirty data after any integrity related calls fail
(we could narrow it down, but it's hard, given the variability between
versions). That is, if they signal such issues at all :(
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
power8 ppc64le (Redhat and CentOS). No error on SUSE on power8
Any chance you can strace this? Because I don't understand how you'd get
a permission error here.
What kind of filesystem are the database files on?
regards, tom lane
On Sat, Apr 13, 2019 at 7:23 AM Andres Freund <andres@anarazel.de> wrote:
On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
power8 ppc64le (Redhat and CentOS). No error on SUSE on power8
Huh, I wonder what is different. I don't see this on EDB's CentOS
7.1 POWER8 system with an XFS filesystem. I ran it under strace -f
and saw this:
[pid 51614] sync_file_range2(0x19, 0x2, 0x8000, 0x2000, 0x2, 0x8) = 0
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv6 address "::", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-09 12:30:10 UTC pid:204 xid:0 ip: LOG: database system was shut down at 2019-04-09 12:27:09 UTC
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: database system is ready to accept connections
2019-04-09 12:31:46 UTC pid:203 xid:0 ip: LOG: received SIGHUP, reloading configuration files
2019-04-09 12:35:10 UTC pid:205 xid:0 ip: PANIC: could not flush dirty data: Operation not permitted
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: checkpointer process (PID 205) was terminated by signal 6: Aborted
Any chance you can strace this? Because I don't understand how you'd get
a permission error here.
Me neither. I hacked my tree so that it would use the msync() version
instead of the sync_file_range() version but that worked too.
--
Thomas Munro
https://enterprisedb.com
sent by smoke signals at great danger to my self.
On 12 Apr 2019, at 23:16, Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Apr 13, 2019 at 7:23 AM Andres Freund <andres@anarazel.de> wrote:
On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
power8 ppc64le (Redhat and CentOS). No error on SUSE on power8
Huh, I wonder what is different. I don't see this on EDB's CentOS
7.1 POWER8 system with an XFS filesystem. I ran it under strace -f
and saw this:
[pid 51614] sync_file_range2(0x19, 0x2, 0x8000, 0x2000, 0x2, 0x8) = 0
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on IPv6 address "::", port 5432
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-09 12:30:10 UTC pid:204 xid:0 ip: LOG: database system was shut down at 2019-04-09 12:27:09 UTC
2019-04-09 12:30:10 UTC pid:203 xid:0 ip: LOG: database system is ready to accept connections
2019-04-09 12:31:46 UTC pid:203 xid:0 ip: LOG: received SIGHUP, reloading configuration files
2019-04-09 12:35:10 UTC pid:205 xid:0 ip: PANIC: could not flush dirty data: Operation not permitted
2019-04-09 12:35:10 UTC pid:203 xid:0 ip: LOG: checkpointer process (PID 205) was terminated by signal 6: Aborted
Any chance you can strace this? Because I don't understand how you'd get
a permission error here.
Me neither. I hacked my tree so that it would use the msync() version
instead of the sync_file_range() version but that worked too.
--
Thomas Munro
https://enterprisedb.com
I forgot to mention that this is happening in a docker container.
I want to test it on a VM to see if it is container related. I am sick at the moment, so I'm unable to run the test right now.
Reiner
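To compare the container with a VM, a small probe along the following lines could be run in both environments. It is only a sketch: it calls the C library's sync_file_range() through ctypes on a scratch file, using the flag value 0x2 that appears in the strace output above, and reports the errno it gets back. The function name probe_sync_file_range is invented for illustration, and the script assumes a Linux C library that exports sync_file_range.

```python
import ctypes
import errno
import os
import tempfile

# Flag value from <linux/fs.h>; the strace line in this thread shows 0x2.
SYNC_FILE_RANGE_WRITE = 2


def probe_sync_file_range(directory=None):
    """Write 8 kB to a scratch file and ask the kernel to start writeback.

    Returns 0 on success, otherwise the errno reported by the call.
    `directory` selects where the scratch file lives (for example the
    filesystem that holds the data directory).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fn = getattr(libc, "sync_file_range", None)
    if fn is None:
        # The C library does not export the function at all.
        return errno.ENOSYS
    # int sync_file_range(int fd, off64_t offset, off64_t nbytes,
    #                     unsigned int flags);
    fn.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    fn.restype = ctypes.c_int

    with tempfile.TemporaryFile(dir=directory) as f:
        f.write(b"\0" * 8192)
        f.flush()  # push Python's buffer into the kernel page cache
        ctypes.set_errno(0)
        if fn(f.fileno(), 0, 8192, SYNC_FILE_RANGE_WRITE) == 0:
            return 0
        return ctypes.get_errno()


if __name__ == "__main__":
    err = probe_sync_file_range()
    if err == 0:
        print("sync_file_range succeeded")
    else:
        print("sync_file_range failed:",
              errno.errorcode.get(err, err), os.strerror(err))
```

If the same binary reports EPERM inside the container and success on the VM, that would point at the container configuration rather than the filesystem or the PostgreSQL build.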
On Fri, Apr 12, 2019 at 08:04:00PM +0200, reiner peterke wrote:
We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
power8 ppc64le (Redhat and CentOS). No error on SUSE on power8
No error on x86_64 (RH, Centos and SUSE)
So there's an error on power8 with RH but not SUSE.
What kernel versions are used on the working and the failing systems?
Justin
On Mon, Apr 15, 2019 at 7:57 PM <zedaardv@gmail.com> wrote:
I forgot to mention that this is happening in a docker container.
Huh, so there may be some configuration of Linux container that can
fail here with EPERM, even though that error does not appear in
the man page, and doesn't make much intuitive sense. Would be good to
figure out how that happens.
If we could somehow confirm* that sync_file_range() with the
non-waiting flags we are using is non-destructive of error state, as
Andres speculated (that is, it cannot eat the only error report we're
ever going to get to tell us that buffered dirty data may have been
dropped), then I suppose we could just remove the data_sync_elevel()
promotion here. As with the WSL case (before the PANIC commit and the
subsequent don't-repeat-the-warning-forever patch), a user of this
posited EPERM-generating container configuration would then get
repeated warnings in the log forever (as they presumably did before).
Repeated WARNING messages are probably OK here, I think... I mean, if,
say, someone complains that FlubOS's Linux emulation fails here with
EIEIO, I'd say they should put up with the warnings and complain over
on the flub-hackers list, or whatever, and I'd say the same for
containers that generate EPERM: either the man page or the container
technology needs work.
But... I still think we should try to avoid making decisions based on
knowledge of kernel implementation details, if it can be avoided. I'd
probably rather treat EPERM explicitly differently (and eventually
EIEIO too, if a report comes in) than drop the current paranoid coding
completely.
*I'm not looking at it myself. A sync_file_range() implementation is
on my list of potential FreeBSD projects for a rainy day, so I don't
want to study anything but the man page, even if it's wrong.
--
Thomas Munro
https://enterprisedb.com
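The "treat EPERM explicitly differently" idea above can be read as a small errno-to-severity table. The sketch below is illustrative only: it is written in Python rather than the server's C, the names WARNING, PANIC, REQUEST_NOT_ATTEMPTED and flush_failure_level are hypothetical, and listing ENOSYS next to EPERM is an assumption based on the WSL case mentioned above.

```python
import errno

# Hypothetical labels standing in for the server's log levels.
WARNING = "WARNING"
PANIC = "PANIC"

# Assumption: these errnos mean the flush request was refused before it
# reached the filesystem, so it cannot have consumed a writeback error.
REQUEST_NOT_ATTEMPTED = frozenset({errno.ENOSYS, errno.EPERM})


def flush_failure_level(err):
    """Pick a severity for a failed flush request.

    Errors on the explicit list only warn; everything else keeps the
    paranoid behaviour and is promoted to PANIC.
    """
    if err in REQUEST_NOT_ATTEMPTED:
        return WARNING
    return PANIC


if __name__ == "__main__":
    for e in (errno.EPERM, errno.ENOSYS, errno.EIO):
        print(errno.errorcode[e], "->", flush_failure_level(e))
```

Keeping the exceptions in an explicit set means an unexpected errno still takes the PANIC path, which matches the preference for not relying on kernel implementation details.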
On Wed, Apr 17, 2019 at 1:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Apr 15, 2019 at 7:57 PM <zedaardv@gmail.com> wrote:
I forgot to mention that this is happening in a docker container.
Huh, so there may be some configuration of Linux container that can
fail here with EPERM, even though that error does not appear in
the man page, and doesn't make much intuitive sense. Would be good to
figure out how that happens.
Steve Dodd ran into the same problem in Borg[1]. It looks like what's
happening here is that on PowerPC and ARM systems, there is a second
system call sync_file_range2 that has the arguments arranged in a
better order for their calling conventions (see Notes section of man
sync_file_range), and glibc helpfully translates for you, but some
container technologies forgot to include sync_file_range2 in their
syscall forwarding table. Perhaps we should just handle this with the
not_implemented_by_kernel mechanism I added for WSL.
[1]: https://lists.freedesktop.org/archives/systemd-devel/2019-August/043276.html
--
Thomas Munro
https://enterprisedb.com
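The argument reshuffle described above can be sketched as follows. This is a rough model, not glibc's code: flush_syscall and filtered_errno_name are made-up names, the machine-name prefix match is a simplifying assumption rather than a complete list of affected ABIs, and the filter is reduced to a plain set of allowed syscall names.

```python
import platform

SYNC_FILE_RANGE_WRITE = 2  # flag value from <linux/fs.h>


def flush_syscall(fd, offset, nbytes, flags, machine=None):
    """Return (syscall_name, args) roughly as the kernel would see them."""
    machine = machine or platform.machine()
    # Assumption: a prefix match is enough to illustrate the split.
    if machine.startswith(("ppc", "arm")):
        # sync_file_range2 takes flags second, so the two 64-bit values
        # that follow need no padding register.
        return "sync_file_range2", (fd, flags, offset, nbytes)
    return "sync_file_range", (fd, offset, nbytes, flags)


def filtered_errno_name(syscall_name, allowed):
    """Model a filter that answers EPERM for any syscall not in `allowed`."""
    return None if syscall_name in allowed else "EPERM"


if __name__ == "__main__":
    allowed = {"sync_file_range", "fsync"}  # sync_file_range2 forgotten
    for m in ("x86_64", "ppc64le"):
        name, args = flush_syscall(0x19, 0x8000, 0x2000,
                                   SYNC_FILE_RANGE_WRITE, machine=m)
        print(m, name, [hex(a) for a in args],
              filtered_errno_name(name, allowed))
```

With fd 0x19, offset 0x8000 and length 0x2000, the ppc64le ordering appears consistent with the first four values in the strace line quoted earlier in the thread.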
On Mon, Aug 19, 2019 at 7:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Apr 17, 2019 at 1:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Apr 15, 2019 at 7:57 PM <zedaardv@gmail.com> wrote:
I forgot to mention that this is happening in a docker container.
Huh, so there may be some configuration of Linux container that can
fail here with EPERM, even though that error does not appear in
the man page, and doesn't make much intuitive sense. Would be good to
figure out how that happens.
Steve Dodd ran into the same problem in Borg[1]. It looks like what's
happening here is that on PowerPC and ARM systems, there is a second
system call sync_file_range2 that has the arguments arranged in a
better order for their calling conventions (see Notes section of man
sync_file_range), and glibc helpfully translates for you, but some
container technologies forgot to include sync_file_range2 in their
syscall forwarding table. Perhaps we should just handle this with the
not_implemented_by_kernel mechanism I added for WSL.
I've just heard that it was fixed overnight in seccomp, which is
probably what Docker is using to give you EPERM for syscalls it
doesn't like the look of:
https://github.com/systemd/systemd/pull/13352/commits/90ddac6087b5f8f3736364cfdf698e713f7e8869
Not being a Docker user, I'm not sure if/when that will flow into the
right places in a timely fashion but if not it looks like you can
always configure your own profile or take one from somewhere else,
probably something like this:
https://github.com/moby/moby/commit/52d8f582c331e35f7b841171a1c22e2d9bbfd0b8
So it looks like we don't need to do anything at all on our side,
unless someone knows better.
--
Thomas Munro
https://enterprisedb.com
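If it does come down to configuring your own profile, a quick check like the one below could confirm whether a given profile lets the call through. This is a rough sketch that assumes the JSON layout used by Docker's default seccomp profile: a top-level "defaultAction" plus a "syscalls" list whose entries carry "names", "action" and an optional "includes" object with "arches". The function name allows_syscall is made up for illustration, and conditions such as capabilities or argument filters are ignored.

```python
import json


def allows_syscall(profile, name, arch=None):
    """Return True if `profile` appears to allow syscall `name`.

    `profile` is the parsed JSON of a seccomp profile. If `arch` is
    given, rules restricted to other architectures are skipped.
    """
    for rule in profile.get("syscalls", []):
        if rule.get("action") != "SCMP_ACT_ALLOW":
            continue
        # Older profiles use a single "name" instead of a "names" list.
        names = rule.get("names") or [rule.get("name")]
        if name not in names:
            continue
        arches = (rule.get("includes") or {}).get("arches")
        if arches and arch is not None and arch not in arches:
            continue
        return True
    # No matching rule: fall back to the profile's default action.
    return profile.get("defaultAction") == "SCMP_ACT_ALLOW"


if __name__ == "__main__":
    # Inline example profile; a real one would be read from a file.
    sample = json.loads("""
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "syscalls": [
        {"names": ["fsync", "sync_file_range"], "action": "SCMP_ACT_ALLOW"}
      ]
    }
    """)
    for syscall in ("sync_file_range", "sync_file_range2"):
        print(syscall, allows_syscall(sample, syscall, arch="ppc64le"))
```

A profile that passes for sync_file_range but not for sync_file_range2 would reproduce the situation described in this thread on ppc64le.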