[Bus error] huge_pages default value (try) not fall back

Started by Fan Liu, about 6 years ago, 11 messages, bugs
#1 Fan Liu
fan.liu@ericsson.com

Hi,

We have seen a bus error when running PostgreSQL in a container (on K8s). According to our current findings, there is a bug in K8s, and they are working on it.
But we also want to know why the huge_pages default value (try) didn't fall back.

K8s BUG https://github.com/kubernetes/kubernetes/issues/71233

Problem quick summary:
When huge pages are not working, initdb produces a bus error.

Logs:
2020-02-17 06:33:21,606 INFO: trying to bootstrap a new cluster
2020-02-17 06:33:21,610 INFO: pg_ctl args: ('-o', '--auth-host=md5 --auth-local=trust --encoding=UTF8 --locale=en_US.UTF-8 --data-checksums --username=postgres --pwfile=/tmp/tmpcdHEH3'), {}
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

fixing permissions on existing directory /var/lib/postgresql/data/pgdata ... ok
creating subdirectories ... ok
sh: line 1: 100 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 102 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 104 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 106 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 108 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
selecting default max_connections ... 20
sh: line 1: 110 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 112 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 114 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 116 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 118 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 120 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 122 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 124 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 126 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 128 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 130 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 132 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 134 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 136 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 138 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 140 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 142 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 144 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1: 146 Bus error (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
selecting default shared_buffers ... 400kB
selecting default timezone ... UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
child process was terminated by signal 7: Bus error
initdb: removing contents of data directory "/var/lib/postgresql/data/pgdata"
pg_ctl: database system initialization failed
running bootstrap script ... 2020-02-17 06:33:22,254 INFO: removing initialize key after failed attempt to bootstrap the cluster
------------ end of log -------------

Hugepage:

Output from "kubectl describe node"
========================
Capacity:
  cpu: 56
  ephemeral-storage: 365912640Ki
  hugepages-1Gi: 16Gi
  hugepages-2Mi: 0
  memory: 131922340Ki
  pods: 110
Allocatable:
  cpu: 55900m
  ephemeral-storage: 337225088466
  hugepages-1Gi: 16Gi
  hugepages-2Mi: 0
  memory: 114792724Ki
  pods: 110

========================
Grub command line:
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0"

========================
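
As a cross-check of the two outputs above, the kernel command line can be turned into a per-size page count and compared with the "hugepages-1Gi: 16Gi" capacity that Kubernetes reports. The following is only a sketch: the function names are hypothetical, and it assumes that each hugepages= value applies to the most recent hugepagesz= before it (a hugepages= that appears before any hugepagesz= is ignored here).

```python
from typing import Dict, Optional

# Size suffixes used on the kernel command line.
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert a size such as '2M' or '1G' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def hugepages_from_cmdline(cmdline: str) -> Dict[int, int]:
    """Map huge page size in bytes to the number of pages requested at boot."""
    pools: Dict[int, int] = {}
    current_size: Optional[int] = None
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "hugepagesz":
            current_size = parse_size(value)
        elif key == "hugepages" and current_size is not None:
            pools[current_size] = int(value)
    return pools
```

For the command line shown above this gives 16 pages of 1 GiB and 0 pages of 2 MiB, that is 16 GiB in total, which agrees with the node capacity.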

BRs,
Fan Liu
ADP Document Database PG

#2 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Fan Liu (#1)
Re: [Bus error] huge_pages default value (try) not fall back

On Tue, Feb 18, 2020 at 07:52:50AM +0000, Fan Liu wrote:
Hi,

We have seen a bus error when running PostgreSQL in a container (on K8s). According to our current findings, there is a bug in K8s, and they are working on it.
But we also want to know why the huge_pages default value (try) didn't fall back.

K8s BUG https://github.com/kubernetes/kubernetes/issues/71233

Problem quick summary:
When huge pages are not working, initdb produces a bus error.

Thanks for reporting!

This one is fun. If I understand everything correctly, Postgres will
fall back to regular pages if it fails to allocate huge pages. But in
this case the kernel actually allocates everything without problems
(there are some huge pages available on the node, after all) and sends
SIGBUS only when the first page fault within this cgroup happens; see
the docs [1]:

The HugeTLB controller allows to limit the HugeTLB usage per control
group and enforces the controller limit during page fault. Since
HugeTLB doesn't support page reclaim, enforcing the limit at page
fault time implies that, the application will get SIGBUS signal if
it tries to access HugeTLB pages beyond its limit. This requires the
application to know beforehand how much HugeTLB pages it would
require for its use.

Unfortunately I'm not sure what would be the best solution in this
situation.

[1]: https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v1/hugetlb.rst.txt
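
One way to "know beforehand" could be a pre-flight check that compares the memory the server is about to request with the cgroup's HugeTLB limit, and turns huge pages off when it does not fit. The sketch below is not something Postgres does; the function names are hypothetical, and it assumes a cgroup v1 hugetlb controller mounted at /sys/fs/cgroup/hugetlb with the limit file named hugetlb.<size>.limit_in_bytes.

```python
from pathlib import Path
from typing import Optional

# Assumption: conventional mount point of the cgroup v1 hugetlb controller.
CGROUP_HUGETLB_DIR = Path("/sys/fs/cgroup/hugetlb")


def read_hugetlb_limit(page_label: str = "1GB",
                       cgroup_dir: Path = CGROUP_HUGETLB_DIR) -> Optional[int]:
    """Return the cgroup's HugeTLB limit in bytes, or None if it cannot be read."""
    limit_file = cgroup_dir / f"hugetlb.{page_label}.limit_in_bytes"
    try:
        return int(limit_file.read_text().strip())
    except (OSError, ValueError):
        return None


def huge_pages_fit(required_bytes: int, page_size: int,
                   limit_bytes: Optional[int]) -> bool:
    """True if required_bytes, rounded up to whole huge pages, fits under the limit."""
    if limit_bytes is None:
        return False  # unknown limit: be conservative and avoid huge pages
    pages = -(-required_bytes // page_size)  # ceiling division
    return pages * page_size <= limit_bytes
```

With a limit of 0 bytes, as a container without a huge page request would typically see, huge_pages_fit returns False for any non-zero request, so a wrapper script could start the server with huge_pages = off. Treating an unreadable limit as "does not fit" is a deliberately cautious choice for this sketch; outside a restricted cgroup it would disable huge pages unnecessarily.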

#3 Fan Liu
fan.liu@ericsson.com
In reply to: Dmitry Dolgov (#2)
RE: [Bus error] huge_pages default value (try) not fall back

-----Original Message-----
From: Dmitry Dolgov <9erthalion6@gmail.com>
Sent: February 18, 2020 17:33
To: Fan Liu <fan.liu@ericsson.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: [Bus error] huge_pages default value (try) not fall back

---------------------------------------------------------

Hi Dmitry,
Thank you for the explanation.

In the K8s bug https://github.com/kubernetes/kubernetes/issues/71233, someone proposed a workaround:

"Modify the docker image to be able to set huge_pages = off in /usr/share/postgresql/postgresql.conf.sample before initdb was ran (this is what I did)."

I am working on this workaround, but have not really tested it yet. Do you think it could avoid this issue? Or do you see any side effects of this workaround?

BRs,
Fan Liu
ADP Document Database PG

#4 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Fan Liu (#3)
Re: [Bus error] huge_pages default value (try) not fall back

On Tue, Feb 18, 2020 at 12:31:51PM +0000, Fan Liu wrote:

"Modify the docker image to be able to set huge_pages = off in /usr/share/postgresql/postgresql.conf.sample before initdb was ran (this is what I did)."

I am working on this workaround, but have not really tested it yet. Do you think it could avoid this issue? Or do you see any side effects of this workaround?

If you don't necessarily need to use huge pages, then yes, I guess it
should work. In case initdb tries to read the config from some other
location, you can always point it to whatever you need via the -L option.
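
The edit to postgresql.conf.sample described in the workaround can be scripted as part of the image build. This is a minimal sketch, assuming the sample file carries a commented-out line such as "#huge_pages = try"; the function name is hypothetical.

```python
import re

# Matches an optionally commented-out huge_pages setting at the start of a line.
_HUGE_PAGES_RE = re.compile(r"^[ \t]*#?[ \t]*huge_pages[ \t]*=.*$", re.MULTILINE)


def disable_huge_pages(conf_text: str) -> str:
    """Return conf_text with huge_pages forced to off."""
    new_line = "huge_pages = off"
    if _HUGE_PAGES_RE.search(conf_text):
        # Replace the existing (possibly commented-out) setting in place.
        return _HUGE_PAGES_RE.sub(new_line, conf_text)
    # Setting not present: append it, keeping the file newline-terminated.
    sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return conf_text + sep + new_line + "\n"
```

The result would be written back to /usr/share/postgresql/postgresql.conf.sample (the path quoted in the workaround) before initdb runs, or to a copy of the share directory that initdb is then pointed at with -L.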

#5 Fan Liu
fan.liu@ericsson.com
In reply to: Fan Liu (#1)
RE: [Bus error] huge_pages default value (try) not fall back

-----Original Message-----
From: Fan Liu
Sent: February 21, 2020 10:15
To: Dmitry Dolgov <9erthalion6@gmail.com>; pgsql-bugs@lists.postgresql.org
Subject: RE: [Bus error] huge_pages default value (try) not fall back

-----Original Message-----

From: Dmitry Dolgov <9erthalion6@gmail.com>
Sent: February 19, 2020 17:36
To: Fan Liu <fan.liu@ericsson.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: [Bus error] huge_pages default value (try) not fall back

-----------------------------------

Hi Dmitry,

I have tried the workaround. The result is that there is still a bus error, but PostgreSQL did come up.

I don't quite understand why this could happen.

I have attached the core dump file; could you take a look?

$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/lib/postgresql10/bin/postgres --boot -x0 -F -c max_connections=20 -c share', real uid: 26, effective uid: 26, real gid: 26, effective gid: 26, execfn: '/usr/lib/postgresql10/bin/postgres', platform: 'x86_64'

BRs,
Fan Liu

Attachments:

core (application/octet-stream; name=core)

#6 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Fan Liu (#5)
Re: [Bus error] huge_pages default value (try) not fall back

On Fri, Feb 21, 2020, 3:19 AM Fan Liu <fan.liu@ericsson.com> wrote:

Attached core dump file, could you take a look?

I can take a look on Monday, but in the meantime, if you have issues at
the initdb stage, you can try running it with the -d option and checking
the debugging output; that should be helpful.

#7 Fan Liu
fan.liu@ericsson.com
In reply to: Dmitry Dolgov (#6)
RE: [Bus error] huge_pages default value (try) not fall back

From: Dmitry Dolgov <9erthalion6@gmail.com>
Sent: February 22, 2020 2:58
To: Fan Liu <fan.liu@ericsson.com>
Cc: PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: [Bus error] huge_pages default value (try) not fall back

On Fri, Feb 21, 2020, 3:19 AM Fan Liu <fan.liu@ericsson.com> wrote:

Attached core dump file, could you take a look?

I can take a look on Monday, but in the meantime, if you have issues at the initdb stage, you can try running it with the -d option and checking the debugging output; that should be helpful.

Hi Dmitry,

I appreciate your support.
I will work on a new package and ask my customer to validate it and collect logs.

BRs,
Fan Liu

#8 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Fan Liu (#7)
Re: [Bus error] huge_pages default value (try) not fall back

On Fri, Feb 21, 2020, 3:19 AM Fan Liu <fan.liu@ericsson.com> wrote:

Attached core dump file, could you take a look?

I can take a look on Monday, but in the meantime, if you have issues at
the initdb stage, you can try running it with the -d option and checking
the debugging output; that should be helpful.

Unfortunately, I wasn't able to get a meaningful stack trace from this
dump, most likely due to different versions (I hoped that the latest
pgdg package with 10 for bionic would fit). But you can also try to
generate and post one yourself by following these instructions [1].

[1]: https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

#9 Fan Liu
fan.liu@ericsson.com
In reply to: Dmitry Dolgov (#8)
RE: [Bus error] huge_pages default value (try) not fall back

I created a new package for my customer with -d for initdb, but they said they can accept the current workaround.
As I don't have a node with huge pages enabled, I will not be able to collect the logs.

I'd like to thank you again for the support and troubleshooting. I think we may close this ticket.

BRs,
Fan Liu
ADP Document Database PG

-----Original Message-----
From: Dmitry Dolgov <9erthalion6@gmail.com>
Sent: February 24, 2020 18:39
To: Fan Liu <fan.liu@ericsson.com>
Cc: PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: [Bus error] huge_pages default value (try) not fall back

#10 Odin Ugedal
odin@ugedal.com
In reply to: Fan Liu (#9)
Re: [Bus error] huge_pages default value (try) not fall back

Hi,

I stumbled upon this issue when working on the related Kubernetes
issue that was referenced a few mails back. From what I understand,
this issue is (or may be) a result of how the hugetlb cgroup enforces
the "limit_in_bytes" limit for huge pages. A process should
theoretically not crash like this under normal circumstances when
using memory received from a successful mmap. The value set in
"limit_in_bytes" is only enforced during page allocation, and _not_
when mapping pages using mmap. This results in a successful mmap for
-n- huge pages as long as the system has -n- free huge pages, even
though the size is bigger than "limit_in_bytes". The process then
reserves the huge page memory and makes it inaccessible to other
processes.

The real issue arises when postgres tries to write to the memory it
received from mmap, and the kernel tries to allocate the reserved huge
page memory. Since the cgroup does not allow it to do so, the process
is killed with a bus error (SIGBUS).
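
The two-step behaviour described above (reservation at mmap time, limit check at page-fault time) can be illustrated with a toy model. This is only a sketch of the accounting as explained in this thread, not kernel code; the class and method names are hypothetical, and a Python exception stands in for SIGBUS.

```python
class BusError(Exception):
    """Stands in for the SIGBUS the kernel would deliver."""


class HugePagePool:
    """Toy model: pages are reserved at mmap time, charged to the cgroup at fault time."""

    def __init__(self, free_pages: int, cgroup_limit_pages: int):
        self.free_pages = free_pages
        self.cgroup_limit_pages = cgroup_limit_pages
        self.cgroup_used_pages = 0

    def mmap(self, pages: int) -> bool:
        """Reserve pages; only system-wide availability is checked here."""
        if pages > self.free_pages:
            return False  # mmap fails, so the caller can fall back to normal pages
        self.free_pages -= pages
        return True

    def touch_page(self) -> None:
        """First access to a reserved page; the cgroup limit is checked only now."""
        if self.cgroup_used_pages + 1 > self.cgroup_limit_pages:
            raise BusError("cgroup HugeTLB limit exceeded at page fault")
        self.cgroup_used_pages += 1
```

With HugePagePool(free_pages=16, cgroup_limit_pages=0), mmap(1) succeeds, so a "try, then fall back" strategy never gets the chance to fall back, and the first touch_page() raises BusError. With free_pages=0 the mmap itself fails and the fallback works as intended.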

This issue has been fixed in Linux by this patch,
https://lkml.org/lkml/2020/2/3/1153, which adds a new element of
control to the cgroup. There are, however, no container runtimes that
use it yet, and only 5.7+ kernels (afaik) support it, but the
progress can be tracked here:
https://github.com/opencontainers/runtime-spec/issues/1050. The fix
for the upstream Kubernetes issue
(https://github.com/opencontainers/runtime-spec/issues/1050), which made
Kubernetes set the wrong value for the top-level "limit_in_bytes" when
the pre-allocated page count increased after Kubernetes (kubelet)
startup, will hopefully land in Kubernetes 1.19 (or 1.20). Fingers
crossed!

Hopefully this makes some sense, and gives some insights into the issue...

Best regards,
Odin Ugedal

#11 Fan Liu
fan.liu@ericsson.com
In reply to: Odin Ugedal (#10)
RE: [Bus error] huge_pages default value (try) not fall back

Thank you so much for the information.

BRs,
Fan Liu
ADP Document Database PG

-----Original Message-----
From: Odin Ugedal <odin@ugedal.com>
Sent: June 9, 2020 23:23
To: Fan Liu <fan.liu@ericsson.com>
Cc: Dmitry Dolgov <9erthalion6@gmail.com>; PostgreSQL mailing lists
<pgsql-bugs@lists.postgresql.org>
Subject: Re: [Bus error] huge_pages default value (try) not fall back