Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS

Started by Ian J Cottee · over 1 year ago · 4 messages · general
#1 Ian J Cottee
ian@cottee.org

Hello everyone. I’ve been using Postgres for over 25 years now and have
never had any major issues that weren’t caused by my own stupidity. In the
last 24 hours, however, I’ve had a number of issues on one client's server
which I assume are either a bug in Postgres or a possible hardware issue
(they are running on a Linode). I need some clarification and would welcome
advice on how to proceed. I will also forward this mail to Linode support
and ask them to check for any memory issues they can detect.

This particular Postgres is running on Ubuntu LTS 22.04 and has the
following version information:

```

PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

```

The quick summary is that, over a 24-hour period, the following errors
appeared in the postgres logs at different times, causing the server
processes to restart:

- stuck spinlock detected
- free(): corrupted unsorted chunks
- double free or corruption (!prev)
- corrupted size vs. prev_size
- corrupted double-linked list
- *** stack smashing detected ***: terminated
- Segmentation fault

Here’s the more detailed breakdown.

On Monday evening this week, the following event occurred on the server

```

2024-10-28 18:12:47.145 GMT [575437] xxx@xxx PANIC: stuck spinlock detected
at LWLockWaitListLock, ./build/../src/backend/storage/lmgr/lwlock.c:913

```

Followed by:

```

2024-10-28 18:12:47.249 GMT [1880289] LOG: terminating any other active
server processes

2024-10-28 18:12:47.284 GMT [1880289] LOG: all server processes terminated;
reinitializing

```

And eventually

```

2024-10-28 18:12:48.474 GMT [575566] xxx@xxx FATAL: the database system is
in recovery mode

2024-10-28 18:12:48.476 GMT [575550] LOG: database system was not properly
shut down; automatic recovery in progress

2024-10-28 18:12:48.487 GMT [575550] LOG: redo starts at DD/405E83A8

2024-10-28 18:12:48.487 GMT [575550] LOG: invalid record length at
DD/405EF818: wanted 24, got 0

2024-10-28 18:12:48.487 GMT [575550] LOG: redo done at DD/405EF7E0 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

2024-10-28 18:12:48.515 GMT [1880289] LOG: database system is ready to
accept connections

```

This wasn’t noticed by me or any of the users, as they all tend to be
finished by 17:30. Later that evening, however:

```

2024-10-28 20:27:15.258 GMT [611459] xxx@xxx LOG: unexpected EOF on client
connection with an open transaction

2024-10-28 21:01:05.934 GMT [620373] xxx@xxxx LOG: unexpected EOF on client
connection with an open transaction

free(): corrupted unsorted chunks

2024-10-28 21:15:02.203 GMT [1880289] LOG: server process (PID 623803) was
terminated by signal 6: Aborted

2024-10-28 21:15:02.204 GMT [1880289] LOG: terminating any other active
server processes

```

This time it could not recover, and I didn’t notice until early the next
morning while doing some routine checks.

```

2024-10-28 21:15:03.643 GMT [623807] LOG: database system was not properly
shut down; automatic recovery in progress

2024-10-28 21:15:03.655 GMT [623807] LOG: redo starts at DD/47366740

2024-10-28 21:15:03.663 GMT [623807] LOG: invalid record length at
DD/475452A0: wanted 24, got 0

2024-10-28 21:15:03.663 GMT [623807] LOG: redo done at DD/47545268 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

2024-10-28 21:15:03.682 GMT [623829] xxx@xxx FATAL: the database system is
in recovery mode

double free or corruption (!prev)

2024-10-28 21:15:03.832 GMT [1880289] LOG: startup process (PID 623807) was
terminated by signal 6: Aborted

2024-10-28 21:15:03.832 GMT [1880289] LOG: aborting startup due to startup
process failure

2024-10-28 21:15:03.835 GMT [1880289] LOG: database system is shut down

```

When I noticed in the morning, the server was able to start without issue.
From googling, this appeared to be a memory problem, and I wondered whether
it had been resolved now that the server process had stopped completely and
been restarted. It had not, although the errors below were all recovered
from automatically, without any input from me and without the client
noticing.

```

corrupted size vs. prev_size

2024-10-29 09:55:24.417 GMT [894747] LOG: background worker "parallel
worker" (PID 947642) was terminated by signal 6: Aborted

```

```

corrupted double-linked list

2024-10-29 13:14:28.322 GMT [894747] LOG: background worker "parallel
worker" (PID 1019071) was terminated by signal 6: Aborted

```

```

*** stack smashing detected ***: terminated

2024-10-28 15:24:30.331 GMT [1880289] LOG: background worker "parallel
worker" (PID 528630) was terminated by signal 6: Aborted

```

```

2024-10-28 15:40:26.617 GMT [1880289] LOG: background worker "parallel
worker" (PID 533515) was terminated by signal 11: Segmentation fault

2024-10-28 15:40:26.617 GMT [1880289] DETAIL: Failed process was running:
SELECT "formula_line".id FROM "formul...

```

I rebooted the server at 18:30 and have had no further issues so far,
although the working day has yet to start. When rebooting the server,
Postgres seemed to take a long time to terminate.

There is one odd thing that has been happening recently: due to a bug in
my code, I've had more deadlocks than would normally be expected.

```

2024-10-29 19:26:51.680 GMT [71152] xxx@xxx ERROR: could not serialize
access due to concurrent update

```

I believe I fixed that bug in my code this morning, and those errors did
not seem to coincide with the errors above, but I'm raising it in case it
is related.
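For what it's worth, this is how I've been checking whether the
serialization failures line up in time with the crashes — a small sketch
(the helper name is mine; the log path in the usage comment is the Ubuntu
default for this cluster):

```shell
# crash_vs_serialize: print, with line numbers, any log lines matching
# either the serialization failures or the crash signatures, so their
# timestamps can be compared side by side.
crash_vs_serialize() {
    grep -nE 'could not serialize|signal (6|11)|stuck spinlock|corrupt' "$1"
}

# Typical usage on this box:
#   crash_vs_serialize /var/log/postgresql/postgresql-14-main.log
```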

Comments and insights are warmly welcomed.

Best regards

Ian Cottee

#2 Vijaykumar Jain
vijaykumarjain.github@gmail.com
In reply to: Ian J Cottee (#1)
Re: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS

On Wed, 30 Oct 2024 at 13:04, Ian J Cottee <ian@cottee.org> wrote:

Hello everyone, I’ve been using postgres for over 25 years now and never
had any major issues which were not caused by my own stupidity. In the last
24 hours however I’ve had a number of issues on one client's server which I
assume are a bug in postgres or a possible hardware issue (they are running
on a Linode) but I need some clarification and would welcome advice on how
to proceed. I will also forward this mail to Linode support to ask them to
check for any memory issues they can detect.

This particular Postgres is running on Ubuntu LTS 22.04 and has the
following version information:

```

PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

```

The quick summary is that over a 24 hour period I had the following errors
appear in the postgres logs at different times causing the system processes
to restart:

- stuck spinlock detected
- free(): corrupted unsorted chunks
- double free or corruption (!prev)
- corrupted size vs. prev_size
- corrupted double-linked list
- *** stack smashing detected ***: terminated
- Segmentation fault

Are you using a PostgreSQL setup compiled from source? Listing the output
of pg_config may give the details.

Are there any extensions installed? Can you list those extensions?

If you have access to the source packages, see "Getting a stack trace of a
running PostgreSQL backend on Linux/BSD" on the PostgreSQL wiki
<https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD>.
Can you generate a stack trace from the process that is crashing, or, if it
dumped a core, a backtrace from the core dump?

Would it be possible to share the actual logs, both PostgreSQL and kernel,
from around the incident?

Do you have access to any core dumps these crashes may have generated? I
think the aborts / segmentation faults would generate one.

Do you collect system stats? Around the time of the crash, do you see any
abnormal IO, CPU, or memory usage, along with locks held in the PostgreSQL
setup, etc.?
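To make those suggestions concrete, here is a rough sketch as dry-run
helpers — each prints the command to run rather than running it, since the
real thing needs root and a live cluster. The function names are mine, and
the dbgsym package name is an assumption about Ubuntu's debug-symbol
packaging:

```shell
# Prints the psql command to list installed extensions in a database.
list_extensions_cmd() {
    echo "psql -d $1 -c 'SELECT extname, extversion FROM pg_extension;'"
}

# Prints the gdb command for a backtrace: of a running backend given a
# PID, or of a core file if one is passed as the second argument.
backtrace_cmd() {
    if [ -n "$2" ]; then
        echo "gdb /usr/lib/postgresql/14/bin/postgres $2 -batch -ex bt"
    else
        echo "gdb -p $1 -batch -ex bt"
    fi
}

# Beforehand: enable core dumps and install symbols (package name assumed):
#   ulimit -c unlimited
#   sudo apt-get install gdb postgresql-14-dbgsym
```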

Here’s the more detailed breakdown.

On Monday evening this week, the following event occurred on the server

```

2024-10-28 18:12:47.145 GMT [575437] xxx@xxx PANIC: stuck spinlock
detected at LWLockWaitListLock,
./build/../src/backend/storage/lmgr/lwlock.c:913

```

I think a backtrace here would help show what part of the call stack led
to this; on its own, this does not look like a bug.


This wasn’t noticed by myself or any users as they tend to all be finished
by 17:30. However later,

```

2024-10-28 20:27:15.258 GMT [611459] xxx@xxx LOG: unexpected EOF on
client connection with an open transaction

2024-10-28 21:01:05.934 GMT [620373] xxx@xxxx LOG: unexpected EOF on
client connection with an open transaction

free(): corrupted unsorted chunks

```

This all looks like memory corruption or a leak somewhere; valgrind might
give more details if it is a leak.

```

2024-10-28 21:15:02.203 GMT [1880289] LOG: server process (PID 623803)
was terminated by signal 6: Aborted

2024-10-28 21:15:02.204 GMT [1880289] LOG: terminating any other active
server processes

```

This time it could not recover and I didn’t notice until early the next
morning whilst doing some routine checks.

```

2024-10-28 21:15:03.643 GMT [623807] LOG: database system was not properly
shut down; automatic recovery in progress

2024-10-28 21:15:03.655 GMT [623807] LOG: redo starts at DD/47366740

2024-10-28 21:15:03.663 GMT [623807] LOG: invalid record length at
DD/475452A0: wanted 24, got 0

2024-10-28 21:15:03.663 GMT [623807] LOG: redo done at DD/47545268 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

2024-10-28 21:15:03.682 GMT [623829] xxx@xxx FATAL: the database system
is in recovery mode

double free or corruption (!prev)

2024-10-28 21:15:03.832 GMT [1880289] LOG: startup process (PID 623807)
was terminated by signal 6: Aborted

2024-10-28 21:15:03.832 GMT [1880289] LOG: aborting startup due to startup
process failure

2024-10-28 21:15:03.835 GMT [1880289] LOG: database system is shut down

```

When I noticed in the morning it was able to start without an issue. From
googling it appeared to be a memory issue and I wondered if the problem was
sorted now the server process had stopped completely and restarted. The
problem was not sorted although all the above errors were recovered from
automatically without any input from myself or the client’s noticing.

```

corrupted size vs. prev_size

2024-10-29 09:55:24.417 GMT [894747] LOG: background worker "parallel
worker" (PID 947642) was terminated by signal 6: Aborted

```

```

corrupted double-linked list

2024-10-29 13:14:28.322 GMT [894747] LOG: background worker "parallel
worker" (PID 1019071) was terminated by signal 6: Aborted

```

```

*** stack smashing detected ***: terminated

2024-10-28 15:24:30.331 GMT [1880289] LOG: background worker "parallel
worker" (PID 528630) was terminated by signal 6: Aborted

```

```

2024-10-28 15:40:26.617 GMT [1880289] LOG: background worker "parallel
worker" (PID 533515) was terminated by signal 11: Segmentation fault

2024-10-28 15:40:26.617 GMT [1880289] DETAIL: Failed process was running:
SELECT "formula_line".id FROM "formul...

```

I don't know why it crashed with SIGABRT instead of SIGKILL if it was
indeed a memory leak and not a bug, so I'm not sure memory overcommit
settings are of use here.

How much concurrency do you have at peak, and with what work_mem? Is there
any theory of excessive work_mem and too many concurrent processes holding
locks for a long time? It should not crash even if that is the case, but
just asking.

Lastly, is it possible to run a memcheck on the machine just to rule out
memory scares? Whether this is running on a VM or on bare metal, were
there any hardware errors around that time?
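One quick, non-invasive check while waiting for a full memtest is to scan
the kernel log for hardware error signatures — a sketch (the function name
and the pattern list are mine):

```shell
# scan_kernel_log: grep a saved kernel log for common memory/hardware
# error signatures (machine-check exceptions, EDAC/ECC reports).
scan_kernel_log() {
    if grep -Eiq 'mce|edac|ecc|machine check|hardware error' "$1"; then
        echo "possible hardware memory errors in $1"
    else
        echo "no obvious hardware error signatures in $1"
    fi
}

# Typical usage (dmesg may need root on recent kernels):
#   dmesg > /tmp/kern.log && scan_kernel_log /tmp/kern.log
```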

Most likely it is a hardware issue; we used to see things like this on
bare metal, happening only occasionally at first and then frequently,
until we moved away from that setup.

Also, does it happen only when the optimiser picks a plan involving
parallel workers for a query? If you set max_parallel_workers_per_gather
to 0, so that nothing is parallelized, do you still see the issue?
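For reference, that can be toggled cluster-wide without a restart — a
sketch of the configuration commands (run as a superuser, and revert once
the test is done):

```shell
# Disable parallel query plans, reload the config, and verify:
psql -U postgres -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 0;"
psql -U postgres -c "SELECT pg_reload_conf();"
psql -U postgres -c "SHOW max_parallel_workers_per_gather;"

# Revert after testing:
psql -U postgres -c "ALTER SYSTEM RESET max_parallel_workers_per_gather;"
psql -U postgres -c "SELECT pg_reload_conf();"
```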

Just some insights; if not useful, please ignore.


--
Thanks,
Vijay

Open to work
Resume - Vijaykumar Jain <https://github.com/cabecada>

#3 Ian J Cottee
ian@cottee.org
In reply to: Vijaykumar Jain (#2)
Re: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS

Dear Vijay

Many thanks for the comprehensive response.

The installed Postgres is just the current standard Ubuntu 22.04 LTS
release; the output of pg_config is below. Since the reboot we've not had
the error again (even though I'm still getting concurrency issues), and
having reached out to Linode, they suggested the same thing you did:
running a memtest86+ check, which I will do this weekend. As the previous
errors (thankfully) are not showing now, I can't really do any more
debugging, but I'll report back on the results of the memtest.

Once again, many thanks for your thoughts and help, it is much appreciated.

Best regards

Ian

pg_config output:

BINDIR = /usr/lib/postgresql/14/bin
DOCDIR = /usr/share/doc/postgresql-doc-14
HTMLDIR = /usr/share/doc/postgresql-doc-14
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/14/server
LIBDIR = /usr/lib/x86_64-linux-gnu
PKGLIBDIR = /usr/lib/postgresql/14/lib
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/14/man
SHAREDIR = /usr/share/postgresql/14
SYSCONFDIR = /etc/postgresql-common
PGXS = /usr/lib/postgresql/14/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--build=x86_64-linux-gnu' '--prefix=/usr'
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info' '--sysconfdir=/etc'
'--localstatedir=/var' '--disable-option-checking'
'--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu'
'--runstatedir=/run' '--disable-maintainer-mode'
'--disable-dependency-tracking' '--with-tcl' '--with-perl'
'--with-python' '--with-pam' '--with-openssl' '--with-libxml'
'--with-libxslt' '--mandir=/usr/share/postgresql/14/man'
'--docdir=/usr/share/doc/postgresql-doc-14'
'--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/'
'--datadir=/usr/share/postgresql/14'
'--bindir=/usr/lib/postgresql/14/bin'
'--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/'
'--includedir=/usr/include/postgresql/' '--with-extra-version= (Ubuntu
14.13-0ubuntu0.22.04.1)' '--enable-nls' '--enable-thread-safety'
'--enable-debug' '--enable-dtrace' '--disable-rpath' '--with-uuid=e2fs'
'--with-gnu-ld' '--with-gssapi' '--with-ldap' '--with-pgport=5432'
'--with-system-tzdata=/usr/share/zoneinfo' 'AWK=mawk'
'MKDIR_P=/bin/mkdir -p' 'PROVE=/usr/bin/prove'
'PYTHON=/usr/bin/python3' 'TAR=/bin/tar' 'XSLTPROC=xsltproc --nonet'
'CFLAGS=-g -O2 -flto=auto -ffat-lto-objects -flto=auto
-ffat-lto-objects -fstack-protector-strong -Wformat
-Werror=format-security -fno-omit-frame-pointer'
'LDFLAGS=-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects
-flto=auto -Wl,-z,relro -Wl,-z,now' '--enable-tap-tests' '--with-icu'
'--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-14'
'CLANG=/usr/bin/clang-14' '--with-lz4' '--with-systemd'
'--with-selinux' 'build_alias=x86_64-linux-gnu'
'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -flto=auto
-ffat-lto-objects -flto=auto -ffat-lto-objects
-fstack-protector-strong -Wformat -Werror=format-security'
CC = gcc
CPPFLAGS = -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE
-I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3
-Wcast-function-type -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -g -g -O2 -flto=auto -ffat-lto-objects
-flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat
-Werror=format-security -fno-omit-frame-pointer
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects
-flto=auto -Wl,-z,relro -Wl,-z,now -L/usr/lib/llvm-14/lib
-Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lselinux -llz4 -lxslt -lxml2 -lpam -lssl
-lcrypto -lgssapi_krb5 -lz -lreadline -lm
VERSION = PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1)


#4 Vijaykumar Jain
vijaykumarjain.github@gmail.com
In reply to: Ian J Cottee (#3)
Re: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS

On Sat, 2 Nov 2024 at 12:50, Ian J Cottee <ian@cottee.org> wrote:

As the previous errors (thankfully) are not showing now I can't really do any more debugging but I'll report back on the results of the memtest.

Just for the sake of this thread, I wanted to mention a summary of the
errors reported earlier:

- Stuck spinlock detected: a process is unable to acquire a lock, which
can happen due to contention or a deadlock situation. It often points to
performance issues or bugs in the PostgreSQL codebase.
- free(): corrupted unsorted chunks: typically arises from memory
management issues, particularly when the memory allocator detects
corruption in its internal structures. This could be indicative of bugs in
the application code or PostgreSQL itself.
- double free or corruption (!prev): the program attempted to free memory
that has already been freed or is corrupted. It can lead to crashes and is
often linked to improper memory handling in the application.
- corrupted size vs. prev_size: similar to the previous errors, this
suggests a mismatch in expected memory sizes, which can be caused by
buffer overflows or improper memory allocation.
- corrupted double-linked list: the internal data structures used for
managing memory are corrupted, often resulting from improper memory
handling.
- Stack smashing detected: a program wrote more data to a buffer than it
can hold, corrupting adjacent memory. It's a common sign of buffer
overflow vulnerabilities.
- Segmentation fault: the program tried to access an invalid memory
location, leading to a crash. This is often a result of programming errors
such as dereferencing null pointers or accessing out-of-bounds array
elements.

Yes, I used Perplexity to understand these errors and asked in what
scenarios we get them. It may not all be correct, or may even be wrong,
which I'll have to dive into later. But if these are not hardware errors,
then there could be serious bugs that can be exploited for vulnerabilities
around buffer overflows.

If you have a standard installation from the Ubuntu package binaries, I
think there might be some lower-level C code in functions, or extensions,
also in play which could lead to the above errors. I hope a core was
dumped somewhere; a stack trace might have some info on what led to those
errors, and could help show whether there is unsafe code somewhere and
what queries resulted in it.

I understand this forum is not for discussing AI responses, but I also did
not want to ignore them due to lack of knowledge, hence sharing. As
always, I'm happy to be corrected.