Segmentation Fault PG 14
Hello!
I started using version `14.5-2.pgdg20.04+2` for a dedicated database, and
I'm seeing many segmentation faults during the day, when the database is
running heavier queries.
The server log contains many entries like this:
```
2022-11-07 17:23:19.423 UTC [728] LOG: background worker "parallel worker" (PID 9558) was terminated by signal 11: Segmentation fault
2022-11-07 17:23:19.423 UTC [728] DETAIL: Failed process was running: select blablabla from heavyquery where ...;
2022-11-07 17:23:19.423 UTC [728] LOG: terminating any other active server processes
2022-11-07 17:23:19.681 UTC [9588] microservice@microservice FATAL: the database system is in recovery mode
2022-11-07 17:23:19.683 UTC [9589] microservice@microservice FATAL: the database system is in recovery mode
2022-11-07 17:23:24.543 UTC [728] LOG: all server processes terminated; reinitializing
2022-11-07 17:23:24.894 UTC [9622] LOG: database system was interrupted; last known up at 2022-11-07 17:22:07 UTC
2022-11-07 17:23:25.636 UTC [9622] LOG: invalid record length at 134/227A3A68: wanted 24, got 0
2022-11-07 17:23:25.636 UTC [9622] LOG: redo done at 134/227A3A38 system usage: CPU: user: 0.04 s, system: 0.06 s, elapsed: 0.70 s
2022-11-07 17:23:27.608 UTC [728] LOG: database system is ready to accept connections
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] LOG: could not receive data from client: Connection reset by peer
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] STATEMENT: START_REPLICATION 134/22000000 TIMELINE 1
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] LOG: unexpected EOF on standby connection
2022-11-07 17:23:33.474 UTC [9635] replica@[unknown] STATEMENT: START_REPLICATION 134/22000000 TIMELINE 1
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] LOG: could not receive data from client: Connection reset by peer
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] STATEMENT: START_REPLICATION 134/22000000 TIMELINE 1
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] LOG: unexpected EOF on standby connection
2022-11-07 17:23:51.310 UTC [9662] replica@[unknown] STATEMENT: START_REPLICATION 134/22000000 TIMELINE 1
INFO: 2022/11/07 17:23:51.445710 FILE PATH: 000000010000013400000022.lz4
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] LOG: could not receive data from client: Connection reset by peer
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] STATEMENT: START_REPLICATION 134/23000000 TIMELINE 1
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] LOG: unexpected EOF on standby connection
2022-11-07 17:24:09.206 UTC [9672] replica@[unknown] STATEMENT: START_REPLICATION 134/23000000 TIMELINE 1
INFO: 2022/11/07 17:24:27.527897 FILE PATH: 000000010000013400000023.lz4
INFO: 2022/11/07 17:24:38.076058 FILE PATH: 000000010000013400000024.lz4
```
The server is running Ubuntu 22.04 on aarch64 (ARM architecture).
I also got a little information from gdb; I'm not sure whether it will
help:
```
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/14/bin/postgres...
Reading symbols from /usr/lib/debug/.build-id/d7/87a0cf1bb645b349f7c137e36cc30f7ba8805f.debug...
[New LWP 9559]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 14/main: parallel worker for PID 9528 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000100000c757c9c in ?? ()
(gdb) bt
#0 0x000100000c757c9c in ?? ()
#1 0x0000ffff0c757124 in ?? ()
#2 0x0000aaaac2ac9970 in ExecProcNode (node=0xaaaafc599818) at ./build/../src/include/executor/executor.h:257
#3 ExecAppend (pstate=0xaaaafc595918) at ./build/../src/backend/executor/nodeAppend.c:360
#4 0x0000aaaac2ac9970 in ExecProcNode (node=0xaaaafc595918) at ./build/../src/include/executor/executor.h:257
#5 ExecAppend (pstate=0xaaaafc526988) at ./build/../src/backend/executor/nodeAppend.c:360
#6 0x0000000000000001 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
```
Has anyone faced this problem before, or does anyone know a solution?
Thanks in advance.
--
<http://www.trimble.com/>
*Willian Cezar de O. Colognesi*
Systems Analysis Specialist, Trimble Transportation Brazil
Avenida Santos Dumont, 271 | Londrina, PR | 86039-090
Willian Colognesi <willian_colognesi@trimble.com> writes:
I started to use version `14.5-2.pgdg20.04+2` for a dedicated database and
I'm facing many segmentation faults during the day when the database has
more heavy queries.
I take it things were okay with the version you used previously?
What was that exactly? Has anything else changed?
I could also get a little information from gdb, I'm not sure if it will
help:
This looks pretty messed up. Are you sure the debug symbols you're using
match the package?
Even better, can you construct a self-contained test case?
regards, tom lane
Hi Tom,
`I take it things were okay with the version you used previously?`
Yes, it was working well on another instance running PostgreSQL
`12.4-1.pgdg18.04+1`. We had to migrate one database from that server to
a new one using logical replication.
The process was basically this:
CREATE PUBLICATION my_database_pub FOR ALL TABLES;
postgres@origin:~$ psql "dbname=<my_database> replication=database"
my_database=# CREATE_REPLICATION_SLOT <slot_name> LOGICAL pgoutput;
pg_dump -j4 -h <host> -p 5432 --no-subscriptions --no-publications -d
<my_database> --snapshot=<snapshot_generated> -Fd -U <my_user> -f
</mnt/dump>
postgres@destination:/mnt/database$ pg_restore -d <my_database> -j 5
</mnt/dump>
CREATE SUBSCRIPTION <name_sub>
CONNECTION 'host=<host> dbname=<my_database> user=replica
password=?? port=5432'
PUBLICATION <name_pub>
WITH (slot_name=<slot_name>, create_slot=false, copy_data=false);
After this migration we started to have this kind of problem in both
replica and primary servers.
`This looks pretty messed up. Are you sure the debug symbols you're using`
What exactly do you mean? I'm not very familiar with these debugging tools.
The packages I used were:
postgresql-14/focal-pgdg,now 14.5-2.pgdg20.04+2 arm64 [installed]
postgresql-14-dbgsym/focal-pgdg,now 14.5-2.pgdg20.04+2 arm64 [installed]
`Even better, can you construct a self-contained test case?`:
Actually, I couldn't reproduce the problem, because it only happens on a
production database and there doesn't seem to be a pattern in the cases
where it happens.
Is there anything I could provide to help the analysis?
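One way to check whether the installed debug symbols match the server binary is to compare GNU build IDs. The sketch below is only an illustration: the helper names `postgres_build_id` and `debug_path_for_build_id` are hypothetical, it assumes `readelf` (from binutils) is available, and it assumes the Debian/Ubuntu layout in which dbgsym packages place debug files under `/usr/lib/debug/.build-id/`, as the path in the gdb output suggests.

```shell
# Print the GNU build ID embedded in an ELF binary (needs readelf from binutils).
# The default path is the server binary named in the gdb output.
postgres_build_id() {
    readelf -n "${1:-/usr/lib/postgresql/14/bin/postgres}" |
        awk '/Build ID:/ { print $3 }'
}

# Map a build ID to the path where a matching debug file would be expected
# (assumed Debian/Ubuntu layout: the first two hex digits form the directory).
debug_path_for_build_id() {
    printf '/usr/lib/debug/.build-id/%s/%s.debug\n' \
        "$(printf '%s' "$1" | cut -c1-2)" \
        "$(printf '%s' "$1" | cut -c3-)"
}

# Usage on the affected server (not run here):
#   ls -l "$(debug_path_for_build_id "$(postgres_build_id)")"
```

If that file exists, the debug package was built from the same binary. The gdb session in the original report read its symbols from a `.build-id` path, which hints that the two packages do match.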
On 11/7/22 10:36 AM, Willian Colognesi wrote:
Hi Tom,
`I take it things were okay with the version you used previously?`
Yes, it was working pretty well in another instance with pg version
`12.4-1.pgdg18.04+1`, and we had to make a migration of one database
that was running in this server to another using Logical Replication.
Actually, you used dump/restore and logical replication.
Regarding the process you described:
1) What versions of pg_dump and pg_restore did you use?
2) To be clear, was the subscription started after the restore?
3) Were there any error messages at any point in the process?
4) Are the database clusters on the same machine?
--
Adrian Klaver
adrian.klaver@aklaver.com
1) What versions of pg_dump and pg_restore did you use?
A: pg_dump and pg_restore were both run with PostgreSQL 14 (the same
version the destination is running).
2) To be clear, was the subscription started after the restore?
A: Yes.
3) Were there any error messages at any point in the process?
A: No errors during the dump and restore.
4) Are the database clusters on the same machine?
A: No, the origin and destination were different servers in the same VPC.
On 11/7/22 10:57 AM, Willian Colognesi wrote:
4) Are the database clusters on the same machine?
A: No, the origin and destination were different servers at the same VPC.
Are the servers using the same OS version?
--
Adrian Klaver
adrian.klaver@aklaver.com
No. The origin server, where the database used to live, was running Ubuntu
18.04.5 on x86_64, and the destination is running Ubuntu 20.04.5 on aarch64.
On 11/7/22 11:03 AM, Willian Colognesi wrote:
No, the origin where the database was was running ubuntu 18.04.5 x86_64
and the destination ubuntu 20.04.5 aarch64
Where I was going was this:
https://wiki.postgresql.org/wiki/Locale_data_changes
Then I realized you had not done any binary upgrades, so that is a dead end.
--
Adrian Klaver
adrian.klaver@aklaver.com
Willian Colognesi <willian_colognesi@trimble.com> writes:
`I take it things were okay with the version you used previously?`
Yes, it was working pretty well in another instance with pg version
`12.4-1.pgdg18.04+1`, and we had to make a migration of one database that
was running in this server to another using Logical Replication.
12.4 to 14.5 is kind of a big jump :-(.
The stack trace seems to indicate that ExecProcNode transferred control
to never-never land, which says that something clobbered the function
pointer it's trying to indirect through. I don't recall having seen
any similar reports though.
Are you using any extensions besides those that come with core Postgres?
A build incompatibility with some third-party extension might explain
this, perhaps.
One thing I'm curious about is that the stack trace seems to imply that
there was an Append plan node immediately below another Append. That
shouldn't happen AFAIK --- the planner tries to collapse out such
cases. Can you get us an EXPLAIN for the problem query?
regards, tom lane
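For a query that should not be re-run on production, plain `EXPLAIN` (without `ANALYZE`) only plans the statement. A minimal sketch of collecting such a plan with psql follows; the helper name `explain_file`, the file name, and the database name are placeholders, and the file is assumed to hold a single SELECT statement.

```shell
# Hypothetical helper: prefix the statement stored in a file with EXPLAIN and
# send it to psql on stdin. VERBOSE adds output column lists; SETTINGS
# (PostgreSQL 12 and later) lists planner settings changed from their defaults.
explain_file() {
    { printf 'EXPLAIN (VERBOSE, SETTINGS)\n'; cat "$1"; } |
        psql -X -d "$2" -f -
}

# Usage (placeholders): explain_file heavyquery.sql my_database > plan.txt
```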
All the extensions installed in this database are these:
```
                                     List of installed extensions
        Name        | Version |   Schema   |                        Description
--------------------+---------+------------+-----------------------------------------------------------
 amcheck            | 1.3     | public     | functions for verifying relation integrity
 btree_gist         | 1.6     | public     | support for indexing common datatypes in GiST
 pg_stat_statements | 1.9     | public     | track execution statistics of all SQL statements executed
 pgcrypto           | 1.3     | public     | cryptographic functions
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
(5 rows)
```
I ran the query with parameters it would plausibly be run with (I'm not
sure exactly which values in the WHERE clause triggered the segmentation
fault).
Here is the EXPLAIN output: https://explain.depesz.com/s/Tql3 (PS: I had to
hide the real table and index names.)
It looks like, since I disabled *jit* as Boris suggested, the database has
not restarted again so far... (not sure whether it's a coincidence)
On 11/7/22 12:15, Willian Colognesi wrote:
Looks like since I've disabled *jit* as Boris told, until now the
database did not restart again... (not sure if it's coincidence)
I did not see that post or suggestion.
What was the suggestion?
Are you saying the database does not start up now?
--
Adrian Klaver
adrian.klaver@aklaver.com
No, the database is running well; there have been no problems since I
disabled *jit*.
I just realized that he sent the email directly to me. The message was:
```
I had similar problems with and the cure was to turn off jit in
Postgres.conf
jit = off
--
Boris
```
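Besides editing `postgresql.conf` by hand, the same setting can be applied from SQL. A sketch, assuming a superuser connection through psql; the function name `disable_jit` is made up for illustration.

```shell
# Hypothetical helper: persist jit = off in postgresql.auto.conf and reload.
# Each -c runs as its own transaction, which ALTER SYSTEM requires.
disable_jit() {
    psql -X -v ON_ERROR_STOP=1 \
        -c "ALTER SYSTEM SET jit = off" \
        -c "SELECT pg_reload_conf()"
}

# Afterwards, a new session should report "off":
#   psql -X -At -c "SHOW jit"
```

`jit` can also be changed for a single session with `SET jit = off`, which is handy for testing one suspect query without touching the rest of the cluster.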
Willian Colognesi <willian_colognesi@trimble.com> writes:
No, the database is running well; there have been no problems since I
disabled *jit*.
Interesting. Which version of LLVM is installed?
regards, tom lane
Do you mean how it was compiled? Here is the output of pg_config:
```
root@ip-10-x-x-x:/home/ubuntu# pg_config --configure
'--build=aarch64-linux-gnu' '--prefix=/usr'
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
'--disable-silent-rules' '--libdir=${prefix}/lib/aarch64-linux-gnu'
'--runstatedir=/run' '--disable-maintainer-mode'
'--disable-dependency-tracking' '--with-tcl' '--with-perl' '--with-python'
'--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt'
'--mandir=/usr/share/postgresql/14/man'
'--docdir=/usr/share/doc/postgresql-doc-14'
'--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/'
'--datadir=/usr/share/postgresql/14' '--bindir=/usr/lib/postgresql/14/bin'
'--libdir=/usr/lib/aarch64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/'
'--includedir=/usr/include/postgresql/' '--with-extra-version= (Ubuntu
14.5-2.pgdg20.04+2)' '--enable-nls' '--enable-thread-safety'
'--enable-debug' '--enable-dtrace' '--disable-rpath' '--with-uuid=e2fs'
'--with-gnu-ld' '--with-gssapi' '--with-ldap' '--with-pgport=5432'
'--with-system-tzdata=/usr/share/zoneinfo' 'AWK=mawk' 'MKDIR_P=/bin/mkdir
-p' 'PROVE=/usr/bin/prove' 'PYTHON=/usr/bin/python3' 'TAR=/bin/tar'
'XSLTPROC=xsltproc --nonet' 'CFLAGS=-g -O2 -fstack-protector-strong
-Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions
-Wl,-z,relro -Wl,-z,now' '--enable-tap-tests' '--with-icu' '--with-llvm'
'LLVM_CONFIG=/usr/bin/llvm-config-10' 'CLANG=/usr/bin/clang-10'
'--with-lz4' '--with-systemd' '--with-selinux'
'build_alias=aarch64-linux-gnu' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'
'CXXFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security'
```
There is no LLVM installed on the Ubuntu server; PostgreSQL was installed
from the apt package with `apt install postgresql-14`.
Willian Colognesi <willian_colognesi@trimble.com> writes:
There is no llvm installed on ubuntu server, postgresql was installed via
apt package `apt install postgresql-14`
If there's no LLVM around, then disabling JIT wouldn't do anything,
because it depends on LLVM to compile code.
We should perhaps wait awhile to see if that really fixed it.
regards, tom lane
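Whether a server can actually use JIT can be checked from SQL rather than by looking for an `llvm` command. A sketch, assuming psql can connect as the current user; `jit_status` is a hypothetical helper name.

```shell
# Hypothetical helper: report the jit setting and whether JIT is usable.
# pg_jit_available() returns true only if a JIT provider is installed and
# the jit parameter is on.
jit_status() {
    psql -X -At -c "SELECT current_setting('jit'), pg_jit_available()"
}
```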
On Mon, Nov 7, 2022 at 2:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Willian Colognesi <willian_colognesi@trimble.com> writes:
`I take it things were okay with the version you used previously?`
Yes, it was working pretty well in another instance with pg version
`12.4-1.pgdg18.04+1`, and we had to make a migration of one database that
was running in this server to another using Logical Replication.
12.4 to 14.5 is kind of a big jump :-(.
The stack trace seems to indicate that ExecProcNode transferred control
to never-never land, which says that something clobbered the function
pointer it's trying to indirect through. I don't recall having seen
any similar reports though.
I'm just thinking out loud... I've seen the latest GCC do that on what
it believes to be dead code. Our problem was detailed at
https://github.com/weidai11/cryptopp/issues/1141 .
We identified the problem by building/running our self tests with
-fsanitize=unreachable .
Testing with -fsanitize=unreachable should confirm or rule out GCC and
Clang [incorrectly] removing code that is actually needed. If this is
the problem, then -fsanitize=unreachable will also provide a usable
stack trace and provide a useful debugging experience.
Jeff
On Tue, Nov 8, 2022 at 11:45 AM Willian Colognesi
<willian_colognesi@trimble.com> wrote:
root@ip-10-x-x-x:/home/ubuntu# pg_config --configure
... --with-extra-version= (Ubuntu 14.5-2.pgdg20.04+2)' ...
... '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' ...
There is no llvm installed on ubuntu server, postgresql was installed via apt package `apt install postgresql-14`
We can see from the pg_config output that it's built with LLVM 10.
Also that looks like it's the usual pgdg packages which are certainly
built against LLVM and will install it automatically.
You are right, Thomas.
I just checked, and it is installed:
ubuntu@ip-10-x-x-x:~$ apt search llvm | grep inst
WARNING: apt does not have a stable CLI interface. Use with caution in
scripts.
libllvm10/focal,now 1:10.0.0-4ubuntu1 arm64 [installed,automatic]
I had tried something like `llvm -version`, without success, but I have
now verified through apt that it is installed.
Tom,
Since yesterday the database hasn't restarted, so I believe there is some
problem related to jit.
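The Ubuntu `libllvm10` package ships the LLVM shared library rather than command-line tools, which would explain why nothing like `llvm -version` worked. A sketch of two more direct checks; the helper names are hypothetical, and the second assumes the JIT provider is `llvmjit.so` in the directory reported by `pg_config --pkglibdir`.

```shell
# List LLVM runtime library packages known to dpkg, with their versions.
llvm_packages() {
    dpkg-query -W -f='${Package} ${Version}\n' 'libllvm*'
}

# Show which LLVM shared library PostgreSQL's JIT provider is linked against.
# Assumption: the provider is llvmjit.so under `pg_config --pkglibdir`.
jit_llvm_library() {
    ldd "$(pg_config --pkglibdir)/llvmjit.so" | grep -i 'libllvm'
}
```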
It looks like we can confirm that disabling jit fixed the problem: since I
disabled it yesterday the database has not restarted again, whereas before
it was restarting at least once per hour.
I don't think having it disabled will have much impact on our use case, so
if you need anything else that could help the analysis and find the bug,
feel free to let me know and I can grab the logs or whatever is needed.
Thanks, y'all
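A figure like "at least once per hour" can be checked against the server log. A minimal sketch, assuming each log entry is a single line that starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt at the top of the thread; `segfaults_per_hour` is a hypothetical helper name, and the log path in the usage line is the Debian/Ubuntu default, which may differ on other setups.

```shell
# Count "terminated by signal 11" log lines per hour.
# The first 13 characters of a line ("YYYY-MM-DD HH") identify the hour.
segfaults_per_hour() {
    awk '/was terminated by signal 11/ { n[substr($0, 1, 13)]++ }
         END { for (h in n) print h, n[h] }' "$1" | sort
}

# Usage (path is an assumption):
#   segfaults_per_hour /var/log/postgresql/postgresql-14-main.log
```

Comparing the counts before and after the configuration change gives a rough indication of whether turning JIT off made the difference.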
Willian Colognesi <willian_colognesi@trimble.com> writes:
It looks like we can confirm that disabling jit fixed the problem: since I
disabled it yesterday the database has not restarted again, whereas before
it was restarting at least once per hour.
Hmm. I now recall that we had a previous report of problems with
JIT on aarch64/Focal:
/messages/by-id/20220303150428.GA26036@depesz.com
That was LLVM 9 not LLVM 10, but since we never identified the exact
issue, there's no real strong reason to suppose it's been fixed.
Probably keeping JIT off is the best answer for you --- it's hard to
say when we'll be able to make progress with this, given the lack of
reproducible test cases.
regards, tom lane