BUG #13985: Segmentation fault on PREPARE TRANSACTION

Started by Chris Tesselsabout 10 years ago14 messagesbugs
Jump to latest
#1Chris Tessels
chris.tessels@inergy.nl

The following bug has been logged on the website:

Bug reference: 13985
Logged by: Chris Tessels
Email address: chris.tessels@inergy.nl
PostgreSQL version: 9.4.6
Operating system: CentOS release 6.6 (Final)
Description:

The first time we saw this segmentation fault was on an instance running
PostgreSQL 9.4.4.
To make sure the problem wasn't already fixed, we upgraded another instance
on identical hardware to PostgreSQL 9.4.6.
This would also exclude potential hardware related problems like bad
memory.
We also updated our JDBC driver to the latest version:
https://jdbc.postgresql.org/download/postgresql-9.4.1208.jar

Although we were able to make both systems crash, we were not able to
systematically reproduce the problem.

Here is the postgresql.log when the segfault occurred.

/data/postgres/pg_log/postgresql.log
2016-02-23 14:45:12.901 CET LOG: server process (PID 519) was terminated
by signal 11: Segmentation fault
2016-02-23 14:45:12.901 CET DETAIL: Failed process was running: PREPARE
TRANSACTION
'131077_AAAAAAAAAAAAAP//CjIGBaCG30NWzFRpAdtPATE=_AAAAAAAAAAAAAP//CjIGBaCG30NWzFRpAdtP4AAAAAIAAAAA'
2016-02-23 14:45:12.901 CET LOG: terminating any other active server
processes
2016-02-23 14:45:12.902 CET LOG: archiver process (PID 114131) exited
with exit code 1
2016-02-23 14:45:14.226 CET replication_user FATAL: the database system is
in recovery mode
2016-02-23 14:45:14.564 CET LOG: all server processes terminated;
reinitializing
2016-02-23 14:45:17.404 CET LOG: database system was interrupted; last
known up at 2016-02-23 14:31:36 CET
2016-02-23 14:53:06.753 CET mailinfo_ow FATAL: the database system is in
recovery mode
2016-02-23 14:53:06.760 CET mailinfo_ow FATAL: the database system is in
recovery mode
2016-02-23 14:53:08.830 CET replication_user FATAL: the database system is
in recovery mode
2016-02-23 14:53:10.203 CET LOG: record with zero length at 5/1409A540
2016-02-23 14:53:10.203 CET LOG: redo done at 5/1409A3C0
2016-02-23 14:53:10.203 CET LOG: last completed transaction was at log
time 2016-02-23 14:44:32.012791+01
2016-02-23 14:53:11.708 CET LOG: MultiXact member wraparound protections
are now enabled
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609585
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609594
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609596
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609601
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609591
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609588
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609593
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609584
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609595
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609586
2016-02-23 14:53:11.710 CET LOG: recovering prepared transaction
12609590
2016-02-23 14:53:11.756 CET LOG: autovacuum launcher started
2016-02-23 14:53:11.757 CET LOG: database system is ready to accept
connections

/var/log/messages
Feb 23 14:44:32 [hostname] kernel: postmaster[519]: segfault at
7fc2d8e6b634 ip 000000000066a95b sp 00007fff5ca887d0 error 4 in
postgres[400000+566000]

Analyzing the coredump gave us the following information:
sudo -u postgres gdb -q -c
/var/spool/abrt/ccpp-2016-02-23-16\:27\:50-77566/coredump
/usr/pgsql-9.4/bin/postgres

Core was generated by `postgres: mailinfo_ow mailinfo_ods 10.50.6.6(4188'.
Program terminated with signal 11, Segmentation fault.

#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)

Postgresql version:
PostgreSQL 9.4.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit

Operating system and version:
[root@[hostname] ccpp-2016-02-23-16:27:50-77566]# cat /etc/redhat-release
CentOS release 6.6 (Final)
[root@[hostname] ccpp-2016-02-23-16:27:50-77566]# uname -a
Linux [hostname] 2.6.32-504.el6.x86_64 #1 SMP Wed Oct 15 04:27:16 UTC 2014
x86_64 x86_64 x86_64 GNU/Linux

We installed Postgres using the following command:
yum install from http://yum.postgresql.org/9.4/redhat/rhel-6.6-x86_64/

Our configuration settings are:
name current_setting source
archive_command rsync -a %p /data/postgres_archive/%f configuration file
archive_mode on configuration file
bgwriter_delay 100ms configuration file
bgwriter_lru_maxpages 1000 configuration file
checkpoint_completion_target 0 configuration file
checkpoint_segments 192 configuration file
checkpoint_timeout 1h configuration file
client_encoding UTF8 client
commit_delay 50 configuration file
commit_siblings 50 configuration file
DateStyle ISO, MDY client
default_text_search_config pg_catalog.english configuration file
dynamic_shared_memory_type posix configuration file
effective_cache_size 16GB configuration file
effective_io_concurrency 1 configuration file
extra_float_digits 3 session
fsync on configuration file
full_page_writes off configuration file
hot_standby on configuration file
hot_standby_feedback on configuration file
lc_messages en_US.UTF-8 configuration file
lc_monetary en_US.UTF-8 configuration file
lc_numeric en_US.UTF-8 configuration file
lc_time en_US.UTF-8 configuration file
listen_addresses * configuration file
log_destination stderr configuration file
log_line_prefix %m %u configuration file
log_min_messages log configuration file
log_rotation_age 0 configuration file
log_rotation_size 1000MB configuration file
log_statement none configuration file
log_timezone CET configuration file
log_truncate_on_rotation on configuration file
logging_collector on configuration file
maintenance_work_mem 2GB configuration file
max_connections 400 configuration file
max_prepared_transactions 100 configuration file
max_stack_depth 2MB environment variable
max_standby_archive_delay 2min configuration file
max_standby_streaming_delay 2min configuration file
max_wal_senders 3 configuration file
port 5432 configuration file
shared_buffers 32GB configuration file
synchronous_commit off configuration file
temp_buffers 8MB configuration file
TimeZone Europe/Amsterdam client
wal_level hot_standby configuration file
work_mem 64MB configuration file

The program using to connect to PostgreSQL:
Wildfly 8.2.0 XA datasource with driver
https://jdbc.postgresql.org/download/postgresql-9.4.1208.jar

wildfly config:

<xa-datasource jndi-name="java:jboss/datasources/MailInfoXADS"
pool-name="MailInfoXADS" enabled="true" use-java-context="true">
<xa-datasource-property name="ServerName">
[hostname]
</xa-datasource-property>
<xa-datasource-property name="PortNumber">
5432
</xa-datasource-property>
<xa-datasource-property name="DatabaseName">
mailinfo_ods
</xa-datasource-property>
<driver>postgresql-jdbc4</driver>
<xa-pool>
<min-pool-size>5</min-pool-size>
<initial-pool-size>5</initial-pool-size>
<max-pool-size>40</max-pool-size>
<prefill>true</prefill>
</xa-pool>
<security>
<user-name>mailinfo_ow</user-name>
<password>xxxxxx</password>
</security>
<validation>
<valid-connection-checker
class-name="org.jboss.jca.adapters.jdbc.extensions.postgres.PostgreSQLValidConnectionChecker"/>
<exception-sorter
class-name="org.jboss.jca.adapters.jdbc.extensions.postgres.PostgreSQLExceptionSorter"/>
</validation>
</xa-datasource>

<drivers>
<driver name="postgresql-jdbc4" module="org.postgresql">
<xa-datasource-class>org.postgresql.xa.PGXADataSource</xa-datasource-class>
</driver>
</drivers>

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Chris Tessels (#1)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

chris.tessels@inergy.nl wrote:

Core was generated by `postgres: mailinfo_ow mailinfo_ods 10.50.6.6(4188'.
Program terminated with signal 11, Segmentation fault.

#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)

It's not surprising that you're not able to make this crash
consistently, because it looks like the problem might be in concurrent
modifications to the PGXACT array. This routine, MinimumActiveBackends,
walks the PGPROC array explicitely without locks. There are comments
indicating that this is safe, but evidently something has slipped in
there.

Apparently this code is trying to dereference an invalid pgxact, but
it's not clear to me how this happens. Those structs are allocated in
advance, and they are referenced in the code via array indexes, so even
if the pgxact doesn't actually hold data about a valid transaction,
dereferencing the XID shouldn't cause a crash.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#2)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

On 2016-02-24 17:52:37 -0300, Alvaro Herrera wrote:

chris.tessels@inergy.nl wrote:

Core was generated by `postgres: mailinfo_ow mailinfo_ods 10.50.6.6(4188'.
Program terminated with signal 11, Segmentation fault.

#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)

It's not surprising that you're not able to make this crash
consistently, because it looks like the problem might be in concurrent
modifications to the PGXACT array. This routine, MinimumActiveBackends,
walks the PGPROC array explicitely without locks. There are comments
indicating that this is safe, but evidently something has slipped in
there.

Apparently this code is trying to dereference an invalid pgxact, but
it's not clear to me how this happens. Those structs are allocated in
advance, and they are referenced in the code via array indexes, so even
if the pgxact doesn't actually hold data about a valid transaction,
dereferencing the XID shouldn't cause a crash.

Well, that code is pretty, uh, questionable. E.g. for
int pgprocno = arrayP->pgprocnos[index];
volatile PGPROC *proc = &allProcs[pgprocno];
volatile PGXACT *pgxact = &allPgXact[pgprocno];
there's no guarantee that pgprocno is actually the same index for both
lookups and the following
if (pgprocno == -1)
continue; /* do not count deleted entries */
check. It's perfectly reasonable for a compiler to reload pgprocno from
memory, or just always reference it via memory.

I presume what happened here is that initially arrayP->pgprocnos[index]
was -1, but by the time if (pgprocno == -1) is reached, it changed to a
different value.

It's also really crummy that we're doing the PGPROC/PGXACT lookups
before checking whether pgprocno is -1.

At the very least ISTM that we have to make pgprocno volatile (or use a
memory barrier - but we don't have sufficient support for those in the
older branches), and move the PGPROC/PGXACT lookups after the == -1
check.

Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Andres Freund (#3)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

On Wed, Feb 24, 2016 at 10:52 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-24 17:52:37 -0300, Alvaro Herrera wrote:

chris.tessels@inergy.nl wrote:

Core was generated by `postgres: mailinfo_ow mailinfo_ods

10.50.6.6(4188'.

Program terminated with signal 11, Segmentation fault.

#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)

It's not surprising that you're not able to make this crash
consistently, because it looks like the problem might be in concurrent
modifications to the PGXACT array. This routine, MinimumActiveBackends,
walks the PGPROC array explicitely without locks. There are comments
indicating that this is safe, but evidently something has slipped in
there.

Apparently this code is trying to dereference an invalid pgxact, but
it's not clear to me how this happens. Those structs are allocated in
advance, and they are referenced in the code via array indexes, so even
if the pgxact doesn't actually hold data about a valid transaction,
dereferencing the XID shouldn't cause a crash.

Well, that code is pretty, uh, questionable. E.g. for
int pgprocno =
arrayP->pgprocnos[index];
volatile PGPROC *proc = &allProcs[pgprocno];
volatile PGXACT *pgxact = &allPgXact[pgprocno];
there's no guarantee that pgprocno is actually the same index for both
lookups and the following
if (pgprocno == -1)
continue; /* do not count
deleted entries */
check. It's perfectly reasonable for a compiler to reload pgprocno from
memory, or just always reference it via memory.

Hm... this is against my understanding of what a compiler could (or
should) do. Do you have a documentation reference or other evidence?

A naive disassemble dump on-my-box(tm) suggests that pgprocno is stored on
stack (with offset -0x1c) and is referenced from there:
...
0x00000000007f8011 <+53>: mov -0x18(%rbp),%rax
0x00000000007f8015 <+57>: mov -0x20(%rbp),%edx
0x00000000007f8018 <+60>: movslq %edx,%rdx
0x00000000007f801b <+63>: add $0x8,%rdx
0x00000000007f801f <+67>: mov 0x8(%rax,%rdx,4),%eax
0x00000000007f8023 <+71>: mov %eax,-0x1c(%rbp)
0x00000000007f8026 <+74>: mov 0x67b15b(%rip),%rdx # 0xe73188
<allProcs>
0x00000000007f802d <+81>: mov -0x1c(%rbp),%eax # <== here
0x00000000007f8030 <+84>: cltq
0x00000000007f8032 <+86>: imul $0x338,%rax,%rax
0x00000000007f8039 <+93>: add %rdx,%rax
0x00000000007f803c <+96>: mov %rax,-0x10(%rbp)
0x00000000007f8040 <+100>: mov 0x67b149(%rip),%rcx # 0xe73190
<allPgXact>
0x00000000007f8047 <+107>: mov -0x1c(%rbp),%eax # <== here
0x00000000007f804a <+110>: movslq %eax,%rdx
0x00000000007f804d <+113>: mov %rdx,%rax
0x00000000007f8050 <+116>: add %rax,%rax
0x00000000007f8053 <+119>: add %rdx,%rax
0x00000000007f8056 <+122>: shl $0x2,%rax
0x00000000007f805a <+126>: add %rcx,%rax
0x00000000007f805d <+129>: mov %rax,-0x8(%rbp)
0x00000000007f8061 <+133>: cmpl $0xffffffff,-0x1c(%rbp) # <== and
here
0x00000000007f8065 <+137>: je 0x7f80a4 <MinimumActiveBackends+200>
...

I presume what happened here is that initially arrayP->pgprocnos[index]

was -1, but by the time if (pgprocno == -1) is reached, it changed to a
different value.

It's also really crummy that we're doing the PGPROC/PGXACT lookups
before checking whether pgprocno is -1.

No doubt here.

At the very least ISTM that we have to make pgprocno volatile (or use a

memory barrier - but we don't have sufficient support for those in the
older branches), and move the PGPROC/PGXACT lookups after the == -1
check.

Use of volatile doesn't change the resulting code dramatically for me.

The above is when compiling without optimizations. With -Og I'm getting
similar code for the variant with volatile and if no volatile, pgprocno is
just loaded into a register as one would expect.

--
Alex

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#4)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

On Wed, Feb 24, 2016 at 10:52 PM, Andres Freund <andres@anarazel.de> wrote:
At the very least ISTM that we have to make pgprocno volatile (or use a

memory barrier - but we don't have sufficient support for those in the
older branches), and move the PGPROC/PGXACT lookups after the == -1
check.

Use of volatile doesn't change the resulting code dramatically for me.

Marking pgprocno volatile is silly. What *is* missing is this:

-	ProcArrayStruct *arrayP = procArray;
+	volatile ProcArrayStruct *arrayP = procArray;

which corresponds directly to what the problem is: the storage arrayP
is pointing at may change asynchronously.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tom Lane (#5)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

On Thu, Feb 25, 2016 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

On Wed, Feb 24, 2016 at 10:52 PM, Andres Freund <andres@anarazel.de>

wrote:

At the very least ISTM that we have to make pgprocno volatile (or use a

memory barrier - but we don't have sufficient support for those in the
older branches), and move the PGPROC/PGXACT lookups after the == -1
check.

Use of volatile doesn't change the resulting code dramatically for me.

Marking pgprocno volatile is silly. What *is* missing is this:

-       ProcArrayStruct *arrayP = procArray;
+       volatile ProcArrayStruct *arrayP = procArray;

which corresponds directly to what the problem is: the storage arrayP
is pointing at may change asynchronously.

Right, this makes a lot more sense.

--
Alex

#7Andres Freund
andres@anarazel.de
In reply to: Chris Tessels (#1)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

Hi,

On 2016-02-24 12:58:24 +0000, chris.tessels@inergy.nl wrote:

Core was generated by `postgres: mailinfo_ow mailinfo_ods 10.50.6.6(4188'.
Program terminated with signal 11, Segmentation fault.

#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)

Could you get us a full backtrace here? It's interesting that you're in
an XLogFlush()...

Postgresql version:
commit_delay 50 configuration file
commit_siblings 50 configuration file

FWIW, these look unlikely to be beneficial. How did you end up with
those parameters?

Greetings,

Andres Freund

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

On 2016-02-25 09:51:49 -0500, Tom Lane wrote:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

On Wed, Feb 24, 2016 at 10:52 PM, Andres Freund <andres@anarazel.de> wrote:
At the very least ISTM that we have to make pgprocno volatile (or use a

memory barrier - but we don't have sufficient support for those in the
older branches), and move the PGPROC/PGXACT lookups after the == -1
check.

Use of volatile doesn't change the resulting code dramatically for me.

Marking pgprocno volatile is silly. What *is* missing is this:

-	ProcArrayStruct *arrayP = procArray;
+	volatile ProcArrayStruct *arrayP = procArray;

Well, that'll also force arrayP->numProcs to be loaded from memory every
loop iteration. Not sure if we really want that. Otherwise, yes, that's
pretty much what I'm suggesting, although I think a temporary volatile
cast when loading pgprocno is better here. The important part is to
force that pgprocno is loaded *once* *before* the checks.

What bothers me about this right now is that we currently write the
pgprocno array with:
memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
(arrayP->numProcs - index) * sizeof(int));
arrayP->pgprocnos[index] = proc->pgprocno;
arrayP->numProcs++;
afaics there's absolutely no guarantee that memmov() will only use
aligned sizeof(int) writes. Sure, efficiency reasons make that a sane
choice, but a memmove() doing sizeof(char) wide moves would still be
correct.

Which afaics can mean we end up with completely bogus pgprocno
values. Consider e.g. 0x0000ff00 being moved where previously 0x000000ff
was set - we could temporarily end up with 0x0000ffff; which might be
above MaxBackends.

Regards,

Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Andres Freund
andres@anarazel.de
In reply to: Shulgin, Oleksandr (#4)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

On 2016-02-25 09:02:07 +0100, Shulgin, Oleksandr wrote:

On Wed, Feb 24, 2016 at 10:52 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-24 17:52:37 -0300, Alvaro Herrera wrote:

chris.tessels@inergy.nl wrote:

Core was generated by `postgres: mailinfo_ow mailinfo_ods

10.50.6.6(4188'.

Program terminated with signal 11, Segmentation fault.

#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)

It's not surprising that you're not able to make this crash
consistently, because it looks like the problem might be in concurrent
modifications to the PGXACT array. This routine, MinimumActiveBackends,
walks the PGPROC array explicitely without locks. There are comments
indicating that this is safe, but evidently something has slipped in
there.

Apparently this code is trying to dereference an invalid pgxact, but
it's not clear to me how this happens. Those structs are allocated in
advance, and they are referenced in the code via array indexes, so even
if the pgxact doesn't actually hold data about a valid transaction,
dereferencing the XID shouldn't cause a crash.

Well, that code is pretty, uh, questionable. E.g. for
int pgprocno =
arrayP->pgprocnos[index];
volatile PGPROC *proc = &allProcs[pgprocno];
volatile PGXACT *pgxact = &allPgXact[pgprocno];
there's no guarantee that pgprocno is actually the same index for both
lookups and the following
if (pgprocno == -1)
continue; /* do not count
deleted entries */
check. It's perfectly reasonable for a compiler to reload pgprocno from
memory, or just always reference it via memory.

Hm... this is against my understanding of what a compiler could (or
should) do. Do you have a documentation reference or other evidence?

Which part does not conform to your expectations? Moving stores/loads
around from where they're apparently happening in the C program?
Repeatedly reading from memory instead of storing something on the
stack?

All of those are side effect free for single-threaded programs, which
pretty much is the gold standard for doing optimizations in C (at least
pre-C11, but even there the above all would be possible).

Regards,

Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

Andres Freund <andres@anarazel.de> writes:

On 2016-02-25 09:51:49 -0500, Tom Lane wrote:

Marking pgprocno volatile is silly. What *is* missing is this:

-	ProcArrayStruct *arrayP = procArray;
+	volatile ProcArrayStruct *arrayP = procArray;

Well, that'll also force arrayP->numProcs to be loaded from memory every
loop iteration. Not sure if we really want that.

I think we do. The entire point here is not to assume that that storage
isn't changing.

What bothers me about this right now is that we currently write the
pgprocno array with:
memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
(arrayP->numProcs - index) * sizeof(int));
arrayP->pgprocnos[index] = proc->pgprocno;
arrayP->numProcs++;
afaics there's absolutely no guarantee that memmov() will only use
aligned sizeof(int) writes.

Ugh. That's a separate problem, but yes, it's a problem.

Seems like we can either (1) get rid of that memmove in favor of
a handwritten loop, or (2) give up on unlocked access to the
pgprocnos array. Which performance hit would you rather take?

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#11Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

Hi,

On 2016-02-25 12:20:06 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2016-02-25 09:51:49 -0500, Tom Lane wrote:

Marking pgprocno volatile is silly. What *is* missing is this:

-	ProcArrayStruct *arrayP = procArray;
+	volatile ProcArrayStruct *arrayP = procArray;

Well, that'll also force arrayP->numProcs to be loaded from memory every
loop iteration. Not sure if we really want that.

I think we do. The entire point here is not to assume that that storage
isn't changing.

Well, but what does it buy us? Given there's no locks involved, the
index < arrayP->numProcs check fundamentally cannot be anything than an
optimization, preventing us from looking at all PGPROC/PGXACT entries,
no? Reloading it at every loop iteration doesn't seem to give us
anything other than some sense of false security? It's a) perfectly
possible that numProcs is outdated because we're hitting a local
cacheline whereas another backend has it modified locally (store
buffers) b) that it's decremented after we entered the loop.

What bothers me about this right now is that we currently write the
pgprocno array with:
memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
(arrayP->numProcs - index) * sizeof(int));
arrayP->pgprocnos[index] = proc->pgprocno;
arrayP->numProcs++;
afaics there's absolutely no guarantee that memmov() will only use
aligned sizeof(int) writes.

Ugh. That's a separate problem, but yes, it's a problem.

While it's a separate bug, it seems to me that this could just as well
be the reason for crashes triggering this report :/. If pgprocno points
behind the end of the allPgXact array, it's quite possible that we're
reading zeroes :(

Seems like we can either (1) get rid of that memmove in favor of
a handwritten loop, or (2) give up on unlocked access to the
pgprocnos array. Which performance hit would you rather take?

I think 1), while it'll be noticeable, it's probably going to degrade
much more gracefully.

I think we additionally also should add a security check to
MinimumActiveBackends that prevents a pgprocno above the max number of
backends to be seen as valid.

- Andres

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#12Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Andres Freund (#9)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

On Thu, Feb 25, 2016 at 6:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-25 09:02:07 +0100, Shulgin, Oleksandr wrote:

On Wed, Feb 24, 2016 at 10:52 PM, Andres Freund <andres@anarazel.de>

wrote:

Hm... this is against my understanding of what a compiler could (or
should) do. Do you have a documentation reference or other evidence?

Which part does not conform to your expectations? Moving stores/loads
around from where they're apparently happening in the C program?

Repeatedly reading from memory instead of storing something on the
stack?

This one. But now that I think about it, this can buy you a free register,
so yeah it can be an optimization in some contexts.

How does compiler know this is not multi-threaded program? Only because we
don't specify any -pthread flags (assuming gcc)? Or it doesn't know/care
at all and we have to specifically turn off thread-unsafe optimizations?

--
Alex

#13Chris Tessels
chris.tessels@inergy.nl
In reply to: Andres Freund (#7)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

Andres Freund wrote

Hi,

On 2016-02-24 12:58:24 +0000,

chris.tessels@

wrote:

Core was generated by `postgres: mailinfo_ow mailinfo_ods
10.50.6.6(4188'.
Program terminated with signal 11, Segmentation fault.

#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)

Could you get us a full backtrace here? It's interesting that you're in
an XLogFlush()...

Postgresql version:
commit_delay 50 configuration file
commit_siblings 50 configuration file

FWIW, these look unlikely to be beneficial. How did you end up with
those parameters?

Greetings,

Andres Freund

--
Sent via pgsql-bugs mailing list (

pgsql-bugs@

)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Hi Andres,

The full backtrace:

Core was generated by `postgres: mailinfo_subscription_ow mailinfo_subsc'.
Program terminated with signal 11, Segmentation fault.
#0 MinimumActiveBackends (min=50) at procarray.c:2472
2472 if (pgxact->xid == InvalidTransactionId)
(gdb) bt
#0 MinimumActiveBackends (min=50) at procarray.c:2472
#1 0x00000000004c5bfb in XLogFlush (record=79323083800) at xlog.c:2856
#2 0x00000000004bea27 in EndPrepare (gxact=0x7fc2ccf22a00) at
twophase.c:1125
#3 0x00000000004b6b02 in PrepareTransaction () at xact.c:2222
#4 0x00000000004b6e07 in CommitTransactionCommand () at xact.c:2738
#5 0x0000000000685719 in finish_xact_command () at postgres.c:2437
#6 0x00000000006899b0 in exec_execute_message (argc=<value optimized out>,
argv=<value optimized out>, dbname=0x24e2c30 "mailinfo_subscription_ods",
username=<value optimized out>) at postgres.c:1974
#7 PostgresMain (argc=<value optimized out>, argv=<value optimized out>,
dbname=0x24e2c30 "mailinfo_subscription_ods", username=<value optimized
out>) at postgres.c:4142
#8 0x0000000000634929 in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:4285
#9 BackendStartup (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:3948
#10 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at
postmaster.c:1679
#11 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1287
#12 0x00000000005cc248 in main (argc=3, argv=0x24b87e0) at main.c:228

I'm not sure why we deviated from the default settings for 'commit_delay'
and 'commit_siblings'.
I went through our documentation and could not find an entry why this
setting was changed.
We are running a test with the default settings to see if this has any
impact.

Regards, Chris

--
View this message in context: http://postgresql.nabble.com/BUG-13985-Segmentation-fault-on-PREPARE-TRANSACTION-tp5889158p5889384.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

Andres Freund <andres@anarazel.de> writes:

On 2016-02-25 12:20:06 -0500, Tom Lane wrote:

Seems like we can either (1) get rid of that memmove in favor of
a handwritten loop, or (2) give up on unlocked access to the
pgprocnos array. Which performance hit would you rather take?

I think 1), while it'll be noticeable, it's probably going to degrade
much more gracefully.

I think we additionally also should add a security check to
MinimumActiveBackends that prevents a pgprocno above the max number of
backends to be seen as valid.

So we have three proposals on the table:
* volatile-ize arrayP
* range-check the pgprocno after we fetch it
* replace memmove(s) with int-at-a-time move loop(s)

I wonder whether we couldn't leave the memmoves alone if we do the other
two things. It would mean that MinimumActiveBackends would sometimes
miss examining some backends, but that's going to happen anyway given
its use of unlocked access.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs