[TRAP: FailedAssertion] causing server to crash

Started by Neha Sharmaover 8 years ago19 messages
#1Neha Sharma
neha.sharma@enterprisedb.com
2 attachment(s)

Hi,

I am getting FailedAssertion while executing the attached script.However,I
am not able to produce the core dump for the same,the script runs in
background and takes around a day time to produce the mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG: server process (PID 28084) was
terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL: Failed process was running:
autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"

Log file attached for reference.

Thanks.
--
Regards,
Neha Sharma

Attachments:

nohup.outapplication/octet-stream; name=nohup.outDownload
test_script.shapplication/x-sh; name=test_script.shDownload
#2Robert Haas
robertmhaas@gmail.com
In reply to: Neha Sharma (#1)
Re: [TRAP: FailedAssertion] causing server to crash

On Wed, Jul 19, 2017 at 8:26 AM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:

I am getting FailedAssertion while executing the attached script.However,I
am not able to produce the core dump for the same,the script runs in
background and takes around a day time to produce the mentioned error.

Maybe you should run it for another day with core dumps enabled (try
the "-c" option to pg_ctl) and then pull a backtrace if you get one.
This is a pretty generic script so it's going to be hard to guess
where the problem is otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#2)
Re: [TRAP: FailedAssertion] causing server to crash

Robert Haas wrote:

On Wed, Jul 19, 2017 at 8:26 AM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:

I am getting FailedAssertion while executing the attached script.However,I
am not able to produce the core dump for the same,the script runs in
background and takes around a day time to produce the mentioned error.

Maybe you should run it for another day with core dumps enabled (try
the "-c" option to pg_ctl) and then pull a backtrace if you get one.
This is a pretty generic script so it's going to be hard to guess
where the problem is otherwise.

Also, please state server version used.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Craig Ringer
craig@2ndquadrant.com
In reply to: Neha Sharma (#1)
Re: [TRAP: FailedAssertion] causing server to crash

On 19 July 2017 at 20:26, Neha Sharma <neha.sharma@enterprisedb.com> wrote:

Hi,

I am getting FailedAssertion while executing the attached script.However,I
am not able to produce the core dump for the same,the script runs in
background and takes around a day time to produce the mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG: server process (PID 28084) was
terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL: Failed process was running:
autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"

It's not much of a stretch to assume this was likely introduced
by ea42cc18c:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ea42cc18c35381f639d45628d792e790ff39e271

so I'll have a look to see if I can spot how it could happen.

Please show your "SELECT version()" and if using a build from git, the "git
rev-parse --short HEAD" for the build you're running.

It's interesting and probably significant that your Pg was nearing
wraparound. Just a tip for that - if you want to create a situation where
you near wraparound quickly and easily, for testing, you can, on a test
instance you don't care about:

- select txid_current();
- stop pg
- pg_resetxlog -n $(( $the_xid + (1<<31) - 1001000 ))
- try to start pg, note the missing clog segment it complains about
- dd if=/dev/zero bs=1 count=262144 of=datadir/pg_clog/$MISSINGSEGNAME
- start Pg

That should put you about 1000 txn's from the 1 million xid limit, assuming
I got my maths right (don't assume that), and assuming your starting xid is
well short of the 2-million mark.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#4)
Re: [TRAP: FailedAssertion] causing server to crash

On 19 July 2017 at 20:26, Neha Sharma <neha.sharma@enterprisedb.com>
wrote:

Hi,

I am getting FailedAssertion while executing the attached
script.However,I am not able to produce the core dump for the same,the
script runs in background and takes around a day time to produce the
mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG: server process (PID 28084) was
terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL: Failed process was running:
autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"

What are the starting conditions of your postgres instance? Does your
script assume a newly initdb'd instance with no custom configuration? If
not, what setup steps/configuration precede your script run?

well short of the 2-million mark.

Er, billion.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Craig Ringer (#5)
Re: [TRAP: FailedAssertion] causing server to crash

Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I
was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below.
Instead of starting it with postgres binary i have set it in conf file and
starting the server with pg_ctl. I am waiting for the results,once the core
dump is generated will share the details.

Test code Snippet :

#Pre condition:
#Create and initialize the database and set the export PGDATA
export PGDATA=/home/centos/PG_sources/postgresql/inst/bin/data
export PGPORT=5432
export
LD_LIBRARY_PATH=/home/centos/PG_sources/postgresql/inst/lib:$LD_LIBRARY_PATH
#for i in "scale_factor shared_buffers time_for_readings no_of_readings
orig_or_patch"
for i in "300 8GB 1800 3"
do
scale_factor=`echo $i | cut -d" " -f1`
shared_bufs=`echo $i | cut -d" " -f2`
time_for_reading=`echo $i | cut -d" " -f3`
no_of_readings=`echo $i | cut -d" " -f4`

# -----------------------------------------------

echo "Start of script for $scale_factor $shared_bufs " >>
/home/centos/test_results.txt

echo "============== $run_bin =============" >>
/home/centos/test_results.txt

for threads in 1 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
#for threads in 8 16
do
#Start taking reading
for ((readcnt = 1 ; readcnt <= $no_of_readings ; readcnt++))
do
echo "================================================" >>
/home/centos/test_results.txt
echo $scale_factor, $shared_bufs, $threads, $threads,
$time_for_reading Reading - ${readcnt} >> /home/centos/test_results.txt
#start server
./pg_ctl -D data -c -l logfile start
#./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
sleep 5
#drop and recreate database
./dropdb test
./createdb test
#initialize database
./pgbench -i -s $scale_factor test
sleep 5
# Run pgbench
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared
test >> /home/centos/test_results.txt
sleep 10
./psql -d test -c "checkpoint" >> /home/centos/test_results.txt
./pg_ctl stop
done;
done;

sleep 1

mv /home/centos/test_results.txt
/home/centos/test_results_list_${scale_factor}_${shared_bufs}_rw.txt
done;

Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 11:23 AM, Craig Ringer <craig@2ndquadrant.com>
wrote:

Show quoted text

On 19 July 2017 at 20:26, Neha Sharma <neha.sharma@enterprisedb.com>

wrote:

Hi,

I am getting FailedAssertion while executing the attached
script.However,I am not able to produce the core dump for the same,the
script runs in background and takes around a day time to produce the
mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG: server process (PID 28084) was
terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL: Failed process was running:
autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"

What are the starting conditions of your postgres instance? Does your
script assume a newly initdb'd instance with no custom configuration? If
not, what setup steps/configuration precede your script run?

well short of the 2-million mark.

Er, billion.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Craig Ringer
craig@2ndquadrant.com
In reply to: Neha Sharma (#6)
Re: [TRAP: FailedAssertion] causing server to crash

On 20 July 2017 at 15:00, Neha Sharma <neha.sharma@enterprisedb.com> wrote:

Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I
was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below.
Instead of starting it with postgres binary i have set it in conf file and
starting the server with pg_ctl. I am waiting for the results,once the core
dump is generated will share the details.

Thanks.

To verify that you do get a coredump, you might want to consider sending a
kill -SEGV to a backend and make sure that it actually dumps core and you
can find the core.

Ideally you'd actually set the coredumps to include shmem (see
coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but
with 8GB shared_buffers that may not be practical. It'd be very useful if
possible.

If this is wraparound-related, as it appears to be, you might get faster
results by using a custom pgbench script for one or more workers that just
runs txid_current() a whole lot. Or jump the server's xid space forward.

I've got a few other things on right now but I'll keep an eye out and hope
for a core dump.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Craig Ringer (#7)
Re: [TRAP: FailedAssertion] causing server to crash

Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 20 July 2017 at 15:00, Neha Sharma <neha.sharma@enterprisedb.com>
wrote:

Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I
was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below.
Instead of starting it with postgres binary i have set it in conf file and
starting the server with pg_ctl. I am waiting for the results,once the core
dump is generated will share the details.

Thanks.

To verify that you do get a coredump, you might want to consider sending a
kill -SEGV to a backend and make sure that it actually dumps core and you
can find the core.

Ideally you'd actually set the coredumps to include shmem (see
coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but
with 8GB shared_buffers that may not be practical. It'd be very useful if
possible.

If this is wraparound-related, as it appears to be, you might get faster
results by using a custom pgbench script for one or more workers that just
runs txid_current() a whole lot. Or jump the server's xid space forward.

Thanks. Will put together suggestions to get the result.

Show quoted text

I've got a few other things on right now but I'll keep an eye out and hope
for a core dump.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Neha Sharma (#8)
Re: [TRAP: FailedAssertion] causing server to crash

Hi,

Attached is the core dump file received on PG 10beta2 version.

myfile.tgz
<https://drive.google.com/a/enterprisedb.com/file/d/0B4_zX36siXF0MHJTTjZHZjNtN1E/view?usp=drive_web&gt;

Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 2:45 PM, Neha Sharma <neha.sharma@enterprisedb.com>
wrote:

Show quoted text

Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer <craig@2ndquadrant.com>
wrote:

On 20 July 2017 at 15:00, Neha Sharma <neha.sharma@enterprisedb.com>
wrote:

Hi Craig,

I had done a fresh initdb,the default parameter configuration was used.
I was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below.
Instead of starting it with postgres binary i have set it in conf file and
starting the server with pg_ctl. I am waiting for the results,once the core
dump is generated will share the details.

Thanks.

To verify that you do get a coredump, you might want to consider sending
a kill -SEGV to a backend and make sure that it actually dumps core and you
can find the core.

Ideally you'd actually set the coredumps to include shmem (see
coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html),
but with 8GB shared_buffers that may not be practical. It'd be very useful
if possible.

If this is wraparound-related, as it appears to be, you might get faster
results by using a custom pgbench script for one or more workers that just
runs txid_current() a whole lot. Or jump the server's xid space forward.

Thanks. Will put together suggestions to get the result.

I've got a few other things on right now but I'll keep an eye out and
hope for a core dump.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Neha Sharma (#9)
Re: [TRAP: FailedAssertion] causing server to crash

On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:

Attached is the core dump file received on PG 10beta2 version.

Thanks Neha. It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG(). What am I missing here?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway? Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently. They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid. What am I missing here?

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Neha Sharma
neha.sharma@enterprisedb.com
In reply to: Thomas Munro (#10)
Re: [TRAP: FailedAssertion] causing server to crash

Here is the back trace from the core dump attached.

(gdb) bt
#0 0x00007f4a71424495 in raise () from /lib64/libc.so.6
#1 0x00007f4a71425c75 in abort () from /lib64/libc.so.6
#2 0x00000000009dc18a in ExceptionalCondition (conditionName=0xa905d0
"!(TransactionIdPrecedesOrEquals(oldestXact,
ShmemVariableCache->oldestXid))",
errorType=0xa9044f "FailedAssertion", fileName=0xa90448 "clog.c",
lineNumber=683) at assert.c:54
#3 0x0000000000524215 in TruncateCLOG (oldestXact=150036635,
oldestxid_datoid=13164) at clog.c:682
#4 0x00000000006a6be8 in vac_truncate_clog (frozenXID=150036635,
minMulti=1, lastSaneFrozenXid=200562449, lastSaneMinMulti=1) at
vacuum.c:1197
#5 0x00000000006a6948 in vac_update_datfrozenxid () at vacuum.c:1063
#6 0x00000000007ce0a2 in do_autovacuum () at autovacuum.c:2625
#7 0x00000000007cc987 in AutoVacWorkerMain (argc=0, argv=0x0) at
autovacuum.c:1715
#8 0x00000000007cc562 in StartAutoVacWorker () at autovacuum.c:1512
#9 0x00000000007e2acd in StartAutovacuumWorker () at postmaster.c:5414
#10 0x00000000007e257e in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5111
#11 <signal handler called>
#12 0x00007f4a714d3603 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000007dde88 in ServerLoop () at postmaster.c:1717
#14 0x00000000007dd67d in PostmasterMain (argc=3, argv=0x2eb8b00) at
postmaster.c:1361
#15 0x000000000071a218 in main (argc=3, argv=0x2eb8b00) at main.c:228
(gdb) print ShmemVariableCache->oldestXid
$3 = 548

Regards,
Neha Sharma

On Fri, Jul 21, 2017 at 11:01 AM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:

Show quoted text

On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:

Attached is the core dump file received on PG 10beta2 version.

Thanks Neha. It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG(). What am I missing here?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway? Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently. They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid. What am I missing here?

--
Thomas Munro
http://www.enterprisedb.com

#12Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Neha Sharma (#11)
2 attachment(s)
Re: [TRAP: FailedAssertion] causing server to crash

At Fri, 21 Jul 2017 11:39:38 +0530, Neha Sharma <neha.sharma@enterprisedb.com> wrote in <CANiYTQuZm+hDvuHB14d65SkL2ko98ESR3Jf2kUiX=m1haL=xrg@mail.gmail.com>

Here is the back trace from the core dump attached.

(gdb) bt
#0 0x00007f4a71424495 in raise () from /lib64/libc.so.6
#1 0x00007f4a71425c75 in abort () from /lib64/libc.so.6
#2 0x00000000009dc18a in ExceptionalCondition (conditionName=0xa905d0
"!(TransactionIdPrecedesOrEquals(oldestXact,
ShmemVariableCache->oldestXid))",
errorType=0xa9044f "FailedAssertion", fileName=0xa90448 "clog.c",
lineNumber=683) at assert.c:54
#3 0x0000000000524215 in TruncateCLOG (oldestXact=150036635,
oldestxid_datoid=13164) at clog.c:682

In vac_truncate_clog, TruncateCLOG is called before
SetTransactionIdLimit, which advances
ShmemVariableCache->oldestXid. Given that the assertion in
TruncateCLOG is valid, they should be called in reverse order. I
suppose that CLOG files can be safely truncated after advancing
XID limits.

By the way, the attached patch is made by "git diff --patience".

filterdiff converts it into somewhat wrong shape. Specifically,
the result is missing the addition part of the difference, as the
second attached patch. I'm not sure which of git(2.9.2) or
filterdiff (0.3.3), (or me?) is doing wrong..

#4 0x00000000006a6be8 in vac_truncate_clog (frozenXID=150036635,
minMulti=1, lastSaneFrozenXid=200562449, lastSaneMinMulti=1) at
vacuum.c:1197
#5 0x00000000006a6948 in vac_update_datfrozenxid () at vacuum.c:1063
#6 0x00000000007ce0a2 in do_autovacuum () at autovacuum.c:2625
#7 0x00000000007cc987 in AutoVacWorkerMain (argc=0, argv=0x0) at
autovacuum.c:1715
#8 0x00000000007cc562 in StartAutoVacWorker () at autovacuum.c:1512
#9 0x00000000007e2acd in StartAutovacuumWorker () at postmaster.c:5414
#10 0x00000000007e257e in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5111
#11 <signal handler called>
#12 0x00007f4a714d3603 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000007dde88 in ServerLoop () at postmaster.c:1717
#14 0x00000000007dd67d in PostmasterMain (argc=3, argv=0x2eb8b00) at
postmaster.c:1361
#15 0x000000000071a218 in main (argc=3, argv=0x2eb8b00) at main.c:228
(gdb) print ShmemVariableCache->oldestXid
$3 = 548

Regards,
Neha Sharma

On Fri, Jul 21, 2017 at 11:01 AM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:

On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:

Attached is the core dump file received on PG 10beta2 version.

Thanks Neha. It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG(). What am I missing here?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway? Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently. They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid. What am I missing here?

--
Thomas Munro
http://www.enterprisedb.com

--
堀口恭太郎

日本電信電話株式会社 NTTオープンソースソフトウェアセンタ
Phone: 03-5860-5115 / Fax: 03-5463-5490

Attachments:

truncateCLOG_after_advancing_xid_limits.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa1812..cd8be92 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1192,13 +1192,6 @@ vac_truncate_clog(TransactionId frozenXID,
 	AdvanceOldestCommitTsXid(frozenXID);
 
 	/*
-	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
-	 */
-	TruncateCLOG(frozenXID, oldestxid_datoid);
-	TruncateCommitTs(frozenXID);
-	TruncateMultiXact(minMulti, minmulti_datoid);
-
-	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
 	 * MultiXactIds.  Note: these functions will also signal the postmaster
 	 * for an(other) autovac cycle if needed.   XXX should we avoid possibly
@@ -1206,6 +1199,14 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+
+	/*
+	 * Truncate CLOG, multixact and CommitTs to the oldest computed value
+	 * after advancing xid limits.
+	 */
+	TruncateCLOG(frozenXID, oldestxid_datoid);
+	TruncateCommitTs(frozenXID);
+	TruncateMultiXact(minMulti, minmulti_datoid);
 }
 
 
BROKEN_truncateCLOG_after_advancing_xid_limits.patchtext/x-patch; charset=us-asciiDownload
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
***************
*** 1192,1204 **** vac_truncate_clog(TransactionId frozenXID,
  	AdvanceOldestCommitTsXid(frozenXID);
  
  	/*
- 	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
- 	 */
- 	TruncateCLOG(frozenXID, oldestxid_datoid);
- 	TruncateCommitTs(frozenXID);
- 	TruncateMultiXact(minMulti, minmulti_datoid);
- 
- 	/*
  	 * Update the wrap limit for GetNewTransactionId and creation of new
  	 * MultiXactIds.  Note: these functions will also signal the postmaster
  	 * for an(other) autovac cycle if needed.   XXX should we avoid possibly
--- 1192,1197 ----
#13Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#10)
Re: [TRAP: FailedAssertion] causing server to crash

On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Thanks Neha. It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG(). What am I missing here?

This problem was introduced by commit
ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
the PostgreSQL 10 open items list. That commit intended to introduce a
distinction between (1) the oldest XID that can be safely examined and
(2) the oldest XID that can't yet be safely reused. These are the
same except when we're in the middle of truncating CLOG: (1) advances
before the truncation, and (2) advances afterwards. That's why
AdvanceOldestClogXid() happens before truncation proper and
SetTransactionIdLimit() happens afterwards, and changing the order
would, I think, be quite wrong.

AFAICS, that assertion is simply a holdover from an earlier version of
the patch that escaped review. There's just no reason to suppose that
it's true.

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway? Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently. They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid. What am I missing here?

Hmm, there could be a bug there, but I don't think it's *this* bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#13)
Re: [TRAP: FailedAssertion] causing server to crash

On Thu, Aug 3, 2017 at 3:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Thanks Neha. It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG(). What am I missing here?

This problem was introduced by commit
ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
the PostgreSQL 10 open items list. That commit intended to introduce a
distinction between (1) the oldest XID that can be safely examined and
(2) the oldest XID that can't yet be safely reused. These are the
same except when we're in the middle of truncating CLOG: (1) advances
before the truncation, and (2) advances afterwards. That's why
AdvanceOldestClogXid() happens before truncation proper and
SetTransactionIdLimit() happens afterwards, and changing the order
would, I think, be quite wrong.

Added to open items.

AFAICS, that assertion is simply a holdover from an earlier version of
the patch that escaped review. There's just no reason to suppose that
it's true.

Craig, are you planning to post a patch to remove the assertion, or
make it hold?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway? Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently. They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid. What am I missing here?

Hmm, there could be a bug there, but I don't think it's *this* bug.

I'm not sure that it wrong per se, as long as no one asserts that the
number can't go backwards...

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Kyotaro HORIGUCHI (#12)
Re: [TRAP: FailedAssertion] causing server to crash

On Fri, Jul 21, 2017 at 7:17 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

In vac_truncate_clog, TruncateCLOG is called before
SetTransactionIdLimit, which advances
ShmemVariableCache->oldestXid. Given that the assertion in
TruncateCLOG is valid, they should be called in reverse order. I
suppose that CLOG files can be safely truncated after advancing
XID limits.

If we keep the assertion by changing the order of changes to match the
comment like this, then don't we still have a problem if another
backend moves it backwards because of the data race I mentioned? That
too could be fixed (perhaps by teaching SetTransactionIdLimit not to
overwrite higher values), but it sounds like the assertion might be a
mistake.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Craig Ringer
craig@2ndquadrant.com
In reply to: Thomas Munro (#15)
Re: [TRAP: FailedAssertion] causing server to crash

On 7 August 2017 at 14:04, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

On Fri, Jul 21, 2017 at 7:17 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

In vac_truncate_clog, TruncateCLOG is called before
SetTransactionIdLimit, which advances
ShmemVariableCache->oldestXid. Given that the assertion in
TruncateCLOG is valid, they should be called in reverse order. I
suppose that CLOG files can be safely truncated after advancing
XID limits.

If we keep the assertion by changing the order of changes to match the
comment like this, then don't we still have a problem if another
backend moves it backwards because of the data race I mentioned? That
too could be fixed (perhaps by teaching SetTransactionIdLimit not to
overwrite higher values), but it sounds like the assertion might be a
mistake.
<http://www.enterprisedb.com&gt;

I think so - specifically, that it's a leftover from a revision where the
xid limit was advanced before clog truncation.

I'll be finding time in the next couple of days to look more closely and
ensure that's all it is.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#14)
Re: [TRAP: FailedAssertion] causing server to crash

On Mon, Aug 07, 2017 at 05:29:34PM +1200, Thomas Munro wrote:

On Thu, Aug 3, 2017 at 3:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Thanks Neha. It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG(). What am I missing here?

This problem was introduced by commit
ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
the PostgreSQL 10 open items list. That commit intended to introduce a
distinction between (1) the oldest XID that can be safely examined and
(2) the oldest XID that can't yet be safely reused. These are the
same except when we're in the middle of truncating CLOG: (1) advances
before the truncation, and (2) advances afterwards. That's why
AdvanceOldestClogXid() happens before truncation proper and
SetTransactionIdLimit() happens afterwards, and changing the order
would, I think, be quite wrong.

Added to open items.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1]/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#16)
Re: [TRAP: FailedAssertion] causing server to crash

On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

I think so - specifically, that it's a leftover from a revision where the
xid limit was advanced before clog truncation.

I'll be finding time in the next couple of days to look more closely and
ensure that's all it is.

A couple of days having gone by without hearing further on this, I'm
going to assume for the moment that my analysis is correct and just
push a commit to remove this assertion. That's not intended to
discourage you from spending further time on this - it would be bad to
be wrong about this - but doing something we all agree is wrong can't
be better than doing something that I, at least, think is correct.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#18)
Re: [TRAP: FailedAssertion] causing server to crash

On 10 August 2017 at 23:25, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer <craig@2ndquadrant.com>
wrote:

I think so - specifically, that it's a leftover from a revision where the
xid limit was advanced before clog truncation.

I'll be finding time in the next couple of days to look more closely and
ensure that's all it is.

A couple of days having gone by without hearing further on this, I'm
going to assume for the moment that my analysis is correct and just
push a commit to remove this assertion. That's not intended to
discourage you from spending further time on this - it would be bad to
be wrong about this - but doing something we all agree is wrong can't
be better than doing something that I, at least, think is correct.

Thanks.

Keeps getting bumped down by other things, but it's on the stack to not get
lost

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services