BUG #14295: Hot standby crash during tsvector rebuild

Started by Spencer Thomasonover 9 years ago11 messagesbugs
Jump to latest
#1Spencer Thomason
spencer@whiteskycommunications.com

The following bug has been logged on the website:

Bug reference: 14295
Logged by: Spencer Thomason
Email address: spencer@whiteskycommunications.com
PostgreSQL version: 9.4.8
Operating system: Solaris 11.3 SPARC
Description:

Hello,
We have a fax server application that performs OCR and uses full text
search. We have a nightly cron job that purges the stored text for old
records to save space. The table contains approx 60K rows and the job
deletes the text from around 100-150 records each night and then updates a
tsvector column. This database is replicated to 4 hot standby slaves, 2 in
the same data center and 2 in a remote location. We have observed a few
crashes of all the hot standby server when this cron job runs. All replicas
crash at the same time with the following log entries below. The master
appears unaffected and continues to function normally. Unfortunately I'm
unable to reproduce at will but I have observed this with 9.5.3 then we
downgraded to 9.4.8 and the crash just happened again.

Logs from one of the replicas is below:
2016-08-26 06:01:50 UTC FATAL: unexpected GIN leaf action: 0
2016-08-26 06:01:50 UTC CONTEXT: xlog redo Insert item, node:
1663/16387/33108 blkno: 6622 isdata: T isleaf: T 3 segments: 2 (add 0 items)
0 unknown action 0 ???
2016-08-26 06:01:50 UTC LOG: startup process (PID 19593) exited with exit
code 1
2016-08-26 06:01:50 UTC LOG: terminating any other active server
processes
2016-08-26 06:01:50 UTC WARNING: terminating connection because of crash of
another server process

The table schema is:
Column | Type | Modifiers
--------------+--------------------------+-----------
id | uuid | not null
fax_did_id | character varying(64) | not null
file_path | character varying(255) |
time_stamp | timestamp with time zone | not null
remote_id | character varying(60) |
caller_id | character varying(40) |
num_pages | integer |
ocr_text | text |
search_index | tsvector |
account_id | integer |
Indexes:
"fax_rxfax_pkey" PRIMARY KEY, btree (id)
"fax_rxfax_account_id" btree (account_id)
"fax_rxfax_fax_did_id" btree (fax_did_id)
"fax_rxfax_fax_did_id_like" btree (fax_did_id varchar_pattern_ops)
"fax_rxfax_search_index" gin (search_index)
Foreign-key constraints:
"fax_rxfax_account_id_fkey" FOREIGN KEY (account_id) REFERENCES
accounts(id) DEFERRABLE INITIALLY DEFERRED
"fax_rxfax_fax_did_id_fkey" FOREIGN KEY (fax_did_id) REFERENCES
dids(number) DEFERRABLE INITIALLY DEFERRED

Please let me know if there is any other info I can provide.

Thanks,
Spencer

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Spencer Thomason (#1)
Re: BUG #14295: Hot standby crash during tsvector rebuild

spencer@whiteskycommunications.com writes:

Logs from one of the replicas is below:
2016-08-26 06:01:50 UTC FATAL: unexpected GIN leaf action: 0
2016-08-26 06:01:50 UTC CONTEXT: xlog redo Insert item, node:
1663/16387/33108 blkno: 6622 isdata: T isleaf: T 3 segments: 2 (add 0 items)
0 unknown action 0 ???

Hmm, we have seen a couple of reports of that recently but have not been
able to track it down. Can you provide more details about what you're
doing that triggers it? Maybe even a self-contained test case? It
doesn't have to be one that fails every time, as long as it'll fail
occasionally.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Spencer Thomason
spencer@whiteskycommunications.com
In reply to: Tom Lane (#2)
Re: BUG #14295: Hot standby crash during tsvector rebuild

Hi Tom,
I'm working on labbing this up and hopefully I can replicate it outside of our production environment.

We have a text column that contains fair amount of text (e.g. generated from a fax of maybe 10-25 pages) and then a tsvector column of that text with a gin index. To improve performance, we delete text on the old records nightly.

This appears to be related to number of records updated and the size of the update to the tsvector column. Hopefully I can provide more details and a test case soon.

Thanks,
Spencer

On Aug 26, 2016, at 4:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

spencer@whiteskycommunications.com writes:

Logs from one of the replicas is below:
2016-08-26 06:01:50 UTC FATAL: unexpected GIN leaf action: 0
2016-08-26 06:01:50 UTC CONTEXT: xlog redo Insert item, node:
1663/16387/33108 blkno: 6622 isdata: T isleaf: T 3 segments: 2 (add 0 items)
0 unknown action 0 ???

Hmm, we have seen a couple of reports of that recently but have not been
able to track it down. Can you provide more details about what you're
doing that triggers it? Maybe even a self-contained test case? It
doesn't have to be one that fails every time, as long as it'll fail
occasionally.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Spencer Thomason
spencer@whiteskycommunications.com
In reply to: Spencer Thomason (#3)
Re: BUG #14295: Hot standby crash during tsvector rebuild

Hi Tom,
I have a python script which simulates our production environment and so far I have been able to replicate the failure on demand. What is the best way to make this available?

Also, I can make this sandbox environment available for remote access if that would help.

Thanks!
Spencer

On Aug 29, 2016, at 10:32 AM, Spencer Thomason <spencer@whiteskycommunications.com> wrote:

Hi Tom,
I'm working on labbing this up and hopefully I can replicate it outside of our production environment.

We have a text column that contains fair amount of text (e.g. generated from a fax of maybe 10-25 pages) and then a tsvector column of that text with a gin index. To improve performance, we delete text on the old records nightly.

This appears to be related to number of records updated and the size of the update to the tsvector column. Hopefully I can provide more details and a test case soon.

Thanks,
Spencer

On Aug 26, 2016, at 4:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

spencer@whiteskycommunications.com writes:

Logs from one of the replicas is below:
2016-08-26 06:01:50 UTC FATAL: unexpected GIN leaf action: 0
2016-08-26 06:01:50 UTC CONTEXT: xlog redo Insert item, node:
1663/16387/33108 blkno: 6622 isdata: T isleaf: T 3 segments: 2 (add 0 items)
0 unknown action 0 ???

Hmm, we have seen a couple of reports of that recently but have not been
able to track it down. Can you provide more details about what you're
doing that triggers it? Maybe even a self-contained test case? It
doesn't have to be one that fails every time, as long as it'll fail
occasionally.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Spencer Thomason (#4)
Re: BUG #14295: Hot standby crash during tsvector rebuild

Spencer Thomason <spencer@whiteskycommunications.com> writes:

I have a python script which simulates our production environment and so far I have been able to replicate the failure on demand. What is the best way to make this available?

If it's not too large, posting it to this list would be great (we like to
have archived documentation about bugs).

If it is big, or you would rather not make it public, you can send it
to me off-list.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Spencer Thomason
spencer@whiteskycommunications.com
In reply to: Tom Lane (#5)
Re: BUG #14295: Hot standby crash during tsvector rebuild

Please see the attachments.

I have included the master and slave configs and python scripts that trigger this on our systems.

The way to reproduce this are:
- create a new database
- setup streaming replication to 2 or more slaves
- run the populate_db.py script
- if slaves do not crash, run the purge_db.py script followed by the populate_db.py script again.

I didn’t see failures immediately but within 15-20 mins of inserting records, the slaves crashed.

Also, I should note that this is on a SPARC T2 at 1.4GHz so the single thread performance might be a factor as well.

Attachments:

create-db.sqlapplication/octet-stream; name=create-db.sqlDownload
postgresql-slave.confapplication/octet-stream; name=postgresql-slave.confDownload
postgresql-master.confapplication/octet-stream; name=postgresql-master.confDownload
words.txttext/plain; name=words.txtDownload
populate_db.pytext/x-python-script; name=populate_db.pyDownload
purge_db.pytext/x-python-script; name=purge_db.pyDownload
ATT00001.txttext/plain; name=ATT00001.txtDownload
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Spencer Thomason (#6)
Re: BUG #14295: Hot standby crash during tsvector rebuild

Spencer Thomason <spencer@whiteskycommunications.com> writes:

I have included the master and slave configs and python scripts that trigger this on our systems.

Thanks for sending this. Unfortunately I've had zero success reproducing
the problem so far. I had about run out of ideas as to what might explain
why it fails for you and not me, when I noticed this:

Also, I should note that this is on a SPARC T2 at 1.4GHz so the single thread performance might be a factor as well.

Am I right in thinking that is a 32-bit machine? That might have
something to do with it (notably because of different maxalign).
Is the production machine you originally saw the problem on also 32-bit?

While I'm asking questions, could you send along the output of pg_config
for the build you're using?

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8Spencer Thomason
spencer@whiteskycommunications.com
In reply to: Tom Lane (#7)
Re: BUG #14295: Hot standby crash during tsvector rebuild

Hi Tom,
This a 64 bit machine and I’m happy to provide remote access if that helps. Requested info below. Thanks for the follow up!

$ uname -a
SunOS pg-test-1.wtsky.net 5.11 11.3 sun4v sparc SUNW,T5240 Solaris

$ pg_config
BINDIR = /usr/postgresql/9.4/bin
DOCDIR = /usr/share/doc/postgresql-doc-9.4
HTMLDIR = /usr/share/doc/postgresql-doc-9.4
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/server
LIBDIR = /usr/lib/sparcv9
PKGLIBDIR = /usr/lib/sparcv9/postgresql
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/9.4/man
SHAREDIR = /usr/share/postgresql/9.4
SYSCONFDIR = /etc/postgresql/9.4
PGXS = /usr/lib/sparcv9/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = 'CC=/opt/solarisstudio12.4/bin/cc' 'CFLAGS=-m64 -xO5 -fast -fsimple=0 -xalias_level=any' 'LDFLAGS=-m64' '--with-libedit-preferred' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--mandir=/usr/share/postgresql/9.4/man' '--docdir=/usr/share/doc/postgresql-doc-9.4' '--sysconfdir=/etc/postgresql/9.4' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.4' '--bindir=/usr/postgresql/9.4/bin' '--libdir=/usr/lib/sparcv9/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-dtrace' 'DTRACEFLAGS=-64' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--disable-rpath' '--with-uuid=e2fs' '--with-pgport=5432' '--with-system-tzdata=/usr/share/lib/zoneinfo'
CC = /opt/solarisstudio12.4/bin/cc -Xa
CPPFLAGS = -I/usr/include/libxml2
CFLAGS = -m64 -xO5 -fast -fsimple=0 -xalias_level=any
CFLAGS_SL = -KPIC
LDFLAGS = -L../../../src/common -m64 -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lxslt -lxml2 -lpam -lssl -lcrypto -lz -ledit -lnsl -lsocket -lm
VERSION = PostgreSQL 9.4.8

On Sep 2, 2016, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Spencer Thomason <spencer@whiteskycommunications.com> writes:

I have included the master and slave configs and python scripts that trigger this on our systems.

Thanks for sending this. Unfortunately I've had zero success reproducing
the problem so far. I had about run out of ideas as to what might explain
why it fails for you and not me, when I noticed this:

Also, I should note that this is on a SPARC T2 at 1.4GHz so the single thread performance might be a factor as well.

Am I right in thinking that is a 32-bit machine? That might have
something to do with it (notably because of different maxalign).
Is the production machine you originally saw the problem on also 32-bit?

While I'm asking questions, could you send along the output of pg_config
for the build you're using?

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: BUG #14295: Hot standby crash during tsvector rebuild

I wrote:

Spencer Thomason <spencer@whiteskycommunications.com> writes:

Also, I should note that this is on a SPARC T2 at 1.4GHz so the single thread performance might be a factor as well.

Am I right in thinking that is a 32-bit machine? That might have
something to do with it (notably because of different maxalign).

Awhile later it occurred to me that SPARCs are generally big-endian,
which led me to try your example on an old HPPA box, and kaboom!
I've now traced it to this bit in gindatapage.c:

int nmodifieditems;

...

memcpy(walbufend, &seginfo->nmodifieditems, sizeof(uint16));

which of course works fine on little-endian hardware and not at all on
big-endian. There might be more bugs (takes a while to run your example
on that old dinosaur :-() but this one is sufficient to explain the
known symptoms.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: BUG #14295: Hot standby crash during tsvector rebuild

I wrote:

I've now traced it to this bit in gindatapage.c:

The attached patch (against 9.4) should be sufficient to fix this problem.
Perhaps you can do some testing there while I'm doing the same.

regards, tom lane

Attachments:

gin-big-endian-bug.patchtext/x-diff; charset=us-ascii; name=gin-big-endian-bug.patchDownload+1-1
#11Spencer Thomason
spencer@whiteskycommunications.com
In reply to: Tom Lane (#10)
Re: BUG #14295: Hot standby crash during tsvector rebuild

Hi Tom,
I've been testing this from some time and this appears to resolve the issue. Thanks for the quick fix!

Best regards,
Spencer

On Sep 3, 2016, at 8:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I've now traced it to this bit in gindatapage.c:

The attached patch (against 9.4) should be sufficient to fix this problem.
Perhaps you can do some testing there while I'm doing the same.

regards, tom lane

diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 2090209..77725ac 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -86,7 +86,7 @@ typedef struct
char		action;
ItemPointerData *modifieditems;
-	int			nmodifieditems;
+	uint16		nmodifieditems;

/*
* The following fields represent the items in this segment. If 'items' is

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs