emergency outage requiring database restart

Started by Merlin Moncure over 9 years ago (76 messages, pgsql-hackers)
#1 Merlin Moncure
mmoncure@gmail.com

Today I had an emergency production outage on a server. This
particular server was running 9.5.2. The symptoms were interesting
so I thought I'd report. Here is what I saw:

*) User CPU was pegged 100%
*) Queries reading data would block and not respond to cancel or terminate
*) pg_stat_activity reported no waiting queries (but worked fine otherwise).

Adding all this up it smells like processes were getting stuck on a spinlock.

Connections quickly got eaten up and the situation was desperately urgent,
so I punted and did an immediate restart, and things came back
normally. I had a console to the database and did manage to grab
contents of pg_stat_activity and noticed several trivial queries were
running normally (according to pg_stat_activity) but were otherwise
stuck. Attempting to run one of them myself, I noted query got stuck
and did not cancel. I was in a terrible rush but am casting around
for stuff to grab out in case that happens again -- 'perf top' would
be a natural choice I guess.

Three autovacuum processes were running. I'm obviously going to do a
bugfix upgrade, but I was wondering if anybody has seen anything like this.
This particular server was upgraded to 9.5 somewhat recently but ran
on 9.2 for years with no issues.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#1)
Re: emergency outage requiring database restart

Merlin Moncure <mmoncure@gmail.com> writes:

Today I had an emergency production outage on a server.
...
Adding all this up it smells like processes were getting stuck on a spinlock.

Maybe. If it happens again, probably the most useful debug data would
be stack traces from some of the busy processes.
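
In practice, collecting those can look something like this (a sketch; it
assumes gdb is installed along with the server debuginfo package, and the
output path is arbitrary):

```shell
# Collect backtraces from the five busiest postgres backends.
# Run as a user allowed to ptrace them (root or the postgres user).
out=/tmp/pg_backtraces.txt
: > "$out"
for pid in $(ps -eo pid,pcpu,comm --sort=-pcpu | awk '$3 == "postgres" {print $1}' | head -5); do
    {
        echo "=== backtrace for pid $pid ==="
        gdb -p "$pid" -batch -ex 'bt' 2>/dev/null
    } >> "$out"
done
echo "collected $(grep -c '^=== ' "$out") backtraces in $out"
```

'perf top' narrows down which function is spinning; the gdb backtraces show
how the backends got there.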

regards, tom lane


#3 Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#2)
Re: emergency outage requiring database restart

On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

Today I had an emergency production outage on a server.
...
Adding all this up it smells like processes were getting stuck on a spinlock.

Maybe. If it happens again, probably the most useful debug data would
be stack traces from some of the busy processes.

Another odd datapoint on this server. Things were running pretty good
but an application crashed on a missing view. Trying to recreate the
view, I got:

CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
SELECT ...

ERROR: type "vw_apartmentqueuelastgood" already exists
HINT: A relation has an associated type of the same name, so you must
use a name that doesn't conflict with any existing type.

...which was pretty strange. I had to manually delete the pg_type
record in order to create the view. I'm getting more reports of
'could not open relation with oid=X' errors so I could be facing data
corruption :(.
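
For reference, a catalog query along these lines can spot the orphaned type
(a sketch; a view's composite type should always have a pg_class row
matching its typrelid):

```sql
-- Find pg_type rows for the view whose backing pg_class entry is gone.
SELECT t.oid, t.typname, t.typrelid
FROM pg_type t
WHERE t.typname = 'vw_apartmentqueuelastgood'
  AND NOT EXISTS (SELECT 1 FROM pg_class c WHERE c.oid = t.typrelid);
```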

merlin


#4 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#3)
Re: emergency outage requiring database restart

On Mon, Oct 17, 2016 at 1:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

Today I had an emergency production outage on a server.
...
Adding all this up it smells like processes were getting stuck on a spinlock.

Maybe. If it happens again, probably the most useful debug data would
be stack traces from some of the busy processes.

Another odd datapoint on this server. Things were running pretty good
but an application crashed on a missing view. Trying to recreate the
view, I got:

CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
SELECT ...

ERROR: type "vw_apartmentqueuelastgood" already exists
HINT: A relation has an associated type of the same name, so you must
use a name that doesn't conflict with any existing type.

...which was pretty strange. I had to manually delete the pg_type
record in order to create the view. I'm getting more reports of
'could not open relation with oid=X' errors so I could be facing data
corruption :(.

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

merlin


#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Merlin Moncure (#4)
Re: emergency outage requiring database restart

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#6 Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#5)
Re: emergency outage requiring database restart

On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

Agreed. Current plan is to restore from backups and recover as much
data as I can. I'm also applying the bugfix release and going to enable
checksums.
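
Note that on 9.5 checksums can only be enabled by initdb, so that step means
loading the restored dump into a freshly initialized cluster
(initdb --data-checksums). Whether a cluster already has them can be checked
like this (a sketch; the PGDATA path is a guess, adjust as needed):

```shell
# Report whether data checksums are enabled for a cluster.
PGDATA=${PGDATA:-/var/lib/pgsql/9.5/data}
if command -v pg_controldata >/dev/null 2>&1; then
    # "Data page checksum version: 0" means checksums are off.
    status=$(pg_controldata "$PGDATA" 2>/dev/null | grep -i 'checksum' || true)
    echo "${status:-no checksum info (is PGDATA correct?)}"
else
    echo "pg_controldata not on PATH"
fi > /tmp/checksum_status.txt
cat /tmp/checksum_status.txt
```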

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

merlin


#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Merlin Moncure (#6)
Re: emergency outage requiring database restart

Merlin Moncure wrote:

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

Ah, so you're subject not only to hardware flaws but also to
virtualization layer bugs :-)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#8 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#7)
Re: emergency outage requiring database restart

On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

Ah, so you're subject not only to hardware flaws but also to
virtualization layer bugs :-)

Wait a couple more years, and we'll get complaints about
Postgres running in containers running in VMs. Even more fun ahead.
--
Michael


#9 Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Michael Paquier (#8)
Re: emergency outage requiring database restart

On 18/10/16 14:12, Michael Paquier wrote:

On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

Ah, so you're subject not only to hardware flaws but also to
virtualization layer bugs :-)

Wait a couple more years, and we'll get complaints about
Postgres running in containers running in VMs. Even more fun ahead.

that started life on different hardware with a different O/S


#10 Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#5)
Re: emergency outage requiring database restart

On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

merlin


#11 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#10)
Re: emergency outage requiring database restart

On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

merlin


#12 Bruce Momjian
bruce@momjian.us
In reply to: Merlin Moncure (#11)
Re: emergency outage requiring database restart

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


#13 Merlin Moncure
mmoncure@gmail.com
In reply to: Bruce Momjian (#12)
Re: emergency outage requiring database restart

On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the underlying
infrastructure that I can't verify that in any real way. In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication). Low disk space also does not explain the bizarre
outage I had last Friday.

merlin


#14 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#13)
Re: emergency outage requiring database restart

On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the underlying
infrastructure that I can't verify that in any real way. In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication). Low disk space also does not explain the bizarre
outage I had last Friday.

ok, data corruption struck again. This time disk space is ruled out,
and access to the database is completely denied:
postgres=# \c castaging
WARNING: leaking still-referenced relcache entry for
"pg_index_indexrelid_index"

merlin


#15 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#14)
Re: emergency outage requiring database restart

On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the underlying
infrastructure that I can't verify that in any real way. In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication). Low disk space also does not explain the bizarre
outage I had last Friday.

ok, data corruption struck again. This time disk space is ruled out,
and access to the database is completely denied:
postgres=# \c castaging
WARNING: leaking still-referenced relcache entry for
"pg_index_indexrelid_index"

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG: 00000: could not change directory to "/root": Permission denied
LOCATION: resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x0000000000797d6f in ?? ()
#1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2 0x00000000007b35c5 in InitPostgres ()
#3 0x00000000006b9b53 in PostgresMain ()
#4 0x00000000005f30fb in main ()
(gdb)

merlin


#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#15)
Re: emergency outage requiring database restart

Merlin Moncure <mmoncure@gmail.com> writes:

single user mode dumps core :(

You've got a mess there :-(

Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64

This backtrace would likely be much more informative if you did the above.

regards, tom lane


#17 Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#16)
Re: emergency outage requiring database restart

On Thu, Oct 20, 2016 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

single user mode dumps core :(

You've got a mess there :-(

Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64

This backtrace would likely be much more informative if you did the above.

can't; don't have the package unfortunately.

merlin


#18 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Merlin Moncure (#15)
Re: emergency outage requiring database restart

Merlin Moncure wrote:

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG: 00000: could not change directory to "/root": Permission denied
LOCATION: resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x0000000000797d6f in ?? ()
#1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2 0x00000000007b35c5 in InitPostgres ()
#3 0x00000000006b9b53 in PostgresMain ()
#4 0x00000000005f30fb in main ()

Maybe
rm global/pg_internal.init
and try again?
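
There's also a per-database copy of that relcache init file under
base/<dboid>/, so clearing both is safest. A sketch, demonstrated here
against a throwaway directory standing in for the (stopped!) cluster:

```shell
# Throwaway stand-in data directory; point PGDATA at the real,
# stopped cluster instead.
PGDATA=/tmp/fake_pgdata
mkdir -p "$PGDATA/global" "$PGDATA/base/16384"
touch "$PGDATA/global/pg_internal.init" "$PGDATA/base/16384/pg_internal.init"

# Remove the shared and per-database relcache init files;
# they are rebuilt automatically at the next startup.
rm -f "$PGDATA/global/pg_internal.init"
find "$PGDATA/base" -name 'pg_internal.init' -delete
find "$PGDATA" -name 'pg_internal.init' | wc -l   # prints 0 once both are gone
```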

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#19 Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#18)
Re: emergency outage requiring database restart

On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG: 00000: could not change directory to "/root": Permission denied
LOCATION: resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x0000000000797d6f in ?? ()
#1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2 0x00000000007b35c5 in InitPostgres ()
#3 0x00000000006b9b53 in PostgresMain ()
#4 0x00000000005f30fb in main ()

Maybe
rm global/pg_internal.init
and try again?

Will do when I can; I had to do an emergency restore + some unfun
data reconstruction from the query log.

Notably there is a much larger database in the same cluster which is
undamaged. This server is new to production usage, maybe 2 months.

Here are the contents of pg_extension:
plpgsql
dblink
hstore
postgres_fdw
plsh * not used
pg_trgm * not used
plr * not used
tablefunc * not used
adminpack * not used
plpythonu * not used
postgis * not used
postgis_topology * not used

Short term plan is to separate the database into its own cluster and
install replication and checksums. All queries to this database are
logged. Here are the contents of the log leading into and after the
crash:

oct 17 crash:
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL: parameters: $1 =
'21121', $2 = '8', $3 = '2016-10-13', $4 = NULL, $5 = NULL, $6 = NULL,
$7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = 't', $12
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute <unnamed>:
SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname
AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute <unnamed>:
SELECT n.nspname,p.proname,p.prorettype,p.proargtypes,
t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes
, p.o
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute <unnamed>:
select * from checkin($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
$12) as result
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL: parameters: $1 =
'114333', $2 = 'rrosillo', $3 = 'CALLER', $4 = 'Survey', $5 = 'Happy',
$6 = 'Callback', $7 = 'OTHER', $8 = '2016-10-18 01:05:00',
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute S_3: COMMIT
2016-10-17 12:12:25 CDT [@]: ERROR: could not open relation with OID
1203933 <-- first sign of damage
2016-10-17 12:12:25 CDT [@]: CONTEXT: automatic analyze of table
"castaging.public.apartment"

oct 20 crash:
2016-10-20 12:46:38 CDT [postgres@castaging]: LOG: statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:40 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG: statement: SELECT
MarketOverviewJSON() AS market
2016-10-20 12:46:42 CDT [postgres@castaging]: LOG: execute <unnamed>:
SELECT SubMarketOverviewJSON($1::TEXT) AS submkt
2016-10-20 12:46:42 CDT [postgres@castaging]: DETAIL: parameters: $1 = '640'
2016-10-20 12:46:44 CDT [postgres@castaging]: LOG: statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:46 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:47 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: ERROR:
"pg_description_o_c_o_index" is an index <-- first sign of damage
2016-10-20 12:46:47 CDT [postgres@castaging]: CONTEXT: SQL function
"callcenterusertrackingjson" during startup
2016-10-20 12:46:47 CDT [postgres@castaging]: STATEMENT: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: WARNING: leaking
still-referenced relcache entry for "pg_class_oid_index"

CallCenterUserTrackingJSON() and friends are not particularly
interesting except that they are making use of json_agg(). They
were also called basically all day long at 5 second intervals. I
guess this isn't saying very much, but I'm starting to smell a rat
here.

merlin


#20 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Merlin Moncure (#10)
Re: emergency outage requiring database restart

On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

Most or all the damage seemed to be to the system catalogs with
at least two critical tables dropped or inaccessible in some
fashion. A lot of the OIDs seemed to be pointing at the wrong
thing.

While the oid in pg_class often matches the filename, that is not
true after some operations (like CLUSTER or VACUUM FULL). It is
the relfilenode column that is the definitive link to the file.
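
For reference, pg_relation_filepath() does that mapping directly, e.g.:

```sql
-- relfilenode, not oid, names the file on disk; pg_relation_filepath()
-- returns the path relative to the data directory.
SELECT oid, relfilenode, pg_relation_filepath(oid)
FROM pg_class
WHERE relname = 'pg_class';
```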

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

