emergency outage requiring database restart

Started by Merlin Moncure over 9 years ago (76 messages, pgsql-hackers)
#1 Merlin Moncure
mmoncure@gmail.com

Today I had an emergency production outage on a server. This
particular server was running 9.5.2. The symptoms were interesting
so I thought I'd report. Here is what I saw:

*) User CPU was pegged 100%
*) Queries reading data would block and not respond to cancel or terminate
*) pg_stat_activity reported no waiting queries (but worked fine otherwise).

Adding all this up it smells like processes were getting stuck on a spinlock.

Connections quickly got eaten up and the situation was desperately urgent,
so I punted and did an immediate restart, and things came back
normally. I had a console to the database and did manage to grab
contents of pg_stat_activity and noticed several trivial queries were
running normally (according to pg_stat_activity) but were otherwise
stuck. Attempting to run one of them myself, I noted query got stuck
and did not cancel. I was in a terrible rush but am casting around
for stuff to grab out in case that happens again -- 'perf top' would
be a natural choice I guess.

Three autovacuum processes were running. I'm obviously going to do a
bugfix upgrade, but I was wondering if anybody has seen anything like this.
This particular server was upgraded to 9.5 somewhat recently but ran
on 9.2 for years with no issues.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#1)
Re: emergency outage requiring database restart

Merlin Moncure <mmoncure@gmail.com> writes:

Today I had an emergency production outage on a server.
...
Adding all this up it smells like processes were getting stuck on a spinlock.

Maybe. If it happens again, probably the most useful debug data would
be stack traces from some of the busy processes.
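
In practice, collecting those can look something like this (a sketch; it
assumes gdb is installed along with the server debuginfo package, and the
output path is arbitrary):

```shell
# Collect backtraces from the five busiest postgres backends.
# Run as a user allowed to ptrace them (root or the postgres user).
out=/tmp/pg_backtraces.txt
: > "$out"
for pid in $(ps -eo pid,pcpu,comm --sort=-pcpu | awk '$3 == "postgres" {print $1}' | head -5); do
    {
        echo "=== backtrace for pid $pid ==="
        gdb -p "$pid" -batch -ex 'bt' 2>/dev/null
    } >> "$out"
done
echo "collected $(grep -c '^=== ' "$out") backtraces in $out"
```

'perf top' narrows down which function is spinning; the gdb backtraces show
how the backends got there.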

regards, tom lane


#3 Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#2)
Re: emergency outage requiring database restart

On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

Today I had an emergency production outage on a server.
...
Adding all this up it smells like processes were getting stuck on a spinlock.

Maybe. If it happens again, probably the most useful debug data would
be stack traces from some of the busy processes.

Another odd datapoint on this server. Things were running pretty good
but an application crashed on a missing view. Trying to recreate the
view, I got:

CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
SELECT ...

ERROR: type "vw_apartmentqueuelastgood" already exists
HINT: A relation has an associated type of the same name, so you must
use a name that doesn't conflict with any existing type.

...which was pretty strange. I had to manually delete the pg_type
record in order to create the view. I'm getting more reports of
'could not open relation with oid=X' errors so I could be facing data
corruption :(.
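
For reference, a catalog query along these lines can spot the orphaned type
(a sketch; a view's composite type should always have a pg_class row
matching its typrelid):

```sql
-- Find pg_type rows for the view whose backing pg_class entry is gone.
SELECT t.oid, t.typname, t.typrelid
FROM pg_type t
WHERE t.typname = 'vw_apartmentqueuelastgood'
  AND NOT EXISTS (SELECT 1 FROM pg_class c WHERE c.oid = t.typrelid);
```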

merlin


#4 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#3)
Re: emergency outage requiring database restart

On Mon, Oct 17, 2016 at 1:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

Today I had an emergency production outage on a server.
...
Adding all this up it smells like processes were getting stuck on a spinlock.

Maybe. If it happens again, probably the most useful debug data would
be stack traces from some of the busy processes.

Another odd datapoint on this server. Things were running pretty good
but an application crashed on a missing view. Trying to recreate the
view, I got:

CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
SELECT ...

ERROR: type "vw_apartmentqueuelastgood" already exists
HINT: A relation has an associated type of the same name, so you must
use a name that doesn't conflict with any existing type.

...which was pretty strange. I had to manually delete the pg_type
record in order to create the view. I'm getting more reports of
'could not open relation with oid=X' errors so I could be facing data
corruption :(.

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

merlin


#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Merlin Moncure (#4)
Re: emergency outage requiring database restart

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#6 Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#5)
Re: emergency outage requiring database restart

On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

Agreed. Current plan is to restore from backups and recover as much
data as I can. I'm also applying the bugfix release and going to enable
checksums.
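
Note that on 9.5 checksums can only be enabled by initdb, so that step means
loading the restored dump into a freshly initialized cluster
(initdb --data-checksums). Whether a cluster already has them can be checked
like this (a sketch; the PGDATA path is a guess, adjust as needed):

```shell
# Report whether data checksums are enabled for a cluster.
PGDATA=${PGDATA:-/var/lib/pgsql/9.5/data}
if command -v pg_controldata >/dev/null 2>&1; then
    # "Data page checksum version: 0" means checksums are off.
    status=$(pg_controldata "$PGDATA" 2>/dev/null | grep -i 'checksum' || true)
    echo "${status:-no checksum info (is PGDATA correct?)}"
else
    echo "pg_controldata not on PATH"
fi > /tmp/checksum_status.txt
cat /tmp/checksum_status.txt
```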

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

merlin


#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Merlin Moncure (#6)
Re: emergency outage requiring database restart

Merlin Moncure wrote:

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

Ah, so you're subject not only to hardware flaws but also to
virtualization layer bugs :-)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#8 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#7)
Re: emergency outage requiring database restart

On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

Ah, so you're subject not only to hardware flaws but also to
virtualization layer bugs :-)

Wait a couple more years, and we'll get complaints about
Postgres running in containers running in VMs. Even more fun ahead.
--
Michael


#9 Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Michael Paquier (#8)
Re: emergency outage requiring database restart

On 18/10/16 14:12, Michael Paquier wrote:

On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

We had several good backups since the previous outage so it's not
clear the events are related but after months of smooth operation I
find that coincidence highly suspicious. As always, we need to suspect
hardware problems but I'm highly abstracted from them -- using esx +
san.

Ah, so you're subject not only to hardware flaws but also to
virtualization layer bugs :-)

Wait a couple more years, and we'll get complaints about
Postgres running in containers running in VMs. Even more fun ahead.

that started life on different hardware with a different O/S


#10 Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#5)
Re: emergency outage requiring database restart

On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

merlin


#11 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#10)
Re: emergency outage requiring database restart

On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-# SELECT ...
ERROR: 42809: "pg_cast_oid_index" is an index
LINE 11: FROM ApartmentSample s
^
LOCATION: heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You
can try to fix things manually as they emerge, but that sounds like a
fool's errand.

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

merlin


#12 Bruce Momjian
bruce@momjian.us
In reply to: Merlin Moncure (#11)
Re: emergency outage requiring database restart

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


#13 Merlin Moncure
mmoncure@gmail.com
In reply to: Bruce Momjian (#12)
Re: emergency outage requiring database restart

On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the underlying
infrastructure that I can't verify that in any real way. In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication). Low disk space also does not explain the bizarre
outage I had last Friday.

merlin


#14 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#13)
Re: emergency outage requiring database restart

On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the underlying
infrastructure that I can't verify that in any real way. In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication). Low disk space also does not explain the bizarre
outage I had last Friday.

ok, data corruption struck again. This time disk space is ruled out,
and access to the database is completely denied:
postgres=# \c castaging
WARNING: leaking still-referenced relcache entry for
"pg_index_indexrelid_index"

merlin


#15 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#14)
Re: emergency outage requiring database restart

On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:

Yeah. Believe me -- I know the drill. Most or all the damage seemed
to be to the system catalogs with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. Couple more datapoints here.

*) This database is OLTP, doing ~ 20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However
it's more olap style and may not have been written to during the
outage

Now, this infrastructure running this system is running maybe 100ish
postgres clusters and maybe 1000ish sql server instances with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the
underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the underlying
infrastructure that I can't verify that in any real way. In the meantime
I'm going to take standard precautions (enable checksums/dedicated
volume/replication). Low disk space also does not explain the bizarre
outage I had last Friday.

ok, data corruption struck again. This time disk space is ruled out,
and access to the database is completely denied:
postgres=# \c castaging
WARNING: leaking still-referenced relcache entry for
"pg_index_indexrelid_index"

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG: 00000: could not change directory to "/root": Permission denied
LOCATION: resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x0000000000797d6f in ?? ()
#1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2 0x00000000007b35c5 in InitPostgres ()
#3 0x00000000006b9b53 in PostgresMain ()
#4 0x00000000005f30fb in main ()
(gdb)

merlin


#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#15)
Re: emergency outage requiring database restart

Merlin Moncure <mmoncure@gmail.com> writes:

single user mode dumps core :(

You've got a mess there :-(

Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64

This backtrace would likely be much more informative if you did the above.

regards, tom lane


#17 Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#16)
Re: emergency outage requiring database restart

On Thu, Oct 20, 2016 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

single user mode dumps core :(

You've got a mess there :-(

Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64

This backtrace would likely be much more informative if you did the above.

can't; don't have the package unfortunately.

merlin


#18 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Merlin Moncure (#15)
Re: emergency outage requiring database restart

Merlin Moncure wrote:

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG: 00000: could not change directory to "/root": Permission denied
LOCATION: resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x0000000000797d6f in ?? ()
#1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2 0x00000000007b35c5 in InitPostgres ()
#3 0x00000000006b9b53 in PostgresMain ()
#4 0x00000000005f30fb in main ()

Maybe
rm global/pg_internal.init
and try again?
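
There's also a per-database copy of that relcache init file under
base/<dboid>/, so clearing both is safest. A sketch, demonstrated here
against a throwaway directory standing in for the (stopped!) cluster:

```shell
# Throwaway stand-in data directory; point PGDATA at the real,
# stopped cluster instead.
PGDATA=/tmp/fake_pgdata
mkdir -p "$PGDATA/global" "$PGDATA/base/16384"
touch "$PGDATA/global/pg_internal.init" "$PGDATA/base/16384/pg_internal.init"

# Remove the shared and per-database relcache init files;
# they are rebuilt automatically at the next startup.
rm -f "$PGDATA/global/pg_internal.init"
find "$PGDATA/base" -name 'pg_internal.init' -delete
find "$PGDATA" -name 'pg_internal.init' | wc -l   # prints 0 once both are gone
```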

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#19 Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#18)
Re: emergency outage requiring database restart

On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG: 00000: could not change directory to "/root": Permission denied
LOCATION: resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x0000000000797d6f in ?? ()
#1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2 0x00000000007b35c5 in InitPostgres ()
#3 0x00000000006b9b53 in PostgresMain ()
#4 0x00000000005f30fb in main ()

Maybe
rm global/pg_internal.init
and try again?

Will do when I can; I had to do an emergency restore + some unfun
data reconstruction from the query log.

Notably there is a much larger database in the same cluster which is
undamaged. This server is new to production usage, maybe 2 months.

Here are the contents of pg_extension:
plpgsql
dblink
hstore
postgres_fdw
plsh * not used
pg_trgm * not used
plr * not used
tablefunc * not used
adminpack * not used
plpythonu * not used
postgis * not used
postgis_topology * not used

Short term plan is to separate the database into its own cluster and
install replication and checksums. All queries to this database are
logged. Here are the contents of the log leading into and after the
crash:

oct 17 crash:
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL: parameters: $1 =
'21121', $2 = '8', $3 = '2016-10-13', $4 = NULL, $5 = NULL, $6 = NULL,
$7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = 't', $12
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute <unnamed>:
SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname
AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute <unnamed>:
SELECT n.nspname,p.proname,p.prorettype,p.proargtypes,
t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes
, p.o
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute <unnamed>:
select * from checkin($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
$12) as result
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL: parameters: $1 =
'114333', $2 = 'rrosillo', $3 = 'CALLER', $4 = 'Survey', $5 = 'Happy',
$6 = 'Callback', $7 = 'OTHER', $8 = '2016-10-18 01:05:00',
2016-10-17 12:12:24 CDT [rms@castaging]: LOG: execute S_3: COMMIT
2016-10-17 12:12:25 CDT [@]: ERROR: could not open relation with OID
1203933 <-- first sign of damage
2016-10-17 12:12:25 CDT [@]: CONTEXT: automatic analyze of table
"castaging.public.apartment"

oct 20 crash:
2016-10-20 12:46:38 CDT [postgres@castaging]: LOG: statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:40 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG: statement: SELECT
MarketOverviewJSON() AS market
2016-10-20 12:46:42 CDT [postgres@castaging]: LOG: execute <unnamed>:
SELECT SubMarketOverviewJSON($1::TEXT) AS submkt
2016-10-20 12:46:42 CDT [postgres@castaging]: DETAIL: parameters: $1 = '640'
2016-10-20 12:46:44 CDT [postgres@castaging]: LOG: statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:46 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:47 CDT [postgres@castaging]: LOG: statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: ERROR:
"pg_description_o_c_o_index" is an index <-- first sign of damage
2016-10-20 12:46:47 CDT [postgres@castaging]: CONTEXT: SQL function
"callcenterusertrackingjson" during startup
2016-10-20 12:46:47 CDT [postgres@castaging]: STATEMENT: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: WARNING: leaking
still-referenced relcache entry for "pg_class_oid_index"

CallCenterUserTrackingJSON() and friends are not particularly
interesting except that they are making use of json_agg(). They
were also called basically all day long at 5 second intervals. I
guess this isn't saying very much, but I'm starting to smell a rat
here.

merlin


#20 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Merlin Moncure (#10)
Re: emergency outage requiring database restart

On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

Most or all the damage seemed to be to the system catalogs with
at least two critical tables dropped or inaccessible in some
fashion. A lot of the OIDs seemed to be pointing at the wrong
thing.

While the oid in pg_class often matches the filename, that is not
true after some operations (like CLUSTER or VACUUM FULL). It is
the relfilenode column that is the definitive link to the file.
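
For reference, pg_relation_filepath() does that mapping directly, e.g.:

```sql
-- relfilenode, not oid, names the file on disk; pg_relation_filepath()
-- returns the path relative to the data directory.
SELECT oid, relfilenode, pg_relation_filepath(oid)
FROM pg_class
WHERE relname = 'pg_class';
```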

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

