postgres 9.5 DB corruption
Hello postgres community,
Writing again to see if there are insights on this issue. We have had infrequent but recurring corruption since upgrading from Postgres 9.1 to Postgres 9.5; we are presently on 9.5.16. Our DB-facing app continually performs a mixture of DML, primarily inserts and updates on two specific tables, with no single operation being suspect.
In the past, corruption events have produced encoding errors on COPY operations (invalid byte sequence for encoding "UTF8"). More recently, they have caused segmentation faults. We were able to take a cold backup after a recent event. SELECTing the corrupted data on our cold backup yields the following stack. Any info on a solution, or on how to proceed towards one, would be much appreciated.
Thanks in advance.
(gdb) where
#0 pglz_decompress (source=source@entry=0xa617904 "0", slen=8139, dest=dest@entry=0x4268e028 "", rawsize=808452096) at pg_lzcompress.c:745
#1 0x080f3079 in toast_decompress_datum (attr=0xa6178fc) at tuptoaster.c:2210
#2 0x080f3716 in heap_tuple_untoast_attr (attr=0xa6178fc) at tuptoaster.c:183
#3 0x08440955 in pg_detoast_datum_packed (datum=<optimized out>) at fmgr.c:2270
#4 0x084145bf in text_to_cstring (t=0x7592fd2a) at varlena.c:176
#5 0x0843e874 in FunctionCall1Coll (flinfo=flinfo@entry=0xa614738, collation=collation@entry=0, arg1=arg1@entry=1972567338) at fmgr.c:1297
#6 0x0843fef8 in OutputFunctionCall (flinfo=0xa614738, val=1972567338) at fmgr.c:1950
#7 0x080bf84b in printtup (slot=0xa613bf4, self=0xa60d714) at printtup.c:359
#8 0x08220f9a in ExecutePlan (dest=0xa60d714, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, planstate=0xa613974, estate=0xa6138ec) at execMain.c:1574
#9 standard_ExecutorRun (queryDesc=0xa6134e4, direction=ForwardScanDirection, count=0) at execMain.c:337
#10 0x08332c1b in PortalRunSelect (portal=portal@entry=0xa6114dc, forward=forward@entry=1 '\001', count=0, count@entry=2147483647, dest=dest@entry=0xa60d714) at pquery.c:942
#11 0x08333fa7 in PortalRun (portal=portal@entry=0xa6114dc, count=count@entry=2147483647, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0xa60d714, altdest=altdest@entry=0xa60d714, completionTag=completionTag@entry=0xffd5d71c "")
at pquery.c:786
#12 0x08330ba8 in exec_simple_query (query_string=0xa5f1754 "select * from ams.alert_attribute_bak;") at postgres.c:1096
#13 PostgresMain (argc=1, argv=0xa53dbbc, dbname=0xa53daec "ams", username=0xa53dadc "akamai") at postgres.c:4049
#14 0x080b53af in BackendRun (port=0xa584b78) at postmaster.c:4312
#15 BackendStartup (port=0xa584b78) at postmaster.c:3986
#16 ServerLoop () at postmaster.c:1705
#17 0x082d0dd7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xa53d2a8) at postmaster.c:1313
#18 0x080b68eb in main (argc=3, argv=0xa53d2a8) at main.c:228
(gdb)
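For other readers hitting a similar trace: the rawsize=808452096 in frame #0 is an implausibly large decompressed size for this data, which suggests the TOAST compression header itself has been clobbered rather than just the payload. One common triage step (a sketch only, not something proposed in this thread; the table name is taken from the query in the trace) is to walk the table by ctid and force detoasting one row at a time, so damaged tuples can be identified and later excluded:

-- Sketch: locate tuples whose TOAST data fails to decompress.
-- Forcing length() on the text column detoasts it; corrupt data
-- that raises an ERROR (e.g. "compressed data is corrupted") is
-- caught and reported with its ctid.
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT ctid FROM ams.alert_attribute_bak LOOP
        BEGIN
            PERFORM length(t.value)
            FROM ams.alert_attribute_bak t
            WHERE t.ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'damaged tuple at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
END $$;

Caveat: if the damage segfaults the backend (as reported here) rather than raising an ERROR, the loop dies at the bad tuple instead of catching it; that still localizes the damage, since the last ctid reached before the crash points at the bad row.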
Tom :-)
On 7/24/19 7:38 AM, Thomas Tignor wrote:
More information would be useful:
1) Schema of the tables.
2) Source of the data.
--
Adrian Klaver
adrian.klaver@aklaver.com
On 7/24/19 7:38 AM, Thomas Tignor wrote:
In my previous post, when I referred to the table schema I meant that to include associated schema objects such as triggers, constraints, etc. Basically, what is returned by \d in psql.
--
Adrian Klaver
adrian.klaver@aklaver.com
Hi Adrian,
Thanks for responding. Below is the schema for the tables where we always see corruption. You'll notice they have triggers for a Postgres extension called Slony-I, which provides replication. It's not clear if or how that's a factor, though.
ams=# \d ams.alert_instance
Table "ams.alert_instance"
Column | Type | Modifiers
---------------------+--------------------------------+-----------
alert_instance_id | integer | not null
alert_definition_id | integer | not null
alert_instance_key | character varying(500) | not null
start_active_date | timestamp(0) without time zone | not null
stop_active_date | timestamp(0) without time zone |
active | smallint | not null
acknowledged | smallint | not null
ack_clear_time | timestamp(0) without time zone |
user_set_clear_time | smallint |
category_id | integer | not null
condition_start | timestamp(0) without time zone | not null
unack_reason | character varying(1) |
viewer_visible | smallint | not null
Indexes:
"pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace "tbls5"
"idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
"idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
"idx_alert_inst_def_id" btree (alert_definition_id), tablespace "tbls5"
Check constraints:
"ck_alert_inst_acked" CHECK (acknowledged = 0 OR acknowledged = 1)
"ck_alert_inst_active" CHECK (active = 0 OR active = 1)
"ck_alert_inst_set_cl_tm" CHECK (user_set_clear_time = 0 OR user_set_clear_time = 1)
"ck_alert_inst_viewer_vis" CHECK (viewer_visible = 0 OR viewer_visible = 1)
Foreign-key constraints:
"fk_alert_inst_cat_id" FOREIGN KEY (category_id) REFERENCES ams.category(category_id)
"fk_alert_inst_def_id" FOREIGN KEY (alert_definition_id) REFERENCES ams.alert_definition(alert_definition_id)
"fk_alert_inst_unack_reason" FOREIGN KEY (unack_reason) REFERENCES ams.unack_reason(unack_reason)
Referenced by:
TABLE "ams.alert_attribute" CONSTRAINT "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
_ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '1', 'k')
_ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('1')
Disabled user triggers:
_ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
_ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()
ams=#
ams=# \d ams.alert_attribute
Table "ams.alert_attribute"
Column | Type | Modifiers
-------------------+-------------------------+-----------
alert_instance_id | integer | not null
name | character varying(200) | not null
data_type | smallint | not null
value | character varying(2000) |
Indexes:
"pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), tablespace "tbls5"
"idx_alert_attr_name" btree (name)
Foreign-key constraints:
"fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
_ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
_ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
Disabled user triggers:
_ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
_ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()
ams=#
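Given this schema, the wide varchar columns (alert_attribute.value and alert_instance.alert_instance_key) are the plausible TOAST-compressed targets matching the text_to_cstring frame in the trace. Once damaged tuples have been identified by ctid, one possible salvage step (a sketch only; the table name and the ctid literal below are placeholders to be substituted) is to copy the readable rows aside before repairing or rebuilding the damaged table:

-- Sketch: copy readable rows to a side table, skipping known-bad ctids.
CREATE TABLE ams.alert_attribute_salvage (LIKE ams.alert_attribute);

INSERT INTO ams.alert_attribute_salvage
SELECT *
FROM ams.alert_attribute
WHERE ctid <> '(0,1)'::tid;  -- substitute the damaged ctid(s) found earlier

The side table deliberately omits the triggers and constraints so the copy is not replicated by Slony or blocked by FK checks while triaging.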
Tom :-)
On Wednesday, July 24, 2019, 11:15:04 AM EDT, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 7/24/19 7:38 AM, Thomas Tignor wrote:
Hello postgres community,
Writing again to see if there are insights on this issue. We have had
infrequent but recurring corruption since upgrading from postgres 9.1 to
postgres 9.5. We are presently on 9.5.16. Our DB-facing app continually
performs a mixture of DML, primarily inserts and updates on two specific
tables, with no single op being suspect. In the past, corruption events
have produced encoding errors on COPY operations (invalid byte sequence
for encoding "UTF8"). More recently, they have caused segmentation
faults. We were able to take a cold backup after a recent event.
SELECTing the corrupted data on our cold backup yields the following
stack. Any info on a solution or how to proceed towards a solution would
be much appreciated.Thanks in advance.
In my previous post when I referred to table schema I mean that to
include associated schema like triggers, constraints, etc. Basically
what is returned by \d in psql.
Tom :-)
--
Adrian Klaver
adrian.klaver@aklaver.com
On 7/25/19 10:24 AM, Thomas Tignor wrote:
What specific version of Slony?
Did you upgrade Slony when you moved from 9.1 to 9.5?
The trace you showed in your first post was for ams.alert_attribute_bak; I do not see that table below.
Are the errors on any specific field?
The errors are occurring on the primary, correct?
Where is the data coming from?
--
Adrian Klaver
adrian.klaver@aklaver.com