CORRUPTION on TOAST table

Started by Soni Mabout 10 years ago7 messagesgeneral
Jump to latest
#1Soni M
diptatapa@gmail.com

Hello Everyone,

We face TOAST table corruption.

One master and two streaming replicas. The corruption happen only on both
streaming replicas.

We did found the corrupted rows. Selecting on this row, return (on both
replica) : unexpected chunk number 0 (expected 1) for toast value
1100613112 in pg_toast_112517
selecting this row on master does not return corruption error, but return
correct result instead.

Previously, dump on a replica return : unexpected chunk number 0 (expected
1) for toast value 3234098599 in pg_toast_112517 (please note the toast
value is different)

This table size is 343 GB, contain around 206,179,697 live tuples. We found
that the corruption happen on the biggest column (this column and its pkey
sized around 299 GB total).

replica1 :
ESX 5.5, VM Version 8
Intel(R) Xeon(R) CPU E5649 @ 2.53GHz
8GB RAM
Storage – Raw Disk Mapping in ESX from 3PAR 7400 SAN using Fast Class (10k)
disk
Each volume (single disk as presented by SAN) on the VMs is its own LVM
volume.

replica2 :
ESX 5.5, VM Version 8
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
8GB RAM
Raw Disk Mapping in ESX from 3PAR 7400 SAN using Fast Clkass (10k) disk
Each volume (single disk as presented by SAN) on the VMs is its own LVM
volume.

on both replica :
fsync NEVER turned off.
none unexpected power loss nor OS crash.

How can the corruption occurs ? and how can I resolve them ?

Thank so much for the help.

Cheers \o/

--
Regards,

Soni Maula Harriz

#2Joshua D. Drake
jd@commandprompt.com
In reply to: Soni M (#1)
Re: CORRUPTION on TOAST table

What version of PostgreSQL and which OS?

On 04/02/2016 08:38 PM, Soni M wrote:

How can the corruption occurs ? and how can I resolve them ?

Thank so much for the help.

Cheers \o/

--
Regards,

Soni Maula Harriz

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Soni M
diptatapa@gmail.com
In reply to: Joshua D. Drake (#2)
Re: CORRUPTION on TOAST table

Sorry, miss that info. Master 9.1.13, replica1 9.1.13, replica2 9.1.19.
Master Red Hat Enterprise Linux Server release 6.5 (Santiago),
replica1 Red Hat Enterprise Linux Server release 6.5 (Santiago),
replica2 Red Hat Enterprise Linux Server release 6.7 (Santiago).

On Sun, Apr 3, 2016 at 10:43 AM, Joshua D. Drake <jd@commandprompt.com>
wrote:

What version of PostgreSQL and which OS?

On 04/02/2016 08:38 PM, Soni M wrote:

How can the corruption occurs ? and how can I resolve them ?

Thank so much for the help.

Cheers \o/

--
Regards,

Soni Maula Harriz

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.

--
Regards,

Soni Maula Harriz

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Soni M (#1)
Re: CORRUPTION on TOAST table

On 04/02/2016 08:38 PM, Soni M wrote:

Hello Everyone,

We face TOAST table corruption.

One master and two streaming replicas. The corruption happen only on
both streaming replicas.

We did found the corrupted rows. Selecting on this row, return (on both
replica) : unexpected chunk number 0 (expected 1) for toast value
1100613112 in pg_toast_112517
selecting this row on master does not return corruption error, but
return correct result instead.

Previously, dump on a replica return : unexpected chunk number 0
(expected 1) for toast value 3234098599 in pg_toast_112517 (please note
the toast value is different)

This table size is 343 GB, contain around 206,179,697 live tuples. We
found that the corruption happen on the biggest column (this column and
its pkey sized around 299 GB total).

replica1 :
ESX 5.5, VM Version 8
Intel(R) Xeon(R) CPU E5649 @ 2.53GHz
8GB RAM
Storage – Raw Disk Mapping in ESX from 3PAR 7400 SAN using Fast Class
(10k) disk
Each volume (single disk as presented by SAN) on the VMs is its own LVM
volume.

replica2 :
ESX 5.5, VM Version 8
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
8GB RAM
Raw Disk Mapping in ESX from 3PAR 7400 SAN using Fast Clkass (10k) disk
Each volume (single disk as presented by SAN) on the VMs is its own LVM
volume.

So where is the master data located, on the SAN or somewhere different?

To be clear about above, each replica is it own VM with its own virtual
disk/volume as served up from the same SAN, correct?

Can you elaborate more on what is actually taking place with the raw
disk mapping?

on both replica :
fsync NEVER turned off.
none unexpected power loss nor OS crash.

How can the corruption occurs ? and how can I resolve them ?

Thank so much for the help.

Cheers \o/

--
Regards,

Soni Maula Harriz

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Soni M (#1)
Re: CORRUPTION on TOAST table

On 04/02/2016 08:38 PM, Soni M wrote:

Hello Everyone,

We face TOAST table corruption.

One master and two streaming replicas. The corruption happen only on
both streaming replicas.

We did found the corrupted rows. Selecting on this row, return (on both
replica) : unexpected chunk number 0 (expected 1) for toast value
1100613112 in pg_toast_112517
selecting this row on master does not return corruption error, but
return correct result instead.

Previously, dump on a replica return : unexpected chunk number 0
(expected 1) for toast value 3234098599 in pg_toast_112517 (please note
the toast value is different)

This table size is 343 GB, contain around 206,179,697 live tuples. We
found that the corruption happen on the biggest column (this column and
its pkey sized around 299 GB total).

on both replica :
fsync NEVER turned off.
none unexpected power loss nor OS crash.

How can the corruption occurs ? and how can I resolve them ?

Meant to add to previous post.

Do you see anything in the replica Postgres logs that indicate a problem
with the replication process?

Or any other unexpected messages prior to the point you did the select
on the replica(s)?

Thank so much for the help.

Cheers \o/

--
Regards,

Soni Maula Harriz

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Soni M
diptatapa@gmail.com
In reply to: Adrian Klaver (#5)
Re: CORRUPTION on TOAST table

Hello Adrian, thanks for the response.

master data also located on SAN

Yes, each replica is it own VM with its own virtual disk/volume as served
up from the same SAN

Raw disk mappings are a way for ESX to present a SAN volume directly to a
VM instead of creating a virtual disk.

no unexpected messages detected.

On Sun, Apr 3, 2016 at 11:23 PM, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 04/02/2016 08:38 PM, Soni M wrote:

Hello Everyone,

We face TOAST table corruption.

One master and two streaming replicas. The corruption happen only on
both streaming replicas.

We did found the corrupted rows. Selecting on this row, return (on both
replica) : unexpected chunk number 0 (expected 1) for toast value
1100613112 in pg_toast_112517
selecting this row on master does not return corruption error, but
return correct result instead.

Previously, dump on a replica return : unexpected chunk number 0
(expected 1) for toast value 3234098599 in pg_toast_112517 (please note
the toast value is different)

This table size is 343 GB, contain around 206,179,697 live tuples. We
found that the corruption happen on the biggest column (this column and
its pkey sized around 299 GB total).

on both replica :

fsync NEVER turned off.
none unexpected power loss nor OS crash.

How can the corruption occurs ? and how can I resolve them ?

Meant to add to previous post.

Do you see anything in the replica Postgres logs that indicate a problem
with the replication process?

Or any other unexpected messages prior to the point you did the select on
the replica(s)?

Thank so much for the help.

Cheers \o/

--
Regards,

Soni Maula Harriz

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Regards,

Soni Maula Harriz

#7Soni M
diptatapa@gmail.com
In reply to: Soni M (#6)
Re: CORRUPTION on TOAST table

It seems that it was the Postgres bug on replica, after upgrading minor
version to 9.1.21 on replica1, the corruption goes away.

Thanks everyone for the help

On Tue, Apr 5, 2016 at 1:32 AM, Soni M <diptatapa@gmail.com> wrote:

Hello Adrian, thanks for the response.

master data also located on SAN

Yes, each replica is it own VM with its own virtual disk/volume as served
up from the same SAN

Raw disk mappings are a way for ESX to present a SAN volume directly to a
VM instead of creating a virtual disk.

no unexpected messages detected.

On Sun, Apr 3, 2016 at 11:23 PM, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 04/02/2016 08:38 PM, Soni M wrote:

Hello Everyone,

We face TOAST table corruption.

One master and two streaming replicas. The corruption happen only on
both streaming replicas.

We did found the corrupted rows. Selecting on this row, return (on both
replica) : unexpected chunk number 0 (expected 1) for toast value
1100613112 in pg_toast_112517
selecting this row on master does not return corruption error, but
return correct result instead.

Previously, dump on a replica return : unexpected chunk number 0
(expected 1) for toast value 3234098599 in pg_toast_112517 (please note
the toast value is different)

This table size is 343 GB, contain around 206,179,697 live tuples. We
found that the corruption happen on the biggest column (this column and
its pkey sized around 299 GB total).

on both replica :

fsync NEVER turned off.
none unexpected power loss nor OS crash.

How can the corruption occurs ? and how can I resolve them ?

Meant to add to previous post.

Do you see anything in the replica Postgres logs that indicate a problem
with the replication process?

Or any other unexpected messages prior to the point you did the select on
the replica(s)?

Thank so much for the help.

Cheers \o/

--
Regards,

Soni Maula Harriz

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Regards,

Soni Maula Harriz

--
Regards,

Soni Maula Harriz