Duplicate data despite unique constraint
Hi,
We're running PostgreSQL 9.4.5 on Amazon RDS. One of our tables looks something like this:
CREATE TABLE users
(
...
email character varying(128) NOT NULL,
...
CONSTRAINT users_email_key UNIQUE (email)
)
Despite this we have rows with apparently identical email values. I discovered the problem when I tried to add a column and got the following error:
ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: could not create unique index "users_email_key"
DETAIL: Key (email)=(xxx@yyy.com) is duplicated.
: ALTER TABLE "users" ADD COLUMN "client_branch" character varying(255) DEFAULT 'beta'
Now look at this:
=> select email from users where email = 'xxx@yyy.com';
email
---------------------------
xxx@yyy.com
(1 row)
=> select email from users where email LIKE 'xxx@yyy.com';
email
---------------------------
xxx@yyy.com
xxx@yyy.com
(2 rows)
I have tried to compare the binary data in various ways: email::bytea, md5(email), encode(email::bytea, 'hex'), and char_length(email), and it all looks the same for both rows.
Any suggestions on how I can discover the difference between the values, and how they could have been added without triggering the constraint? I know that the values were added after the constraint was added.
The data was added from a Ruby on Rails app that also has unique constraints on the email field and validation on the email format.
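For reference, the comparisons described above can be combined into one query (a sketch; 'xxx@yyy.com' stands in for the real address, and octet_length is one extra check not in the original list):

```sql
SELECT email,
       char_length(email)          AS chars,   -- length in characters
       octet_length(email)         AS bytes,   -- length in bytes
       md5(email)                  AS md5,
       encode(email::bytea, 'hex') AS hex      -- raw byte contents
FROM users
WHERE email LIKE 'xxx@yyy.com';
```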
/ Jonas
On 09/02/2016 04:32 AM, Jonas Tehler wrote:
select ctid, email from users where email LIKE 'xxx@yyy.com';
https://www.postgresql.org/docs/9.5/static/ddl-system-columns.html
"ctid
The physical location of the row version within its table. Note
that although the ctid can be used to locate the row version very
quickly, a row's ctid will change if it is updated or moved by VACUUM
FULL. Therefore ctid is useless as a long-term row identifier. The OID,
or even better a user-defined serial number, should be used to identify
logical rows.
"
Best guess is the INDEX on the column is corrupted and needs to be
reindexed:
https://www.postgresql.org/docs/9.5/static/sql-reindex.html
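If the index is indeed corrupt, rebuilding it would look something like this. Note two things: REINDEX takes an exclusive lock on the table while it runs, and rebuilding a unique index will fail with the same duplicate-key error until the extra rows have been removed.

```sql
-- Rebuild just the unique index named in the error message:
REINDEX INDEX users_email_key;

-- Or rebuild every index on the table:
REINDEX TABLE users;
```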
--
Adrian Klaver
adrian.klaver@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Jonas Tehler wrote:
We’re running Postgresql 9.4.5 on Amazon RDS. One of our tables looks something like this:
CREATE TABLE users
(
...
email character varying(128) NOT NULL,
...
CONSTRAINT users_email_key UNIQUE (email)
)Despite this we have rows with very similar email values. I discovered the problem when I tried to add
a column and got the following error:ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: could not create unique index
"users_email_key"
DETAIL: Key (email)=(xxx@yyy.com) is duplicated.
: ALTER TABLE "users" ADD COLUMN "client_branch" character varying(255) DEFAULT ‘beta'Now look at this:
=> select email from users where email = 'xxx@yyy.com';
---------------------------
xxx@yyy.com
(1 row)=> select email from users where email LIKE 'xxx@yyy.com';
---------------------------
xxx@yyy.com
xxx@yyy.com
(2 rows)I have tried to compare the binary data in various ways, email::bytes, md5(email),
encode(email::bytea, 'hex’), char_length(email) and it all looks the same for both rows.Any suggestions how I can discover the difference between the values and how they could have been
added without triggering the constraint? I know that the values were added after the constraint was
added.The data was added from a Ruby on Rails app that also has unique constraints on the email field and
validation on the email format.
That looks very much like data corruption.
I guess there is an index on "users" that is used for one query but not the other.
Can you verify with EXPLAIN?
Assuming that it is a 'text_ops' or 'varchar_ops' index, I'd say it gets used for the first
query, but not for the second. That would mean that there is an extra entry in the table that
is not in the index.
Did you have any crashes, standby promotion, restore with PITR or other unusual occurrences recently?
Make sure you have a physical backup; there may be other things corrupted.
This is a possible path to proceed:
Once you have made sure that you have a physical backup, try to add the "ctid" column to both queries.
Then delete the extra row from the second query with "DELETE FROM users WHERE ctid = ...".
Then, to make sure there is no other corruption lurking, make a logical backup
with pg_dumpall, create a new database cluster with "initdb", and restore the data.
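Spelled out, that cleanup might look like this (the ctid value '(0,2)' is a placeholder; use whatever the SELECT actually returns for the row you want to remove):

```sql
-- Locate both physical row versions:
SELECT ctid, email FROM users WHERE email LIKE 'xxx@yyy.com';

-- Remove the extra one by its physical location (placeholder ctid):
DELETE FROM users WHERE ctid = '(0,2)';
```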
Yours,
Laurenz Albe
On 9/2/16 8:02 AM, Adrian Klaver wrote:
Best guess is the INDEX on the column is corrupted and needs to be
reindexed:
You should contact AWS support about this; they'd want to know.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
I had a similar issue once and was able to recover from it. If this affects
only some rows and you are able to identify them, this is fixable:
-- force a table scan so the (possibly corrupt) index is not used
SET enable_seqscan = on;
SET enable_indexscan = off;
SET enable_bitmapscan = off;

SELECT email, count(*)
FROM users
GROUP BY email
HAVING count(*) > 1;
Then, if the rows are simply duplicates with no other differences, add a
new serial column (or, to avoid blocking, add a bigint column and fill it
with sequential values). Then, using the emails found above, delete the rows
with the higher (or lower) sequence number. Make sure you are still skipping
the indexes.
Once the table is clean, drop the sequence column again and re-index.
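A sketch of that procedure (the column name dedupe_id is made up; run it with the enable_* settings above still in effect so the corrupt index is not consulted):

```sql
ALTER TABLE users ADD COLUMN dedupe_id bigserial;

-- For each duplicated email, keep only the row with the
-- lowest dedupe_id and delete the rest:
DELETE FROM users u
USING users keep
WHERE u.email = keep.email
  AND u.dedupe_id > keep.dedupe_id;

ALTER TABLE users DROP COLUMN dedupe_id;
```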
Hope this helps,
Daniel
On Fri, Sep 2, 2016 at 11:06 PM, Albe Laurenz <laurenz.albe@wien.gv.at>
wrote:
Daniel Caldeweyher <dcalde@gmail.com> writes:
FWIW, that's kind of the hard way; you can just use the "ctid" system
column as a guaranteed-unique row identifier, instead of having to add
and remove a column. The latter is very expensive, and if there is any
actual data corruption in the heap it could make things worse.
select ctid, * from users where email = 'duplicated value';
-- choose which row(s) to nuke, then
delete from users where ctid = 'chosen value';
You definitely want to reindex after the data cleanup, since presumably
it's corruption of a unique index that got you into this mess in the
first place. But as long as it's only the index and not the table that's
damaged, recovery is pretty straightforward.
regards, tom lane