Inexplicable duplicate rows with unique constraint

Started by Richard van der Hoffover 6 years ago14 messagesgeneral

richard@matrix.org

over 6 years ago

I'm trying to track down the cause of some duplicate rows in a table
which I would expect to be impossible due to a unique constraint. I'm
hoping that somebody here will be able to suggest something I might have
missed.

The problem relates to a bug filed against our application
(https://github.com/matrix-org/synapse/issues/6696). At first I put this
down to random data corruption on a single user's postgres instance, but
I've now seen three separate reports in as many days and am wondering if
there is more to it.

We have a table whose schema is as follows:

Despite the presence of the
current_state_events_room_id_type_state_key_key constraint, several
users have reported seeing errors which suggest that their tables have
duplicate rows for the same (room_id, type, state_key) triplet and
indeed querying confirms that to be the case:

Further investigation suggests that these are genuinely separate rows
rather than duplicate entries in an index.

The index appears to consider itself valid:

synapse=> select i.* from pg_class c join pg_index i on
i.indexrelid=c.oid where
relname='current_state_events_room_id_type_state_key_key';
indexrelid | indrelid | indnatts | indisunique | indisprimary |
indisexclusion | indimmediate | indisclustered | indisvalid |
indcheckxmin | indisready | indislive | indisreplident | indkey |
indcollation | indclass | indoption | indexprs | indpred
------------+----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+-----------+----------------+--------+--------------+----------------+-----------+----------+---------
17023 | 16456 | 3 | t | f | f
| t | f | t | f | t
| t | f | 2 3 4 | 100 100 100 | 3126
3126 3126 | 0 0 0 | |
(1 row)

So, question: what could we be doing wrong to get ourselves into this
situation?

Some other datapoints which may be relevant:
* this has been reported by one user on postgres 9.6.15 and one on
10.10, though it's hard to be certain of the version that was running
when the duplication occurred
* the constraint is added when the table is first created (before any
data is added)
* At least one user reports that he has recently migrated his database
from one server to another via a `pg_dump -C` and later piping into psql.

Adrian Klaver

adrian.klaver@aklaver.com

over 6 years ago

In reply to: Richard van der Hoff (#1)

Re: Inexplicable duplicate rows with unique constraint

On 1/16/20 8:50 AM, Richard van der Hoff wrote:

I'm trying to track down the cause of some duplicate rows in a table
which I would expect to be impossible due to a unique constraint. I'm
hoping that somebody here will be able to suggest something I might have
missed.

The problem relates to a bug filed against our application
(https://github.com/matrix-org/synapse/issues/6696). At first I put this
down to random data corruption on a single user's postgres instance, but
I've now seen three separate reports in as many days and am wondering if
there is more to it.

We have a table whose schema is as follows:

synapse=# \d current_state_events
Table "public.current_state_events"
   Column   | Type | Modifiers
------------+------+-----------
event_id   | text | not null
room_id    | text | not null
type       | text | not null
state_key | text | not null
membership | text |
Indexes:
    "current_state_events_event_id_key" UNIQUE CONSTRAINT, btree
(event_id)
    "current_state_events_room_id_type_state_key_key" UNIQUE
CONSTRAINT, btree (room_id, type, state_key)
    "current_state_events_member_index" btree (state_key) WHERE type =
'm.room.member'::text

Despite the presence of the
current_state_events_room_id_type_state_key_key constraint, several
users have reported seeing errors which suggest that their tables have
duplicate rows for the same (room_id, type, state_key) triplet and
indeed querying confirms that to be the case:

synapse=> select count(*), room_id, type, state_key from
current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2;
count |              room_id              |     type      |   state_key
-------+-----------------------------------+---------------+-------------------------------------

     3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member |
@irc_ebi_:darkfasel.net
     3 | !HwocBmCtBcHQhILtYQ:matrix.org    | m.room.member |
@freenode_AlmarShenwan_:matrix.org
(2 rows)

I'm assuming the above are obfuscated?

Further investigation suggests that these are genuinely separate rows
rather than duplicate entries in an index.

If you use length() on the values are they the same?

The index appears to consider itself valid:

synapse=> select i.* from pg_class c join pg_index i on
i.indexrelid=c.oid where
relname='current_state_events_room_id_type_state_key_key';
indexrelid | indrelid | indnatts | indisunique | indisprimary |
indisexclusion | indimmediate | indisclustered | indisvalid |
indcheckxmin | indisready | indislive | indisreplident | indkey |
indcollation |    indclass    | indoption | indexprs | indpred
------------+----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+-----------+----------------+--------+--------------+----------------+-----------+----------+---------

      17023 |    16456 |        3 | t           | f            | f
        | t            | f              | t          | f            | t
         | t         | f              | 2 3 4 | 100 100 100 | 3126
3126 3126 | 0 0 0     |          |
(1 row)

So, question: what could we be doing wrong to get ourselves into this
situation?

Some other datapoints which may be relevant:
* this has been reported by one user on postgres 9.6.15 and one on
10.10, though it's hard to be certain of the version that was running
when the duplication occurred
* the constraint is added when the table is first created (before any
data is added)
* At least one user reports that he has recently migrated his database
from one server to another via a `pg_dump -C` and later piping into psql.

--
Adrian Klaver
adrian.klaver@aklaver.com

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Richard van der Hoff (#1)

Re: Inexplicable duplicate rows with unique constraint

Richard van der Hoff <richard@matrix.org> writes:

I'm trying to track down the cause of some duplicate rows in a table
which I would expect to be impossible due to a unique constraint. I'm
hoping that somebody here will be able to suggest something I might have
missed.

Since these are text columns, one possibility you should be looking into
is that the indexes have become corrupt due to a change in the operating
system's sorting rules for the underlying locale. I don't recall details
at the moment, but I do remember that a recent glibc update changed the
sorting rules for some popular locale settings. If an installation had
applied such an update underneath an existing database, you'd have a
situation where existing entries in an index are not in-order according
to the new behavior of the text comparison operators, leading to havoc
because btree searching relies on the entries being correctly sorted.

Unless you happen to notice searches failing to find rows you know are
there, the first visible symptom is often appearance of "impossible"
duplicate rows, after the search to verify uniqueness of a new entry
fails to find the old entry.

* At least one user reports that he has recently migrated his database
from one server to another via a `pg_dump -C` and later piping into psql.

Dump-and-restore wouldn't cause this (and, indeed, is one way to clean up
the mess). But this is suspicious anyway because it suggests there may
have been some general system upgrades going on in the vicinity.

Reindexing all text indexes is the recommended remediation procedure
if you suspect a locale behavior change. There's some work afoot to
make PG notice the need for this automatically, but it's not done yet.

regards, tom lane

Daniel Verite

daniel@manitou-mail.org

over 6 years ago

In reply to: Richard van der Hoff (#1)

Re: Inexplicable duplicate rows with unique constraint

Richard van der Hoff wrote:

So, question: what could we be doing wrong to get ourselves into this
situation?

OS/libc upgrades without reindexing come to mind.
See https://wiki.postgresql.org/wiki/Collations

* At least one user reports that he has recently migrated his database
from one server to another via a `pg_dump -C` and later piping into psql.

This kind of migration recreates indexes (since it recreates everything)
so it's not subject to the same hazard as an OS/libc upgrade without
reindexing. In fact it would detect the problem if it existed,
as the creation of the unique constraint in the new db would fail if the
data in the dump did not satisfy it.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Magnus Hagander

magnus@hagander.net

over 6 years ago

In reply to: Tom Lane (#3)

Re: Inexplicable duplicate rows with unique constraint

On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard van der Hoff <richard@matrix.org> writes:

I'm trying to track down the cause of some duplicate rows in a table
which I would expect to be impossible due to a unique constraint. I'm
hoping that somebody here will be able to suggest something I might have
missed.

Since these are text columns, one possibility you should be looking into
is that the indexes have become corrupt due to a change in the operating
system's sorting rules for the underlying locale. I don't recall details
at the moment, but I do remember that a recent glibc update changed the
sorting rules for some popular locale settings. If an installation had
applied such an update underneath an existing database, you'd have a
situation where existing entries in an index are not in-order according
to the new behavior of the text comparison operators, leading to havoc
because btree searching relies on the entries being correctly sorted.

See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on
which linux distros updated when.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

Richard van der Hoff

richard@matrix.org

over 6 years ago

In reply to: Magnus Hagander (#5)

Re: Inexplicable duplicate rows with unique constraint

On 16/01/2020 17:12, Magnus Hagander wrote:

On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard van der Hoff <richard@matrix.org> writes:

I'm trying to track down the cause of some duplicate rows in a table
which I would expect to be impossible due to a unique constraint. I'm
hoping that somebody here will be able to suggest something I might have
missed.

Since these are text columns, one possibility you should be looking into
is that the indexes have become corrupt due to a change in the operating
system's sorting rules for the underlying locale. I don't recall details
at the moment, but I do remember that a recent glibc update changed the
sorting rules for some popular locale settings. If an installation had
applied such an update underneath an existing database, you'd have a
situation where existing entries in an index are not in-order according
to the new behavior of the text comparison operators, leading to havoc
because btree searching relies on the entries being correctly sorted.

See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on
which linux distros updated when.

Right, thanks to all who have suggested this.

It seems like a plausible explanation but it's worth noting that all the
indexed data here is (despite being in text columns), plain ascii. I'm
surprised that a change in collation rules would change the sorting of
such strings, and hence that it could lead to this problem. Am I naive?

To answer Adrian's question: the lengths of the values in the indexed
columns are identical between the duplicated rows.

Daniel Verite

daniel@manitou-mail.org

over 6 years ago

In reply to: Richard van der Hoff (#1)

Re: Inexplicable duplicate rows with unique constraint

Richard van der Hoff wrote:

synapse=> select count(*), room_id, type, state_key from
current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2;
count | room_id | type | state_key
-------+-----------------------------------+---------------+-------------------------------------
3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member |
@irc_ebi_:darkfasel.net
3 | !HwocBmCtBcHQhILtYQ:matrix.org | m.room.member |
@freenode_AlmarShenwan_:matrix.org

Looking at these columns which are of type text but do not
contain words of any particular language, there's probably
no point in using a linguistic-aware collation for them.

If you maintain the database schema, what you could do to avoid
the dependency on the OS collation and stay clear of the particular
upgrade difficulty of collations is to use COLLATE "C" for this kind of
field, as opposed to the default collation of the database.
As a bonus, operations with the "C" collations tend to be faster,
sometimes even much faster.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Adrian Klaver

adrian.klaver@aklaver.com

over 6 years ago

In reply to: Richard van der Hoff (#6)

Re: Inexplicable duplicate rows with unique constraint

On 1/16/20 9:24 AM, Richard van der Hoff wrote:

On 16/01/2020 17:12, Magnus Hagander wrote:

On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard van der Hoff <richard@matrix.org> writes:

I'm trying to track down the cause of some duplicate rows in a table
which I would expect to be impossible due to a unique constraint. I'm
hoping that somebody here will be able to suggest something I might
have
missed.

Since these are text columns, one possibility you should be looking into
is that the indexes have become corrupt due to a change in the operating
system's sorting rules for the underlying locale. I don't recall
details
at the moment, but I do remember that a recent glibc update changed the
sorting rules for some popular locale settings. If an installation had
applied such an update underneath an existing database, you'd have a
situation where existing entries in an index are not in-order according
to the new behavior of the text comparison operators, leading to havoc
because btree searching relies on the entries being correctly sorted.

See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on
which linux distros updated when.

Right, thanks to all who have suggested this.

It seems like a plausible explanation but it's worth noting that all the
indexed data here is (despite being in text columns), plain ascii. I'm
surprised that a change in collation rules would change the sorting of
such strings, and hence that it could lead to this problem. Am I naive?

In psql who does:

\l the_database_name

show?

To answer Adrian's question: the lengths of the values in the indexed
columns are identical between the duplicated rows.

--
Adrian Klaver
adrian.klaver@aklaver.com

Richard van der Hoff

richard@matrix.org

over 6 years ago

In reply to: Daniel Verite (#7)

Re: Inexplicable duplicate rows with unique constraint

On 16/01/2020 17:24, Daniel Verite wrote:

Looking at these columns which are of type text but do not
contain words of any particular language, there's probably
no point in using a linguistic-aware collation for them.

If you maintain the database schema, what you could do to avoid
the dependency on the OS collation and stay clear of the particular
upgrade difficulty of collations is to use COLLATE "C" for this kind of
field, as opposed to the default collation of the database.
As a bonus, operations with the "C" collations tend to be faster,
sometimes even much faster.

Good to know, thanks Daniel!

#10

Richard van der Hoff

richard@matrix.org

over 6 years ago

In reply to: Adrian Klaver (#8)

Re: Inexplicable duplicate rows with unique constraint

On 16/01/2020 17:27, Adrian Klaver wrote:

On 1/16/20 9:24 AM, Richard van der Hoff wrote:

It seems like a plausible explanation but it's worth noting that all
the indexed data here is (despite being in text columns), plain ascii.
I'm surprised that a change in collation rules would change the
sorting of such strings, and hence that it could lead to this problem.
Am I naive?

In psql who does:

\l the_database_name

show?

#11

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Richard van der Hoff (#6)

Re: Inexplicable duplicate rows with unique constraint

Richard van der Hoff <richard@matrix.org> writes:

On 16/01/2020 17:12, Magnus Hagander wrote:

See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on
which linux distros updated when.

It seems like a plausible explanation but it's worth noting that all the
indexed data here is (despite being in text columns), plain ascii. I'm
surprised that a change in collation rules would change the sorting of
such strings, and hence that it could lead to this problem. Am I naive?

Unfortunately, strings containing punctuation do sort differently
after these changes, even with all-ASCII data. The example given
on that wiki page demonstrates this.

RHEL6 (old glibc):

$ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort
11
1-1

Fedora 30 (new glibc):

$ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort
1-1
11

I concur with Daniel's suggestion that maybe "C" locale is
the thing to use for this data.

regards, tom lane

#12

Richard van der Hoff

richard@matrix.org

over 6 years ago

In reply to: Tom Lane (#11)

Re: Inexplicable duplicate rows with unique constraint

On 16/01/2020 17:48, Tom Lane wrote:

Richard van der Hoff <richard@matrix.org> writes:

On 16/01/2020 17:12, Magnus Hagander wrote:

See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on
which linux distros updated when.

It seems like a plausible explanation but it's worth noting that all the
indexed data here is (despite being in text columns), plain ascii. I'm
surprised that a change in collation rules would change the sorting of
such strings, and hence that it could lead to this problem. Am I naive?

Unfortunately, strings containing punctuation do sort differently
after these changes, even with all-ASCII data. The example given
on that wiki page demonstrates this.

Thank you Tom. I should learn to read properly.

I think this very much explains the symptoms we're seeing. Thanks very
much for the detailed and very helpful answers!

#13

Susan Hurst

susan.hurst@brookhurstdata.com

over 6 years ago

In reply to: Tom Lane (#11)

Re: Inexplicable duplicate rows with unique constraint

That's why I created a virtual_string function to squeeze out everything
but alpha characters and numbers 0-9 from any varchar or text columns
that I want to use as business key columns. For example, if I have a
column named job_name, I will have a companion column named v_job_name.
The v_ column is to replicate Oracle's virtual column, since postgres
doesn't have it. You don't put any values in the v_ column directly. I
simply have a trigger on insert or update to put the value in the
v_job_name column using the virtual_string(new.job_name) function. It's
the v_job_name column that use in my unique constraint so that I avoid
any unexpected sorting. Meanwhile, my job_name column is still human
readable with whatever characters I want to see, including diacritics.

Here is my function, if you want to try it out:

create or replace function store.virtual_string(string_in text)
returns text as
$body$
declare
l_return text;
begin
l_return := regexp_replace
(lower(unaccent(string_in)),'[^0-9a-z]','','g');
return l_return;
end;
$body$
language plpgsql volatile security definer
;

Sue

---
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Susan E Hurst
Principal Consultant
Brookhurst Data LLC
Email: susan.hurst@brookhurstdata.com
Mobile: 314-486-3261

Show quoted text

On 2020-01-16 11:48, Tom Lane wrote:

Richard van der Hoff <richard@matrix.org> writes:

On 16/01/2020 17:12, Magnus Hagander wrote:

See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on
which linux distros updated when.

It seems like a plausible explanation but it's worth noting that all
the
indexed data here is (despite being in text columns), plain ascii. I'm
surprised that a change in collation rules would change the sorting of
such strings, and hence that it could lead to this problem. Am I
naive?

Unfortunately, strings containing punctuation do sort differently
after these changes, even with all-ASCII data. The example given
on that wiki page demonstrates this.

RHEL6 (old glibc):

$ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort
11
1-1

Fedora 30 (new glibc):

$ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort
1-1
11

I concur with Daniel's suggestion that maybe "C" locale is
the thing to use for this data.

regards, tom lane

#14

Laurenz Albe

laurenz.albe@cybertec.at

over 6 years ago

In reply to: Richard van der Hoff (#1)

Re: Inexplicable duplicate rows with unique constraint

On Thu, 2020-01-16 at 16:50 +0000, Richard van der Hoff wrote:

I'm trying to track down the cause of some duplicate rows in a table
which I would expect to be impossible due to a unique constraint. I'm
hoping that somebody here will be able to suggest something I might have
missed.

The problem relates to a bug filed against our application
(https://github.com/matrix-org/synapse/issues/6696). At first I put this
down to random data corruption on a single user's postgres instance, but
I've now seen three separate reports in as many days and am wondering if
there is more to it.

[...]

So, question: what could we be doing wrong to get ourselves into this
situation?

Some other datapoints which may be relevant:
* this has been reported by one user on postgres 9.6.15 and one on
10.10, though it's hard to be certain of the version that was running
when the duplication occurred
* the constraint is added when the table is first created (before any
data is added)
* At least one user reports that he has recently migrated his database
from one server to another via a `pg_dump -C` and later piping into psql.

I see no hint that this may be the problem, but I have seen corruption
like this because of changes in the collations of the C library (which
PostgreSQL uses).

This only happens with collations other than C, and it cannot be caused
by dump/restore.

It may, however, be caused by the following:

- Upgrading the operating system where PostgreSQL is running to a
different glibc.

- Streaming replication between machines with different glibc version,
and failing over to the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com