Missing PRIMARY KEYs and duplicated rows
Hi all, we're encountering a very strange issue lately on a 9.4.5 server.
We recently performed a dump & restore of the entire cluster to
migrate to a new server. All but one of the databases seem fine.
That one is encountering a very strange issue.
It seems that periodically, the primary keys (and underlying indexes)
of all tables in the database disappear, and every row in each table
gets duplicated. This appears to occur at the storage level, as the
duplicated rows get their own ctids (so it's not just a query
artifact). I've seen this happen twice in the past week.
I can delete the duplicate rows (by their ctid). In the first
instance, I was unable to re-add the primary key, as it had reappeared
by the time I issued the statement. In the most recent instance, I
did re-add the primary key to a table, and the other primary keys have
not yet re-appeared.
All indexes are b-tree indexes. They were not created using
CONCURRENTLY. They are in the default tablespace. The installation
is on an Intel CentOS machine using the PGDG RPMs. I've checked the
release notes for all 9.4 bugfixes and do not see anything which could
relate to this issue.
Any insight into this issue? We will probably just drop and recreate
the database from the recent dump, as it is on a production server and
we'd rather not take chances tinkering around with it. But it would
be nice to know why/how we've encountered this to ensure it won't
happen again, because frankly it's quite a scary issue.
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Wed, Apr 12, 2017 at 1:46 PM, Chris Pacejo <cpacejo@clearskydata.com> wrote:
All indexes are b-tree indexes.
Can you show us the definition of all affected indexes? Any
discernible pattern to them?
--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Wed, Apr 12, 2017 at 4:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 12, 2017 at 1:46 PM, Chris Pacejo <cpacejo@clearskydata.com> wrote:
All indexes are b-tree indexes.
Can you show us the definition of all affected indexes? Any
discernible pattern to them?
They are bog-standard b-trees created on behalf of a primary key,
almost always an integer or bigint. E.g., from one of the unaffected
databases:
Indexes:
"pk_databasechangeloglock" PRIMARY KEY, btree (id)
The primary keys disappear from pg_class as well.
The only pattern is that they're all in the same database. Which
seems very strange to me; I'm having trouble thinking of what part of
Postgres would affect both schema AND data in all tables of ONE
database. If I remember correctly each table is stored in a separate
file, and the WAL and server processes are shared across all
databases.
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On 13/04/17 08:46, Chris Pacejo wrote:
Hi all, we're encountering a very strange issue lately on a 9.4.5 server.
We recently performed a dump & restore of the entire cluster to
migrate to a new server. All but one of the databases seem fine.
That one is encountering a very strange issue.It seems that periodically, the primary keys (and underlying indexes)
of all tables in the database disappear, and every row in each table
gets duplicated. This appears to occur at the storage level, as the
duplicated rows get their own ctids (so it's not just a query
artifact). I've seen this happen twice in the past week.I can delete the duplicate rows (by their ctid). In the first
instance, I was unable to re-add the primary key, as it had reappeared
by the time I issued the statement. In the most recent instance, I
did re-add the primary key to a table, and the other primary keys have
not yet re-appeared.All indexes are b-tree indexes. They were not created using
CONCURRENTLY. They are in the default tablespace. The installation
is on an Intel CentOS machine using the PGDG RPMs. I've checked the
release notes for all 9.4 bugfixes and do not see anything which could
relate to this issue.Any insight into this issue? We will probably just drop and recreate
the database from the recent dump, as it is on a production server and
we'd rather not take chances tinkering around with it. But it would
be nice to know why/how we've encountered this to ensure it won't
happen again, because frankly it's quite a scary issue.
I'd be inclined to suspect the hardware with weird behaviour like that
(memory and disk errors in particular).
Cheers
Mark
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Apr 12, 2017 6:13 PM, "Mark Kirkwood" <mark.kirkwood@catalyst.net.nz>
wrote:
I'd be inclined to suspect the hardware with weird behaviour like that
(memory and disk errors in particular).
Me too, but how would that explain the observation that it's occurring only
within one database (and on all tables within that database)? The only
thing all tables within a database have in common on disk is that they
reside in the same directory.
And the ctids of the duplicate rows indicate that they're allocated
sequentially in the same page as the legitimate rows. (The tables are
mostly all less than a dozen rows.) How could this be explained by disk
corruption?
I suppose memory corruption is possible, but it still seems oddly specific.
Import Notes
Reply to msg id not found: CAC8iE5gX_4aXtGm3OV4kKkH=U83KexTGC=Cxqd7kEYKy0AHrQA@mail.gmail.com
On 13/04/17 10:22, Chris Pacejo wrote:
On Apr 12, 2017 6:13 PM, "Mark Kirkwood"
<mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
wrote:I'd be inclined to suspect the hardware with weird behaviour like
that (memory and disk errors in particular).Me too, but how would that explain the observation that it's occurring
only within one database (and on all tables within that database)? The
only thing all tables within a database have in common on disk is that
they reside in the same directory.And the ctids of the duplicate rows indicate that they're allocated
sequentially in the same page as the legitimate rows. (The tables are
mostly all less than a dozen rows.) How could this be explained by
disk corruption?I suppose memory corruption is possible, but it still seems oddly
specific.
Things disappearing and coming back sounds more like bad memory to as
well. I'd recommend taking that server down and running memcheck.
regards
Mark
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Aha. In fact it is none of the above. This turns out to be a
mundane, if insidious, combination of DNS weirdness and doubly-run
restore script. Thank you all for your time, sorry to bother the
list.
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs