pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

Started by Joe Van Dyk over 11 years ago, 8 messages, general
#1 Joe Van Dyk
joe@tanga.com

One of my postgres backends was killed by the oom-killer. Now, one of my
streaming replication slaves is reporting "invalid contrecord length 2190
at A6C/331AAA90" in the logs and replication has paused. I have other
streaming replication slaves that are fine.

Is that expected? It's happened twice in two days.

I'm running 9.3.5 on the master. I have 9.3.4 on the slave that has the
problem, and 9.3.5 on the slave that doesn't have the problem. Is this
something that was fixed in 9.3.5?

The slave that has the problem is also located across the country, while
the slave that works is in the same data center as the master -- not sure
if that's related at all.

Joe

#2 basti
basti@unix-solution.de
In reply to: Joe Van Dyk (#1)
Re: pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

Hello,

Months ago I had a similar problem with the OOM killer.
Have a look at
http://www.credativ.co.uk/credativ-blog/2010/03/postgresql-and-linux-memory-management

I hope that's helpful.

Regards,
basti

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 basti
mailinglist@unix-solution.de
In reply to: Joe Van Dyk (#1)
Re: pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

Hello,

Months ago I had a similar problem with the OOM killer.
Have a look at
http://www.credativ.co.uk/credativ-blog/2010/03/postgresql-and-linux-memory-management

I hope that's helpful.

Regards,
basti

#4 Joe Van Dyk
joe@tanga.com
In reply to: basti (#3)
Re: pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

On Mon, Oct 27, 2014 at 8:16 AM, basti <mailinglist@unix-solution.de> wrote:

> Hello,
>
> Months ago I had a similar problem with the OOM killer.
> Have a look at
> http://www.credativ.co.uk/credativ-blog/2010/03/postgresql-and-linux-memory-management

Thanks -- my question is not so much about the oom killer, but rather about
why just one of the slaves is reporting the "invalid contrecord length"
error.

#5 basti
mailinglist@unix-solution.de
In reply to: Joe Van Dyk (#4)
Re: pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

I'm no PG expert, but it seems that your WAL record is corrupt on just
this one slave.
Perhaps you can check this with md5 or something.

Perhaps your master process died at the moment the file was being written?
So the question is
"How does PG sync WAL files between multiple slaves?"
Async or synchronous?
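
A minimal sketch of the md5 check suggested above, not a definitive procedure. It assumes the default 16 MB WAL segment size, the segment naming used since 9.3, and timeline 1; the helper names are made up for illustration. It maps the LSN from the error message to a WAL segment file name and hashes only the bytes before that LSN, because the rest of a segment that is still being streamed may legitimately differ between master and slave.

```python
import hashlib

# Assumption: default WAL segment size (16 MB). Adjust if the server was
# built with a different segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn):
    """Split an LSN such as 'A6C/331AAA90' into its two 32-bit halves."""
    hi, lo = lsn.split("/")
    return int(hi, 16), int(lo, 16)


def wal_segment_name(lsn, timeline=1):
    """Name of the WAL segment file containing the LSN (timeline is assumed)."""
    hi, lo = parse_lsn(lsn)
    return "%08X%08X%08X" % (timeline, hi, lo // WAL_SEGMENT_SIZE)


def offset_in_segment(lsn):
    """Byte offset of the LSN inside its segment file."""
    _, lo = parse_lsn(lsn)
    return lo % WAL_SEGMENT_SIZE


def md5_prefix(path, nbytes):
    """MD5 of the first nbytes of a file, read in chunks."""
    digest = hashlib.md5()
    remaining = nbytes
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(65536, remaining))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    lsn = "A6C/331AAA90"
    print(wal_segment_name(lsn))        # 0000000100000A6C00000033
    print(hex(offset_in_segment(lsn)))
```

Running `md5_prefix` with the same offset on the matching file under `pg_xlog` on the master and on the slave, then comparing the two digests, would show whether the WAL before the reported position is identical on both machines.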

#6 Emanuel Calvo
emanuel.calvo@2ndquadrant.com
In reply to: Joe Van Dyk (#1)
Re: pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

On 25/10/14 at 17:55, Joe Van Dyk wrote:

> One of my postgres backends was killed by the oom-killer. Now, one of
> my streaming replication slaves is reporting "invalid contrecord
> length 2190 at A6C/331AAA90" in the logs and replication has paused. I
> have other streaming replication slaves that are fine.
>
> Is that expected? It's happened twice in two days.
>
> I'm running 9.3.5 on the master. I have 9.3.4 on the slave that has
> the problem, and 9.3.5 on the slave that doesn't have the problem. Is
> this something that was fixed in 9.3.5?
>
> The slave that has the problem is also located across the country,
> while the slave that works is in the same data center as the master --
> not sure if that's related at all.
>
> Joe

It's a corrupted slave. You'll need to regenerate it to ensure that your
data is safe.

The OOM killer is indiscriminate about what it kills, so this sort of
thing can happen without a proper kernel configuration.

--
Emanuel Calvo http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Andres Freund
andres@anarazel.de
In reply to: Joe Van Dyk (#1)
Re: pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

On 2014-10-25 13:55:57 -0700, Joe Van Dyk wrote:

> One of my postgres backends was killed by the oom-killer. Now, one of my
> streaming replication slaves is reporting "invalid contrecord length 2190
> at A6C/331AAA90" in the logs and replication has paused. I have other
> streaming replication slaves that are fine.

Is it a LOG or a PANIC message? Because it's not unexpected to see such
messages when reaching the end of the local and/or restore_command
provided WAL.
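
One way to answer the LOG-or-PANIC question is to pull the severity out of the matching log lines. The sketch below is illustrative only: it assumes the usual PostgreSQL convention of a severity keyword followed by a colon, and the function name and the sample line in the usage note are hypothetical, not taken from the poster's logs.

```python
import re

# PostgreSQL message severity keywords, each followed by a colon in the log.
SEVERITY_RE = re.compile(
    r"\b(DEBUG[1-5]?|INFO|NOTICE|WARNING|ERROR|LOG|FATAL|PANIC):\s"
)


def severities_for(lines, needle="invalid contrecord length"):
    """Return (severity, line) pairs for every log line containing needle.

    severity is None when no known severity keyword is found on the line.
    """
    found = []
    for line in lines:
        if needle in line:
            match = SEVERITY_RE.search(line)
            found.append((match.group(1) if match else None, line.rstrip("\n")))
    return found


if __name__ == "__main__":
    import sys

    # Usage: python check_severity.py /path/to/postgresql.log
    with open(sys.argv[1], errors="replace") as log_file:
        for severity, line in severities_for(log_file):
            print(severity, "|", line)
```

For a made-up line such as `2014-10-25 13:56:01 PDT LOG:  invalid contrecord length 2190 at A6C/331AAA90`, the function reports the severity as `LOG`.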

> I'm running 9.3.5 on the master. I have 9.3.4 on the slave that has the
> problem, and 9.3.5 on the slave that doesn't have the problem. Is this
> something that was fixed in 9.3.5?

We have really no information to answer that question accurately.

So you really need to provide logs and such.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8 Joe Van Dyk
joe@tanga.com
In reply to: Andres Freund (#7)
Re: pg killed by oom-killer, "invalid contrecord length 2190 at A6C/331AAA90" on slaves

On Tue, Oct 28, 2014 at 7:43 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> On 2014-10-25 13:55:57 -0700, Joe Van Dyk wrote:
>
>> One of my postgres backends was killed by the oom-killer. Now, one of my
>> streaming replication slaves is reporting "invalid contrecord length 2190
>> at A6C/331AAA90" in the logs and replication has paused. I have other
>> streaming replication slaves that are fine.
>
> Is it a LOG or a PANIC message? Because it's not unexpected to see such
> messages when reaching the end of the local and/or restore_command
> provided WAL.

It's a log message. The server is still running, just replication has
paused.

>> I'm running 9.3.5 on the master. I have 9.3.4 on the slave that has the
>> problem, and 9.3.5 on the slave that doesn't have the problem. Is this
>> something that was fixed in 9.3.5?
>
> We have really no information to answer that question accurately.
>
> So you really need to provide logs and such.

I'll try to find something next time it happens.

Joe
