WAL receive process dies
We have a periodic network connectivity issue (unrelated to Postgres) that
is causing the replication to fail.
We are running Postgres 9.3 using streaming replication. We also have WAL
archives available to be replayed with restore_command. Typically when I
bring up a slave it copies over WAL archives for a while before connecting
via streaming replication.
When I notice the machine is behind in replication, I also notice that the
WAL receiver process has died. There didn't seem to be any information in
the logs about it. The solution thus far has been to restart Postgres. I
have two questions:
1. It seems that Postgres does not fall back to copying WAL archives with
its restore_command. I just want to confirm that this is what Postgres is
supposed to do when its connection via streaming replication times out.
2. Is it possible to restart replication after the WAL receiver process has
died without restarting Postgres?
Thanks!
Patrick
On 08/28/2014 09:39 AM, Patrick Krecker wrote:
We have a periodic network connectivity issue (unrelated to Postgres)
that is causing the replication to fail.
We are running Postgres 9.3 using streaming replication. We also have
WAL archives available to be replayed with restore_command. Typically
when I bring up a slave it copies over WAL archives for a while before
connecting via streaming replication.
When I notice the machine is behind in replication, I also notice that
the WAL receiver process has died. There didn't seem to be any
information in the logs about it.
What did you search for?
Do you have core dumps enabled? That'd be a good first step. (Exactly
how to do this depends on the OS/distro/version, but you basically want
to set "ulimit -c unlimited" on some ancestor of the postmaster).
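As a minimal sketch, assuming a pg_ctl-managed 9.3 install (the paths here are
assumptions, not known details of your setup):

```shell
# Raise the soft core-file size limit in a shell that is an ancestor of
# the postmaster, then start Postgres from that same shell.
ulimit -c unlimited
pg_ctl -D /var/lib/postgresql/9.3/main start

# Any resulting core files land in the data directory; inspect with gdb:
#   gdb /usr/lib/postgresql/9.3/bin/postgres /var/lib/postgresql/9.3/main/core
```

On Debian/Ubuntu packaged installs the limit is usually set in the init script
or its defaults file rather than interactively.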
1. It seems that Postgres does not fall back to copying WAL archives
with its restore_command. I just want to confirm that this is what
Postgres is supposed to do when its connection via streaming replication
times out.
It should fall back.
2. Is it possible to restart replication after the WAL receiver process
has died without restarting Postgres?
PostgreSQL should do so itself.
Please show your recovery.conf (appropriately redacted) and
postgresql.conf for the replica, and complete logs for the time period
of interest. You'll want to upload the logs somewhere then link to them,
do not attach them to an email to the list.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hi Craig -- Sorry for the late response, I've been tied up with some other
things for the last day. Just to give some context, this is a machine that
sits in our office and replicates from another read slave in production via
a tunnel set up with spiped. The spiped tunnel is working and postgres is
still stuck (it has been stuck since 8-25).
The last moment that replication was working was 2014-08-25
22:06:05.03972. We have a table called replication_time with one column and
one row that has a timestamp that is updated every second, so it's easy to
tell the last time this machine was in sync with production.
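The heartbeat scheme can be sketched as follows (the exact form of our update
job is omitted here; this is just the idea, with the table and column names
from above):

```shell
# Master side, run once a second by the upstart job:
psql -c "UPDATE replication_time SET modtime = now();"

# Replica side: the age of the single row is the replication lag.
psql -c "SELECT now() - modtime AS replication_lag FROM replication_time;"
```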
recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q
Currently the WAL receive process is still not running. Interestingly,
another pg instance running on the same machine is replicating just fine.
A note about that: there is another instance running on that machine and a
definite race condition with restore_wal_s3.py, which writes the file to
/tmp before copying it to the destination requested by postgres (I just
discovered this today, this is not generally how we run our servers). So,
if both are restoring at the same time, they will step on the WAL archives
being unzipped in /tmp and bad things will happen. But, interestingly, I
checked the logs for the other machine and there is no activity on that
day. It does not appear that the WAL replay was invoked or that the WAL
receive timed out.
As for enabling the core dump, it seems that it needs to be done when
Postgres starts, and I thought I would leave it running in its "stuck" state
for now. However, if you know how to enable it on a running process, let me
know. We are running Ubuntu 13.10.
On 2014-08-29 13:04:43 -0700, Patrick Krecker wrote:
Hi Craig -- Sorry for the late response, I've been tied up with some other
things for the last day. Just to give some context, this is a machine that
sits in our office and replicates from another read slave in production via
a tunnel set up with spiped. The spiped tunnel is working and postgres is
still stuck (it has been stuck since 8-25).
The last moment that replication was working was 2014-08-25
22:06:05.03972. We have a table called replication_time with one column and
one row that has a timestamp that is updated every second, so it's easy to
tell the last time this machine was in sync with production.
recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q
The problem is this log entry:
2014-08-27 18:44:27 PDT ERROR: requested starting point 175/28000000 is
ahead of the WAL flush position of this server 174/B76D16A8
That's the walreceiver connecting to the upstream node and asking for
WAL. Somehow the standby has gotten *considerably* ahead of the node
it's trying to receive WAL from.
Are you sure primary_conninfo / the spiped tunnel connects to the right
postgres instance? Did you possibly have a failover or something like that?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 29, 2014 at 2:11 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
We didn't have a failover. We just have one master and replica in different
AZs on Amazon EC2.
I ran the following on the local endpoint of spiped:
while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select
modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location()
from replication_time"; done;
And the same command on production and I was able to verify that the xlogs
for a given point in time were the same (modtime is updated every second by
an upstart job):
spiped from office -> production:
          modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
 2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
Ran directly on production replica:
          modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
 2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
To me, this is sufficient proof that spiped is indeed talking to the
machine I think it's talking to (also lsof reports the correct hostname).
I created another basebackup from the currently stuck postgres instance on
another machine and I also get this error:
2014-08-29 15:27:30 PDT FATAL: could not receive data from WAL stream:
ERROR: requested starting point 177/2D000000 is ahead of the WAL flush
position of this server 174/B76D16A8
However, this new instance is able to fetch logs from S3 and replay them
without issue.
Is it possible that the data dir on the stuck instance is just corrupt? It
is not impossible for this to have happened at some point in the past due
to the race condition in fetching logs from S3 I mentioned above.
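For what it's worth, the usual fix for that race is to unpack into a private
temp file and then rename into place, since a rename within one filesystem is
atomic. A minimal sketch -- `fetch_wal` is a stand-in for the actual S3
download and gunzip step, and all names and paths here are assumptions, not
our real script:

```shell
# Race-free restore_command sketch: each invocation writes to its own
# mktemp file next to the destination, then atomically renames it, so
# two instances restoring concurrently never see a half-written file.

ARCHIVE_DIR=${ARCHIVE_DIR:-/var/lib/wal_archive}   # assumed location

fetch_wal() {   # stand-in for the real S3 download + gunzip
    cp "$ARCHIVE_DIR/$1" "$2"
}

restore_wal() {
    segment=$1; dest=$2
    tmp=$(mktemp "${dest}.XXXXXX")   # private, per-invocation temp file
    fetch_wal "$segment" "$tmp"
    mv "$tmp" "$dest"                # atomic on the same filesystem
}
```

In recovery.conf this would be wired up as
restore_command = 'restore_wal.sh %f %p'.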
[FWIW: proper quoting makes answering easier and thus more likely]
On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
Uh. this indicates that the machine you're talking to is *not* one of
the above as it has a flush position of '174/B76D16A8' - not something
that's really possible when the node actually is at '177/2E80C9F8'.
Could you run, on the standby that's having problems, the following
command:
psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 29, 2014 at 3:46 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
RE: quoting, I wonder if Gmail is messing it up somehow? Or am I doing
something else wrong? Sorry :(
First, I apologize for the misleading information, but when I made another
basebackup and tried to use it, I configured the machine to cascade from
the stuck replica, *not* from the spiped endpoint. When I properly
connected it to the spiped endpoint it synced up fine, giving this log line:
2014-08-29 16:16:21 PDT LOG: started streaming WAL from primary at
177/4F000000 on timeline 1
The command as you gave it reported a syntax error as is, but I googled a
little bit and ran this one:
psql 'replication=1 dbname=XXX host=127.0.0.1 port=5445 user=XXX
password=XXX' -c 'IDENTIFY_SYSTEM;'
And it gave me this output:
systemid | timeline | xlogpos
---------------------+----------+--------------
5964163898407843711 | 1 | 177/53091990
(One of Patrick's coworkers here; thanks a lot for your assistance)
Just in case you wanted this as well, I ran
psql 'replication=1 dbname=XXX host=127.0.0.1 port=5432 user=XXX
password=XXX' -c 'IDENTIFY_SYSTEM;'
(5432 is the stuck replica and 5445 is the pipe to the working replica)
systemid | timeline | xlogpos
---------------------+----------+--------------
5964163898407843711 | 1 | 174/B76D16A8
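Those two xlogpos values put the gap in perspective: an LSN of the form HI/LO
is just HI * 2^32 + LO bytes, so the difference between the working replica
(177/53091990) and the stuck one (174/B76D16A8) can be computed directly. A
small shell sketch (on 9.3 the server itself can do this with
pg_xlog_location_diff()):

```shell
# Convert a PostgreSQL LSN of the form HI/LO (both hex) to an absolute
# byte position: HI * 2^32 + LO.
lsn_to_bytes() {
    hi=${1%/*}; lo=${1#*/}
    echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

# Gap between the working replica and the stuck replica:
gap=$(( $(lsn_to_bytes 177/53091990) - $(lsn_to_bytes 174/B76D16A8) ))
echo "$gap bytes"    # roughly 10.4 GiB of WAL
```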
On Sat, Aug 30, 2014 at 7:04 AM, Chris Hundt <chris@judicata.com> wrote:
I restarted the server and it promptly started fetching WAL archives from
our backups (logs: http://pastebin.com/ZxM1LsNw). I am still at a loss as
to why the binary replication couldn't reconnect and why Postgres didn't
fall back to fetching WAL archives with its restore_command.