replication behind high lag

Started by AI Rumman about 13 years ago · 8 messages · general
#1 AI Rumman
rummandba@gmail.com

Hi,

I have two 9.2 databases running with hot_standby replication. When I
checked today, I found that replication has not been working since Mar 1st.
A large database was restored on the master that day, and I believe the lag
grew after that.

On the master:

SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset;

431326108320

On the slave:

SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive,
       pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay;

   receive    |    replay
--------------+--------------
 245987541312 | 245987534032
(1 row)
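
As a rough sketch, the lag implied by these numbers can be worked out by subtracting the byte offsets. The variable names are illustrative only; the values are the ones reported above.

```python
# Byte offsets reported by pg_xlog_location_diff(..., '0/0') above.
master_current = 431326108320   # pg_current_xlog_location() on the master
slave_receive = 245987541312    # pg_last_xlog_receive_location() on the slave
slave_replay = 245987534032     # pg_last_xlog_replay_location() on the slave

receive_lag = master_current - slave_receive   # WAL the slave has not received
replay_lag = slave_receive - slave_replay      # WAL received but not yet replayed

GIB = 1024 ** 3
print(f"receive lag: {receive_lag} bytes ({receive_lag / GIB:.1f} GiB)")
print(f"replay lag:  {replay_lag} bytes")
```

With these values the slave is about 172.6 GiB of WAL behind in receiving, while replay is only 7280 bytes behind what was received. That suggests the slave stopped receiving WAL rather than falling behind on replay.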

I checked pg_xlog on both servers. On the slave, the last xlog file is
-rw------- 1 postgres postgres 16777216 Mar 1 06:02
00000001000000390000007F

On the master, the first xlog file is
-rw------- 1 postgres postgres 16777216 Mar 1 04:45
00000001000000390000005E
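
A small sketch for comparing the two file names: a WAL segment name is 24 hexadecimal digits, read here as three 8-digit fields (timeline, log file, segment). The helper name `parse_wal_name` is hypothetical.

```python
def parse_wal_name(name):
    """Split a 24-hex-digit WAL segment file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError(f"unexpected WAL file name: {name!r}")
    return tuple(int(name[i:i + 8], 16) for i in (0, 8, 16))

slave_last = parse_wal_name("00000001000000390000007F")
master_first = parse_wal_name("00000001000000390000005E")

print("slave last:  ", slave_last)
print("master first:", master_first)
# Tuples compare field by field: timeline first, then log file, then segment.
print("master's oldest file is not newer than slave's last:", master_first <= slave_last)
```

Going by the names alone, the oldest segment still on the master is older than the last one on the slave. This says nothing about whether every later segment is intact, so it is only a hint.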

Is there a quick way to bring the slave back in sync?

Thanks.

#2 Lonni J Friedman
netllama@gmail.com
In reply to: AI Rumman (#1)
Re: replication behind high lag

On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman <rummandba@gmail.com> wrote:

Is there a quick way to bring the slave back in sync?

Generate a new base backup, and seed the slave with it.
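
One way this could look on 9.2 is sketched below, assuming pg_basebackup is used for the copy. The host name, role name and data directory are placeholders, and the script only prints the command and a minimal recovery.conf; it does not run anything.

```python
import shlex

def basebackup_command(master_host, replication_user, data_dir):
    """Return a pg_basebackup argument list for copying the master into data_dir."""
    return [
        "pg_basebackup",
        "-h", master_host,        # master to copy from (placeholder)
        "-U", replication_user,   # role allowed to connect for replication (placeholder)
        "-D", data_dir,           # empty target data directory on the slave
        "-X", "stream",           # stream WAL while the backup is being taken
        "-P",                     # report progress
    ]

def recovery_conf(master_host, replication_user):
    """Return minimal recovery.conf contents that start the copy as a standby."""
    conninfo = f"host={master_host} user={replication_user}"
    return f"standby_mode = 'on'\nprimary_conninfo = '{conninfo}'\n"

cmd = basebackup_command("master.example.com", "replicator", "/var/lib/pgsql/9.2/data")
print(shlex.join(cmd))
print(recovery_conf("master.example.com", "replicator"))
```

The slave would need to be stopped and its old data directory moved aside before the copy, and the master must allow replication connections from the slave. Those steps depend on the installation and are not shown.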

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 AI Rumman
rummandba@gmail.com
In reply to: Lonni J Friedman (#2)
Re: replication behind high lag

On Mon, Mar 25, 2013 at 3:40 PM, Lonni J Friedman <netllama@gmail.com> wrote:

Generate a new base backup, and seed the slave with it.

OK. I am getting this error on the slave:
LOG: invalid contrecord length 284 in log file 57, segment 127, offset 0

What is the actual cause?

Thanks.
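
As a sketch, the "log file" and "segment" numbers in this message can be turned back into a WAL file name by formatting each as 8 hexadecimal digits. The helper name is hypothetical, and timeline 1 is an assumption taken from the "00000001" prefix of the files listed in pg_xlog.

```python
def wal_file_name(timeline, log, segment):
    """Build a WAL segment file name from three numbers, 8 hex digits each."""
    return f"{timeline:08X}{log:08X}{segment:08X}"

# 57 and 127 come from the error message; timeline 1 is assumed.
name = wal_file_name(1, 57, 127)
print(name)
```

The result, 00000001000000390000007F, is the same name as the last xlog file on the slave, so the message appears to point at the segment where the slave stopped.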

#4 Lonni J Friedman
netllama@gmail.com
In reply to: AI Rumman (#3)
Re: replication behind high lag

On Mon, Mar 25, 2013 at 12:43 PM, AI Rumman <rummandba@gmail.com> wrote:

OK. I am getting this error on the slave:
LOG: invalid contrecord length 284 in log file 57, segment 127, offset 0

What is the actual cause?

Corruption? What were you doing when you saw the error?

#5 AI Rumman
rummandba@gmail.com
In reply to: Lonni J Friedman (#4)
Re: replication behind high lag

On Mon, Mar 25, 2013 at 3:52 PM, Lonni J Friedman <netllama@gmail.com> wrote:

Corruption? What were you doing when you saw the error?

I do not know much about this. I was only just given this database and then
saw the error.
Is there any way to recover from this state? The master is a large database
of 500 GB.

#6 Lonni J Friedman
netllama@gmail.com
In reply to: AI Rumman (#5)
Re: replication behind high lag

On Mon, Mar 25, 2013 at 12:55 PM, AI Rumman <rummandba@gmail.com> wrote:

I do not know much about this. I was only just given this database and then
saw the error.
Is there any way to recover from this state? The master is a large database
of 500 GB.

Generate a new base backup, and seed the slave with it. If the error
persists, then I'd guess that your master is corrupted, and then
you've got huge problems.

#7 AI Rumman
rummandba@gmail.com
In reply to: AI Rumman (#1)
Re: replication behind high lag

On Mon, Mar 25, 2013 at 4:03 PM, AI Rumman <rummandba@gmail.com> wrote:

The master is running fine right now, showing only a warning:
WARNING: archive_mode enabled, yet archive_command is not set

Do you think the master could be corrupted?
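
The warning refers to two settings in postgresql.conf. A minimal sketch of the condition it describes follows, using a hypothetical dict of settings rather than a real server connection.

```python
def archiving_misconfigured(settings):
    """True when archive_mode is on but archive_command is empty or missing."""
    mode = settings.get("archive_mode", "off").strip().lower()
    command = settings.get("archive_command", "").strip()
    return mode == "on" and command == ""

# Hypothetical settings matching the warning shown above.
settings = {"archive_mode": "on", "archive_command": ""}
print(archiving_misconfigured(settings))
```

According to the PostgreSQL documentation for archive_command, a server in this state keeps accumulating WAL segment files until a command is provided, so disk usage in pg_xlog is worth watching on the master.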

Hi,

I have learned that the master database was restarted on Feb 27th. Could
this be the cause of this error?

Thanks.

#8 Lonni J Friedman
netllama@gmail.com
In reply to: AI Rumman (#7)
Re: replication behind high lag

On Mon, Mar 25, 2013 at 1:23 PM, AI Rumman <rummandba@gmail.com> wrote:

Hi,

I have learned that the master database was restarted on Feb 27th. Could
this be the cause of this error?

Restarting the database cleanly should never cause corruption. Again,
you need to create a new base backup, and seed the slave with it. If
the problem persists, then the master is likely corrupted.
