Streaming replication does not work across datacenters with 20ms latency?
I have been running PostgreSQL streaming replication, which worked fine when the two
machines were in the same datacenter. Recently I planned to deploy a
new slave in a different datacenter; the latency between the master and
slave is 20ms.
Below is the related configuration.
Both master and slave have the following settings:
hot_standby = on
wal_level = hot_standby
max_wal_senders = 5
checkpoint_segments = 64
wal_keep_segments = 128
I am using pgpool to automate this, but the method is similar to the
one described here:
http://wiki.postgresql.org/wiki/Streaming_Replication
The data dir size is about 30G. I have tried many times, but every
time, after the sync finished and the slave was started, postgresql
just hangs there with the error messages attached below, and any attempt to
connect returns the error "psql: FATAL: the database system is
starting up".
The strange part is that with the same configuration, other slaves in the same
datacenter work fine...
What do "invalid record length" and "invalid magic number" normally
mean? Is the xlog corrupted?
Thanks for any further help!
The log messages at debug5 level look like this (just clips; I could
upload the full log file if necessary):
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG: database
system was interrupted; last known up at 2011-07-23 07:07:57 CDT
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: forked
new backend, pid=17998 socket=8
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: forked
new backend, pid=17999 socket=8
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]FATAL: the database system is starting up
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: shmem_exit(1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: proc_exit(1): 1 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: exit(1)
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: shmem_exit(-1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: proc_exit(-1): 0 callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: server
process (PID 17999) exited with exit code 1
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)FATAL: the database system is
starting up
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: shmem_exit(1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: proc_exit(1): 1 callbacks
to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: exit(1)
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: shmem_exit(-1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: proc_exit(-1): 0
callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: server
process (PID 17998) exited with exit code 1
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
standby_mode = 'on'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
primary_conninfo = 'host=jefferson port=5432 user=postgres'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
trigger_file = '/var/log/pgpool/trigger/trigger_file1'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG: entering
standby mode
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: could
not open file "pg_xlog/0000000300000054000000DB" (log file 84, segment
219): No such file or directory
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: record
known xact 36933672 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: remove
KnownAssignedXid 36933672
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: record
known xact 36933674 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog
redo insert: rel 1663/16386/17404; tid 18378/37
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ LOG: invalid
record length at 54/DDFE4010
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: remove
KnownAssignedXid 36929085
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog
redo commit: 2011-07-23 06:33:29.760915-05
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: record
known xact 36929100 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog
redo insert: rel 1663/16386/16436; tid 88370/2
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: record
known xact 36929109 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog
redo insert: rel 1663/16386/16436; tid 88370/3
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ LOG: invalid
magic number 0000 in log file 84, segment 219, offset 7733248
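A note for reading the log above: a WAL segment file name is 24 hexadecimal digits, eight each for the timeline, the log file number, and the segment number. The sketch below is only an illustration (Python; the function name decode_wal_name is made up here). It decodes the name from the "could not open file" line and gives timeline 3, log file 84, segment 219, matching the figures the server prints next to it.

```python
def decode_wal_name(name):
    """Split a 24-hex-digit WAL segment file name into its three fields."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)   # first 8 digits: timeline ID
    log_id = int(name[8:16], 16)    # middle 8 digits: log file number
    segment = int(name[16:24], 16)  # last 8 digits: segment within that log file
    return timeline, log_id, segment


if __name__ == "__main__":
    # File name taken from the "could not open file" DEBUG line above.
    print(decode_wal_name("0000000300000054000000DB"))  # (3, 84, 219)
```

So the file the standby could not open and the segment named in the "invalid magic number" line are the same segment.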
--
On Jul 23, 2011, at 6:50 AM, Yan Chunlu wrote:
what does invalid record length and invalid magic number normally
means? xlog corrupted?
Thanks for any further help!
It means your build settings for pg are not compatible across the 2 machines. For instance, one machine is 32-bit and the other is 64-bit, or one machine is big-endian and the other is little-endian...
--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
Thanks for the help!
Are there any other possible reasons?
Both systems are running Debian amd64, as uname -a shows:
Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
x86_64 GNU/Linux
and the following program reports that both of them are little-endian:
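#include <stdio.h>
#include <stdbool.h>

/* Returns true on a big-endian machine, false on a little-endian one. */
bool isBigEndian(void)
{
    int no = 1;
    /* Look at the lowest-addressed byte of the integer 1. */
    char *chk = (char *)&no;
    /* On little-endian the least significant byte (1) is stored first. */
    return chk[0] != 1;
}

int main(void)
{
    /* Prints 0 for little-endian, 1 for big-endian. */
    printf("this is %d \n", (int)isBigEndian());
    return 0;
}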
--
闫春路
On 23 July 2011, 18:14, Yan Chunlu wrote:
thanks for the help!
are there any other possible reasons?
both system are using Debian amd64, as uname -a shows:
Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
x86_64 GNU/Linux
It is not just about the architecture; it means PostgreSQL was
configured somehow differently during the build. E.g. a different block
size or WAL block size would cause such problems.
Or maybe one of the builds might be 32-bit for some reason (you can run
a 32-bit system in a 64-bit environment). You can do this
$ less postgres | grep Class
to check this (ELF32 => 32bit, ELF64 => 64bit).
Did you use the same binary packages or have you built the server yourself?
Tomas
On 07/23/2011 10:55 AM, Tomas Vondra wrote:
Did you use the same binary packages or have you built the server yourself?
Run a pg_config and compare the differences.
--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579
On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Did you use the same binary packages or have you built the server yourself?
Different date formats too.
I used apt-get to install postgresql; running pg_config shows they
are exactly the same...
Output on master:
https://gist.github.com/1102148
Output on slave:
https://gist.github.com/1102151
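For anyone repeating this comparison, here is a minimal sketch (Python) of diffing two saved pg_config outputs key by key instead of by eye. It assumes the usual "KEY = value" line layout; the function names and the sample strings are made up for illustration and are not the contents of the gists above.

```python
def parse_pg_config(text):
    """Parse pg_config output ("KEY = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines that do not look like "KEY = value"
            settings[key.strip()] = value.strip()
    return settings


def diff_pg_config(master_text, slave_text):
    """Return {key: (master_value, slave_value)} for every key that differs."""
    master = parse_pg_config(master_text)
    slave = parse_pg_config(slave_text)
    return {
        key: (master.get(key), slave.get(key))
        for key in sorted(set(master) | set(slave))
        if master.get(key) != slave.get(key)
    }


if __name__ == "__main__":
    # Made-up sample values, only to show the call.
    master_out = "BINDIR = /usr/bin\nCONFIGURE = '--enable-integer-datetimes'\n"
    slave_out = "BINDIR = /usr/bin\nCONFIGURE = '--enable-integer-datetimes'\n"
    print(diff_pg_config(master_out, slave_out) or "no differences")
```

An empty result means the two builds report the same paths and configure flags, which is what was observed here.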
less postgres didn't show anything... because it's a binary, so I tried grep with -a:
less postgres | grep -a ELF
less postgres | grep -a Class
but found nothing related to (ELF32 => 32bit, ELF64 => 64bit).
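The less-based check presumably depends on a less input preprocessor that dumps ELF headers, which may not be set up on these machines (that is a guess, not something confirmed in this thread). The class can also be read straight from the file: an ELF binary starts with the bytes 0x7f 'E' 'L' 'F', and the next byte is 1 for 32-bit or 2 for 64-bit. A minimal sketch in Python, with made-up function names:

```python
def elf_class(header):
    """Return "ELF32" or "ELF64" for the leading bytes of a file, else None."""
    # An ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # Byte 4 (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects.
    return {1: "ELF32", 2: "ELF64"}.get(header[4])


def elf_class_of_file(path):
    """Read the first five bytes of the file at path and classify them."""
    with open(path, "rb") as f:
        return elf_class(f.read(5))


if __name__ == "__main__":
    import sys
    # Pass the path of the postgres binary on each machine as an argument.
    for path in sys.argv[1:]:
        print(path, elf_class_of_file(path))
```

Running readelf -h on the binary prints the same information in its "Class:" line.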
On Saturday, July 23, 2011 7:43:56 pm Yan Chunlu wrote:
I used apt-get to install postgresql, running pg_config showing they
are exactly the same...
running on master:
https://gist.github.com/1102148
running on slave:
https://gist.github.com/1102151
Are you sure there is only one instance of Postgres running on each machine?
--
Adrian Klaver
adrian.klaver@gmail.com
The system date formats? They look the same:
master:
#date
Sat Jul 23 21:53:34 CDT 2011
slave:
#date
Sat Jul 23 21:52:56 CDT 2011
On Sat, Jul 23, 2011 at 8:53 PM, Yan Chunlu <springrider@gmail.com> wrote:
the system date formats? looks the same:
hehe, no, the internal formats. There's a floating point and an
integer method. They have to be the same and according to your output
of pg_config they are, with this config flag listed for both:
--enable-integer-datetimes
btw, integer is preferred over floating point for date types, at least
as far as I know.
On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote:
I used apt-get to install postgresql, running pg_config showing they
are exactly the same...
BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well over 100ms and no problems.
--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
Thanks for all the help!
@Adrian: yes, only one instance on each machine.
Now the slave finally started and could be connected to, but replication didn't
begin; there are just the following errors:
https://gist.github.com/1102225
BTW: is it possible that rsync had finished but the data had not been
flushed to disk, so that when postgresql started it saw corrupted
files?
Is there anything special you have configured on master and slave?
Could I see the related configuration on your master and slave, such
as wal_keep_segments, checkpoint_segments, or anything else that might be
related? Thanks a lot!
On 24 July 2011, 6:09, Yan Chunlu wrote:
thanks for all the help!
@Adrian: yes, only one instance on each machine
not the slave finally started and could be connect, replication didn't
begin, just following errors:
https://gist.github.com/1102225
These errors just mean the master already removed WAL segments, so the
slave can't actually start the replication because there'd be a gap. This
usually happens when there is enough write activity (inserts, updates) while the
slave is being set up.
What is your wal_keep_segments value? Increase it or set up WAL
archiving, so that the slave can get the data.
Tomas
checkpoint_segments = 64
wal_keep_segments = 128
Hi
I am working with pgpool / postgresql / linux for the first time. All three are
new to me.
I am running pgpool-HA using pacemaker-corosync.
I am getting the following error in my setup. Does anybody have any idea
about the warnings in the logs below? Any type of help would be
appreciated a lot...
Jul 24 18:40:41 squarepant attrd: [3772]: info: attrd_perform_update:
Sent update 1010: probe_complete=true
Jul 24 18:40:41 squarepant cib: [3770]: WARN: cib_process_request:
Operation complete: op cib_modify for section status
(origin=local/attrd/1010, version=0.52.352): The object/attribute does
not exist (rc=-22)
Jul 24 18:40:41 squarepant attrd: [3772]: WARN: attrd_cib_callback:
Update 1010 for probe_complete=true failed: The object/attribute does
not exist
Jul 24 18:40:41 squarepant pgpool[7635]: INFO: pgpoolRA: request stop,
but not running.
Jul 24 18:40:41 squarepant crmd: [3896]: info: process_lrm_event: LRM
operation pgpool_stop_0 (call=339, rc=0, cib-update=472, confirmed=true) ok
Jul 24 18:40:41 squarepant crmd: [3896]: info: match_graph_event: Action
pgpool_stop_0 (2) confirmed on squarepant (rc=0)
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_rsc_command:
Initiating action 12: start pgpool_start_0 on squarepant (local)
Jul 24 18:40:41 squarepant crmd: [3896]: info: do_lrm_rsc_op: Performing
key=12:112:0:6d68acf3-ab99-409f-b686-8533e4b24ca0 op=pgpool_start_0 )
Jul 24 18:40:41 squarepant lrmd: [3771]: info: rsc:pgpool:340: start
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_pseudo_action: Pseudo
action 5 fired and confirmed
The output of the crm_mon command is:
============
Last updated: Sun Jul 24 18:15:59 2011
Stack: Heartbeat
Current DC: squarepant (85f4f3d6-650e-4620-8cbb-edb1bc9d389c) -
partition with quorum
Version: 1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
1 Nodes configured, unknown expected votes
3 Resources configured.
============
Online: [ squarepant ]
DBIP (ocf::heartbeat:IPaddr2): Started squarepant
postgresql (ocf::heartbeat:pgsql): Started squarepant
pgpool (ocf::heartbeat:pgpool): Started squarepant FAILED
Failed actions:
pgpool_monitor_30000 (node=squarepant, call=1874, rc=7,
status=complete): not running
Thanks & Regards
Sanjay
I did the SR procedure again, still no luck.
Is it normal that after starting postgresql on the slave, the first line of the log is
"database system was interrupted; last known up at 2011-07-24 10:53:38 CDT"?
4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG: database
system was interrupted; last known up at 2011-07-24 10:53:38 CDT
4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG: entering
standby mode
4762 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres
[local]FATAL: the database system is starting up
4761 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT @ LOG: streaming
replication successfully connected to primary
4764 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres
10.28.53.11(53442)FATAL: the database system is starting up
4770 2011-07-24 10:56:00 CDT 2011-07-24 10:56:00 CDT postgres@postgres
[local]FATAL: the database system is starting up
4802 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT postgres@postgres
[local]FATAL: the database system is starting up
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG: redo
starts at 57/6B002028
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG: invalid
record length at 57/6B20E010
4761 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG: invalid
magic number 0000 in log file 87, segment 107, offset 2490368
4847 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres
[local]FATAL: the database system is starting up
4850 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres
10.28.53.11(53443)FATAL: the database system is starting up
4851 2011-07-24 10:56:03 CDT 2011-07-24 10:56:03 CDT postgres@postgres
[local]FATAL: the database system is starting up
4860 2011-07-24 10:56:04 CDT 2011-07-24 10:56:04 CDT postgres@postgres
[local]FATAL: the database system is starting up
4865 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT postgres@postgres
[local]FATAL: the database system is starting up
4859 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT @ LOG: streaming
replication successfully connected to primary
4874 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT postgres@postgres
[local]FATAL: the database system is starting up
4869 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT
postgres@template1 10.28.53.11(53444)FATAL: the database system is
starting up
4879 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT postgres@postgres
[local]FATAL: the database system is starting up
4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG: invalid
record length at 57/6B2BA010
4859 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG: invalid
magic number 0000 in log file 87, segment 107, offset 2883584
4887 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT postgres@postgres
[local]FATAL: the database system is starting up
4888 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT @ LOG: streaming
replication successfully connected to primary
4892 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT postgres@postgres
[local]FATAL: the database system is starting up
4896 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT
postgres@template1 10.28.53.11(53445)FATAL: the database system is
starting up
4901 2011-07-24 10:56:10 CDT 2011-07-24 10:56:10 CDT postgres@postgres
[local]FATAL: the database system is starting up
4906 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT postgres@postgres
[local]FATAL: the database system is starting up
4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG: invalid
record length at 57/6B486010
4888 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG: invalid
magic number 0000 in log file 87, segment 107, offset 4849664
On 24.7.2011 14:46, Yan Chunlu wrote:
checkpoint_segments = 64
wal_keep_segments = 128
This information alone is not sufficient - we don't know how much write
activity there is on the primary system, so we can't say if those numbers are
sufficient or not. You have to tune them according to write activity on
the primary server.
For example let's suppose the current WAL segment on the primary is "1"
and that it's configured with wal_keep_segments = 5 (i.e. about 80MB of
data).
Before you prepare and start the slave machine, someone writes 100MB of
data to the primary database (one big insert/update or a lot of small
ones, doesn't matter). 100MB is about 6 WAL segments, so the current WAL
segment on the primary is 7, and because of wal_keep_segments there are
segments 3,4,5,6,7 available.
But when the slave connects, it asks for segment no. 2 and it's not
available. It's not possible to skip that segment so the replication
fails to start.
If the primary had only received 60MB of data, it would probably have worked (there'd
be enough segments kept on the primary).
Those 128 segments are about 2GB of data. How much data is written on the
primary between creating a filesystem copy and starting the slave?
You don't need to keep the files on the master, you can set up archiving
and keep them somewhere else (on a different system etc.).
Tomas
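The arithmetic in this example can be sketched as follows (Python). This is a deliberately simplified model that follows the example above: 16MB segments, and the primary keeping exactly the newest wal_keep_segments files. A real server may keep more than that, and the function names are made up for illustration.

```python
SEGMENT_MB = 16  # size of one WAL segment, as in the example above


def current_segment(start_segment, written_mb):
    """Segment being written after written_mb of WAL, counting whole segments filled."""
    return start_segment + written_mb // SEGMENT_MB


def segments_available(current, wal_keep_segments):
    """Simplified model: only the newest wal_keep_segments segments remain."""
    oldest = max(1, current - wal_keep_segments + 1)
    return list(range(oldest, current + 1))


def standby_can_start(needed, start_segment, written_mb, wal_keep_segments):
    """True if the segment the standby asks for is still on the primary."""
    current = current_segment(start_segment, written_mb)
    return needed in segments_available(current, wal_keep_segments)


if __name__ == "__main__":
    print(segments_available(current_segment(1, 100), 5))  # [3, 4, 5, 6, 7]
    print(standby_can_start(2, 1, 100, 5))  # False: segment 2 is already gone
    print(standby_can_start(2, 1, 60, 5))   # True: segments 1..4 are still there
    print(128 * SEGMENT_MB)                 # 2048 (MB), i.e. about 2GB
```

With wal_keep_segments = 128 the same model tolerates roughly 2GB of WAL written between taking the copy and starting the slave.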
checkpoint_segments = 64
wal_keep_segments = 128
These settings seem to allow for about 5GB of WAL; I don't think I would
ever write 5GB of data while the rsync is in progress.
I think the problem is still the "invalid record length" and "invalid
magic number" messages; they start showing right after I complete the data sync and
start the slave. If I stop the slave later and restart it, then yes, it may report
that an xlog file was not found and fail to catch up with the master. But why the
"invalid" messages in the first place?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general