Streaming replication does not work across datacenters with 20ms latency?

Started by Yan Chunlu, almost 15 years ago, 34 messages, general

springrider@gmail.com

almost 15 years ago

I am running postgresql streaming replication, which works fine when the
two machines are in the same datacenter. Recently I tried to deploy a
new slave in a different datacenter; the latency between the master and
the slave is 20ms.
Below is the related configuration.
Both master and slave have the following settings:
hot_standby = on
wal_level = hot_standby
max_wal_senders = 5

checkpoint_segments = 64
wal_keep_segments = 128
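
For reference, the amount of WAL these settings keep on the master can be estimated with a short calculation. This is only a sketch: the 16 MB segment size is the PostgreSQL default and is an assumption here (the post does not state it), and `max_sync_seconds` is a hypothetical helper, not something from PostgreSQL or pgpool.

```python
# Assumption: default WAL segment size of 16 MB (not stated in the post).
WAL_SEGMENT_MB = 16


def retained_wal_mb(wal_keep_segments, segment_mb=WAL_SEGMENT_MB):
    """Minimum amount of WAL (in MB) the master keeps in pg_xlog for standbys."""
    return wal_keep_segments * segment_mb


def max_sync_seconds(wal_keep_segments, wal_mb_per_second):
    """Hypothetical helper: how long the initial copy may take before the
    master could recycle a segment the new slave still needs, given an
    assumed steady WAL generation rate on the master."""
    return retained_wal_mb(wal_keep_segments) / wal_mb_per_second


if __name__ == "__main__":
    # wal_keep_segments = 128 from the configuration above
    print(retained_wal_mb(128), "MB of WAL retained")
```

If copying the 30G data dir over the slower link takes longer than the master needs to write that much WAL, the slave may ask for a segment that is already gone; whether that applies here is not established by the post.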

I am using pgpool to automate the setup, but the method is similar to
the one described here:
http://wiki.postgresql.org/wiki/Streaming_Replication

The data dir is about 30G. I have tried many times, but every time,
after the sync finishes and the slave is started, postgresql just hangs
with the error messages attached below, and any attempt to connect
returns the error "psql: FATAL: the database system is starting up".

The strange part is that with the same configuration, other slaves in
the same datacenter work fine...

What do "invalid record length" and "invalid magic number" normally
mean? Is the xlog corrupted?
Thanks for any further help!
The log messages at debug5 level look like this (just clips; I can
upload the full log file if necessary):

17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG: database
system was interrupted; last known up at 2011-07-23 07:07:57 CDT
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: forked
new backend, pid=17998 socket=8
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: forked
new backend, pid=17999 socket=8
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]FATAL: the database system is starting up
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: shmem_exit(1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: proc_exit(1): 1 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: exit(1)
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: shmem_exit(-1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG: proc_exit(-1): 0 callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: server
process (PID 17999) exited with exit code 1
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)FATAL: the database system is
starting up
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: shmem_exit(1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: proc_exit(1): 1 callbacks
to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: exit(1)
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: shmem_exit(-1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG: proc_exit(-1): 0
callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: server
process (PID 17998) exited with exit code 1
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
standby_mode = 'on'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
primary_conninfo = 'host=jefferson port=5432 user=postgres'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
trigger_file = '/var/log/pgpool/trigger/trigger_file1'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG: entering
standby mode
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: could
not open file "pg_xlog/0000000300000054000000DB" (log file 84, segment
219): No such file or directory

17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: record
known xact 36933672 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: remove
KnownAssignedXid 36933672
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: record
known xact 36933674 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog
redo insert: rel 1663/16386/17404; tid 18378/37
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ LOG: invalid
record length at 54/DDFE4010

17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: remove
KnownAssignedXid 36929085
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog
redo commit: 2011-07-23 06:33:29.760915-05
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: record
known xact 36929100 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog
redo insert: rel 1663/16386/16436; tid 88370/2
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: record
known xact 36929109 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog
redo insert: rel 1663/16386/16436; tid 88370/3
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ LOG: invalid
magic number 0000 in log file 84, segment 219, offset 7733248
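
To relate the positions in these messages to files under pg_xlog, the following sketch converts an xlog location such as 54/DDFE4010 into a segment number, byte offset and file name. It assumes the default 16 MB segment size and the three-part file naming (timeline, log file, segment, each as 8 hex digits) seen in the "could not open file" message above; timeline 3 is taken from that file name. The function names are illustrative only.

```python
# Assumption: default WAL segment size of 16 MB.
SEGMENT_SIZE = 16 * 1024 * 1024


def parse_location(location):
    """Split an xlog location like '54/DDFE4010' into
    (log file id, segment number, byte offset within the segment)."""
    log_hex, offset_hex = location.split("/")
    log_id = int(log_hex, 16)
    record_offset = int(offset_hex, 16)
    return log_id, record_offset // SEGMENT_SIZE, record_offset % SEGMENT_SIZE


def segment_file_name(timeline, log_id, segment):
    """Build the pg_xlog file name: timeline, log file and segment,
    each written as 8 upper-case hex digits."""
    return "%08X%08X%08X" % (timeline, log_id, segment)


if __name__ == "__main__":
    # "invalid record length at 54/DDFE4010"
    log_id, segment, offset = parse_location("54/DDFE4010")
    print(log_id, segment, offset, segment_file_name(3, log_id, segment))

    # "invalid magic number 0000 in log file 84, segment 219, offset 7733248"
    print(segment_file_name(3, 84, 219))
```

Under these assumptions, the "invalid record length" position falls in log file 84, segment 221, while the "invalid magic number" message refers to segment 219 of the same log file, which is the file the slave first tried to open. Comparing those file names with the contents of pg_xlog on the master and on the slave is one way to check whether the segments were copied completely.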