pg_basebackup: return value 1: reason?
Hello,
I tried to run pg_basebackup. Return value is 1.
How to find out its reason?
(I suspect that some wal after backup is missing- but how to find out the
real reason? How to fix it?)
thanks, Andrej
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit)
I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were issued
from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D
${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr >>
${LOG_FILE}
# echo $?
1 <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed <------------------no
reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' ->
`../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf.save" <---------------recorded in
pg_log on master node and copied by pg_basebackup (note time difference
between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG: database system was interrupted;
last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL: the database
system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG: entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG: database system was not properly
shut down; automatic recovery in progress <---------something missing from
pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG: redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG: consistent recovery state reached
at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG: database system is ready to accept
read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG: started streaming WAL from primary
at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D
${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
# echo $?
1
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
<----what does this mean? I think it failed to start process to fetching
wal logs created during backup: but neither on master node neither on
pg_basebackup output here is any information about reason..
(max_wal_senders on master is 10: I see no reason to fail).
postgres logs:
`pg_xlog/0000000100000000000000E4' ->
`../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG: database system was interrupted;
last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL: the database
system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG: entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG: database system was not properly
shut down; automatic recovery in progress <------------this means
something missing from pg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG: redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG: consistent recovery state reached
at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG: database system is ready to accept
read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG: started streaming WAL from primary
at 0/E6000000 on timeline 1
postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on
On 04/15/2016 03:28 PM, Andrej Vanek wrote:
Hello,
I tried to run pg_basebackup. Return value is 1.
How to find out its reason?
(I suspect that some wal after backup is missing- but how to find
out the real reason? How to fix it?)
First it is not clear to me where you are taking the backup from, the
master or the standby?
Second there is a lot of redirection going on. What happens if you run
the pg_basebackup directly (without doing su - postgres ...) and use
hardcoded values instead of shell variables?
thanks, Andrej
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit)I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were
issued from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr${LOG_FILE}
# echo $?
1 <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed <------------------no
reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' ->
`../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf.save" <---------------recorded in
pg_log on master node and copied by pg_basebackup (note time difference
between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG: database system was
interrupted; last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL: the database
system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG: entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG: database system was not
properly shut down; automatic recovery in progress <---------something
missing from pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG: redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG: consistent recovery state
reached at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG: database system is ready to
accept read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG: started streaming WAL from
primary at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
# echo $?
1
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
<----what does this mean? I think it failed to start process to
fetching wal logs created during backup: but neither on master node
neither on pg_basebackup output here is any information about reason..
(max_wal_senders on master is 10: I see no reason to fail).postgres logs:
`pg_xlog/0000000100000000000000E4' ->
`../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG: database system was
interrupted; last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL: the database
system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG: entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG: database system was not
properly shut down; automatic recovery in progress <------------this
means something missing from pg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG: redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG: consistent recovery state
reached at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG: database system is ready to
accept read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG: started streaming WAL from
primary at 0/E6000000 on timeline 1postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on
--
Adrian Klaver
adrian.klaver@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Andrej Vanek <andrej.vanek.sk@gmail.com> writes:
Hello,Â
I tried to run pg_basebackup. Return value is 1.
How to find out its reason?
(I suspect that some wal after backup is missing- but how to find out the real reason? How to fix it?)
Well there are more than 80 cases of exit 1 in a couple .c files and
most/all are preceeded unsurprisingly with an fprintf(stderr.
Are you not getting something informative on your display and/or sending
stderr to somewher else?
$ echo $*
./src/bin/pg_basebackup/pg_basebackup.c ./src/backend/replication/basebackup.c
$ grep -h -B 4 -i 'exit.*(1' $* | egrep -hi 'fprintf\(stderr|exit'
fprintf(stderr, _("%s: directory name too long\n"), progname);
exit(1);
fprintf(stderr, _("%s: multiple \"=\" signs in tablespace mapping\n"), progname);
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: old directory is not an absolute path in tablespace mapping: %s\n"),
exit(1);
fprintf(stderr, _("%s: new directory is not an absolute path in tablespace mapping: %s\n"),
exit(1);
fprintf(stderr, _("%s: could not read from ready pipe: %s\n"),
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create background process: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create background thread: %s\n"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not access directory \"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: transfer rate must be greater than zero\n"),
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not write to file \"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get COPY data stream: %s"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not read COPY data: %s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get COPY data stream: %s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not read COPY data: %s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: invalid tar block header size: %d\n"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not write to file \"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
disconnect_and_exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
disconnect_and_exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create file \"%s\": %s\n"), progname, filename, strerror(errno));
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
exit(1);
fprintf(stderr, _("%s: incompatible server version %s\n"),
disconnect_and_exit(1);
disconnect_and_exit(1);
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not initiate base backup: %s"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get backup header: %s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: no data returned from server\n"), progname);
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: final receive failed: %s"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not wait for child process: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: child %d died, expected %d\n"),
disconnect_and_exit(1);
if (!WIFEXITED(status))
fprintf(stderr, _("%s: child process did not exit normally\n"),
disconnect_and_exit(1);
if (WEXITSTATUS(status) != 0)
fprintf(stderr, _("%s: child process exited with error %d\n"),
progname, WEXITSTATUS(status));
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not wait for child thread: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get child thread exit status: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: child thread exited with error %u\n"),
disconnect_and_exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: invalid compression level \"%s\"\n"),
exit(1);
fprintf(stderr, _("%s: invalid checkpoint argument \"%s\", must be \"fast\" or \"spread\"\n"),
exit(1);
fprintf(stderr, _("%s: invalid status interval \"%s\"\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
exit(1);
fprintf(stderr, _("%s: no target directory specified\n"), progname);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
exit(1);
fprintf(stderr, _("%s: transaction log directory location must be "
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: could not create symbolic link \"%s\": %s\n"),
exit(1);
fprintf(stderr, _("%s: symlinks are not supported on this platform\n"));
exit(1);
thanks, AndrejÂ
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit)I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were issued from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr  >> ${LOG_FILE}
# echo $?
1 Â Â Â Â Â Â <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING: Â skipping special file "./pg_hba.conf"
WARNING: Â skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed       <------------------no reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' -> `../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: Â skipping special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: Â skipping special file "./pg_hba.conf.save" Â Â Â Â <---------------recorded in pg_log on master node and
copied by pg_basebackup (note time difference between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG: Â database system was interrupted; last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL: Â the database system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG: Â entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG: Â database system was not properly shut down; automatic recovery in progress <---------something missing from pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG: Â redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG: Â consistent recovery state reached at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG: Â database system is ready to accept read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG: Â started streaming WAL from primary at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"Â
# echo $?
1
#Â Â cat /var/log/cluster/geo_repair.log.errÂ
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING: Â skipping special file "./pg_hba.conf"
WARNING: Â skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes     <----what does this mean? I think it failed to start process to fetching wal logs created
during backup: but neither on master node neither on pg_basebackup output here is any information about reason.. (max_wal_senders on master is 10: I see no reason to
fail).postgres logs:
`pg_xlog/0000000100000000000000E4' -> `../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: Â skipping special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: Â skipping special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG: Â database system was interrupted; last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL: Â the database system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG: Â entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG:  database system was not properly shut down; automatic recovery in progress  <------------this means something missing from
pg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG: Â redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG: Â consistent recovery state reached at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG: Â database system is ready to accept read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG: Â started streaming WAL from primary at 0/E6000000 on timeline 1postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on
--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello,
my setup is:
1 master
1 synchronous slave (up and running)
2 asynchronous slave (up and running)
Now I'm setting up asynchronous slave: first step of this setup is
pg_basebackup connecting to master.
Indirections are because it is encapsulated in a script. This script is
aimed to serve for automated replication recovery. This script is launched
by crm_mon -d daemon.
You are right to check without indirections: now I tried to run the command
directly from command line without variables- it works fine...
This means: no problem in pg_basebackup itself
but some problem in environment of process launched from crm_mon daemon.
Thanks for your hint (remove indirections). Pg_basebackup works fine when
launched from command-line..
Unfortenutely I have still no clue how to solve the issue- su works. I have
to find out why pg_basebackup cannot fork when launched from crm_mon.
Best Regards, Andrej
2016-04-16 1:17 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com>:
Show quoted text
On 04/15/2016 03:28 PM, Andrej Vanek wrote:
Hello,
I tried to run pg_basebackup. Return value is 1.
How to find out its reason?
(I suspect that some wal after backup is missing- but how to find
out the real reason? How to fix it?)First it is not clear to me where you are taking the backup from, the
master or the standby?Second there is a lot of redirection going on. What happens if you run the
pg_basebackup directly (without doing su - postgres ...) and use hardcoded
values instead of shell variables?thanks, Andrej
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit)I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were
issued from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr${LOG_FILE}
# echo $?
1 <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed <------------------no
reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' ->
`../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf.save" <---------------recorded in
pg_log on master node and copied by pg_basebackup (note time difference
between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG: database system was
interrupted; last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL: the database
system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG: entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG: database system was not
properly shut down; automatic recovery in progress <---------something
missing from pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG: redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG: consistent recovery state
reached at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG: database system is ready to
accept read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG: started streaming WAL from
primary at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
# echo $?
1
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
<----what does this mean? I think it failed to start process to
fetching wal logs created during backup: but neither on master node
neither on pg_basebackup output here is any information about reason..
(max_wal_senders on master is 10: I see no reason to fail).postgres logs:
`pg_xlog/0000000100000000000000E4' ->
`../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG: database system was
interrupted; last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL: the database
system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG: entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG: database system was not
properly shut down; automatic recovery in progress <------------this
means something missing from pg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG: redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG: consistent recovery state
reached at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG: database system is ready to
accept read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG: started streaming WAL from
primary at 0/E6000000 on timeline 1postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on--
Adrian Klaver
adrian.klaver@aklaver.com
Hello,
Are you not getting something informative on your display and/or sending
stderr to somewher else?
- in case -X fetch no informative output
- in case -X stream following:
pg_basebackup: could not wait for child process: No child processes
Now after re-test directly from command-line it works. It fails just when
launched from a script fired by crm_mon -d -E my-script
Regards, Andrej
2016-04-16 1:18 GMT+02:00 Jerry Sievers <gsievers19@comcast.net>:
Show quoted text
Andrej Vanek <andrej.vanek.sk@gmail.com> writes:
Hello,
I tried to run pg_basebackup. Return value is 1.
How to find out its reason?
(I suspect that some wal after backup is missing- but how to findout the real reason? How to fix it?)
Well there are more than 80 cases of exit 1 in a couple .c files and
most/all are preceeded unsurprisingly with an fprintf(stderr.Are you not getting something informative on your display and/or sending
stderr to somewher else?$ echo $*
./src/bin/pg_basebackup/pg_basebackup.c
./src/backend/replication/basebackup.c$ grep -h -B 4 -i 'exit.*(1' $* | egrep -hi 'fprintf\(stderr|exit'
fprintf(stderr, _("%s: directory name too
long\n"), progname);
exit(1);
fprintf(stderr, _("%s: multiple \"=\"
signs in tablespace mapping\n"), progname);
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: old directory is not an absolute
path in tablespace mapping: %s\n"),
exit(1);
fprintf(stderr, _("%s: new directory is not an absolute
path in tablespace mapping: %s\n"),
exit(1);
fprintf(stderr, _("%s: could not read from
ready pipe: %s\n"),
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create background
process: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create background thread:
%s\n"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not access directory
\"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: transfer rate must be greater than
zero\n"),
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not write to file
\"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create file
\"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get COPY data stream:
%s"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not read COPY data:
%s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get COPY data stream:
%s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not read COPY data:
%s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: invalid tar block
header size: %d\n"),
disconnect_and_exit(1);
fprintf(stderr,disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create
file \"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not write to
file \"%s\": %s\n"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
disconnect_and_exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
disconnect_and_exit(1);
fprintf(stderr, _("%s: out of memory\n"), progname);
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not create file \"%s\":
%s\n"), progname, filename, strerror(errno));
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
exit(1);
fprintf(stderr, _("%s: incompatible server version %s\n"),
disconnect_and_exit(1);
disconnect_and_exit(1);
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not send replication command
\"%s\": %s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not initiate base backup:
%s"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get backup header: %s"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: no data returned from server\n"),
progname);
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: final receive failed: %s"),
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not wait for child
process: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: child %d died, expected
%d\n"),
disconnect_and_exit(1);
if (!WIFEXITED(status))
fprintf(stderr, _("%s: child process did not exit
normally\n"),
disconnect_and_exit(1);
if (WEXITSTATUS(status) != 0)
fprintf(stderr, _("%s: child process exited with
error %d\n"),
progname, WEXITSTATUS(status));
disconnect_and_exit(1);
fprintf(stderr,
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not wait for child
thread: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: could not get child thread
exit status: %s\n"),
disconnect_and_exit(1);
fprintf(stderr, _("%s: child thread exited with
error %u\n"),
disconnect_and_exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: invalid
compression level \"%s\"\n"),
exit(1);
fprintf(stderr, _("%s: invalid
checkpoint argument \"%s\", must be \"fast\" or \"spread\"\n"),
exit(1);
fprintf(stderr, _("%s: invalid
status interval \"%s\"\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for
more information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more
information.\n"),
exit(1);
fprintf(stderr, _("%s: no target directory specified\n"),
progname);
fprintf(stderr, _("Try \"%s --help\" for more
information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more
information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more
information.\n"),
exit(1);
fprintf(stderr, _("Try \"%s --help\" for more
information.\n"),
exit(1);
fprintf(stderr, _("%s: transaction log directory
location must be "
fprintf(stderr, _("Try \"%s --help\" for more
information.\n"),
exit(1);
fprintf(stderr,
exit(1);
fprintf(stderr, _("%s: could not create symbolic
link \"%s\": %s\n"),
exit(1);
fprintf(stderr, _("%s: symlinks are not supported on this
platform\n"));
exit(1);thanks, Andrej
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.720120313 (Red Hat 4.4.7-16), 64-bit)
I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were
issued from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr >>
${LOG_FILE}# echo $?
1 <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed <------------------noreason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' ->`../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping
special file "./pg_hba.conf.save" <---------------recorded in
pg_log on master node andcopied by pg_basebackup (note time difference between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG: database system wasinterrupted; last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL: the database
system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG: entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG: database system was notproperly shut down; automatic recovery in progress <---------something
missing from pg_basebackup2016-04-15 23:15:03 CEST:@:[23321] LOG: redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG: consistent recovery statereached at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG: database system is ready to
accept read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG: started streaming WAL from
primary at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
# echo $?
1
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes<----what does this mean? I think it failed to start process to fetching
wal logs createdduring backup: but neither on master node neither on pg_basebackup
output here is any information about reason.. (max_wal_senders on master is
10: I see no reason tofail).
postgres logs:
`pg_xlog/0000000100000000000000E4' ->`../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping
special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG: database system was
interrupted; last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL: the database
system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG: entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG: database system was notproperly shut down; automatic recovery in progress <------------this means
something missing frompg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG: redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG: consistent recovery statereached at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG: database system is ready to
accept read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG: started streaming WAL from
primary at 0/E6000000 on timeline 1
postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
On 04/16/2016 02:24 PM, Andrej Vanek wrote:
Hello,
my setup is:
1 master
1 synchronous slave (up and running)
2 asynchronous slave (up and running)
Now I'm setting up asynchronous slave: first step of this setup is
pg_basebackup connecting to master.Indirections are because it is encapsulated in a script. This script is
aimed to serve for automated replication recovery. This script is
launched by crm_mon -d daemon.You are right to check without indirections: now I tried to run the
command directly from command line without variables- it works fine...This means: no problem in pg_basebackup itself
but some problem in environment of process launched from crm_mon daemon.
Thanks for your hint (remove indirections). Pg_basebackup works fine
when launched from command-line..
Unfortenutely I have still no clue how to solve the issue- su works. I
Is the su - even necessary?
pg_basebackup is a Postgres client program you can specify the user you
want it to connect to using -U.
Or do you need the script to run as postgres in order to get permissions
on wherever you are creating the backup directory?
have to find out why pg_basebackup cannot fork when launched from crm_mon.
I assume crm_mon is this:
http://linux.die.net/man/8/crm_mon
from Pacemaker.
I do not use Pacemaker, but I am pretty sure that running what is a
monitoring program in daemon mode and then shelling out to another
program is not workable. The docs seem to bear this out:
http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation
Best Regards, Andrej
2016-04-16 1:17 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>>:
--
Adrian Klaver
adrian.klaver@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello Adrian,
I tried to use -U without "su"- launched directly by root: same behaviour.
Finally I reverted my script to use standard backup (pg_start_backup;
rsync; pg_stop_backup)- this works- the only downside is possible
collisions with on-line backup/synchronizaiton of other two nodes on master
node...
Back to the pg_basebackup issue: it is clear to me that this is an issue of
environment which launched pg_basebackup.
Possibly either some privileges or some kernel parameters/limits. Who
knows?
Summary: clusterlab's crm_mon launched a shell script starting
pg_basebackup which fails to do some its work (pg_basebackup: could not
wait for child process: No child processes)- probably due to some failing
system call.
How can I report to clusterlabs: What system call fails in pg_basebackup?
Best Regards, Andrej
2016-04-17 1:09 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com>:
Show quoted text
Is the su - even necessary?
pg_basebackup is a Postgres client program you can specify the user you
want it to connect to using -U.Or do you need the script to run as postgres in order to get permissions
on wherever you are creating the backup directory?have to find out why pg_basebackup cannot fork when launched from crm_mon.
I assume crm_mon is this:
http://linux.die.net/man/8/crm_mon
from Pacemaker.
I do not use Pacemaker, but I am pretty sure that running what is a
monitoring program in daemon mode and then shelling out to another program
is not workable. The docs seem to bear this out:http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation
On 04/17/2016 12:13 PM, Andrej Vanek wrote:
Hello Adrian,
I tried to use -U without "su"- launched directly by root: same behaviour.
Finally I reverted my script to use standard backup (pg_start_backup;
rsync; pg_stop_backup)- this works- the only downside is possible
collisions with on-line backup/synchronizaiton of other two nodes on
master node...Back to the pg_basebackup issue: it is clear to me that this is an issue
of environment which launched pg_basebackup.
Possibly either some privileges or some kernel parameters/limits. Who
knows?
Summary: clusterlab's crm_mon launched a shell script starting
pg_basebackup which fails to do some its work (pg_basebackup: could not
wait for child process: No child processes)- probably due to some
failing system call.How can I report to clusterlabs: What system call fails in pg_basebackup?
All I can to do is point you at:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
Best Regards, Andrej
2016-04-17 1:09 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>>:Is the su - even necessary?
pg_basebackup is a Postgres client program you can specify the user
you want it to connect to using -U.Or do you need the script to run as postgres in order to get
permissions on wherever you are creating the backup directory?have to find out why pg_basebackup cannot fork when launched
from crm_mon.I assume crm_mon is this:
http://linux.die.net/man/8/crm_mon
from Pacemaker.
I do not use Pacemaker, but I am pretty sure that running what is a
monitoring program in daemon mode and then shelling out to another
program is not workable. The docs seem to bear this out:http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation
--
Adrian Klaver
adrian.klaver@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello,
I've given a try once again.
Two variants used in my script (launched by crm_mon):
1. /usr/pgsql-9.5/bin/pg_basebackup -U pgreplic -h db-other-site -w -D
/opt/geo_stdby_data -c fast -vvv -X stream &>> /tmp/log
2. strace -o /tmp/pg_basebackup.log /usr/pgsql-9.5/bin/pg_basebackup -U
pgreplic -h db-other-site -w -D /opt/geo_stdby_data -c fast -vvv -X stream
&>> /tmp/log
Result:
variant 2. works fine with return code 0 (with strace)
variant 1. fails with error code 1 (without strace)
Any ideas?
Andrej
----------------------details
Output:
Variant 2:
DEBUG: CommitTransaction
DEBUG: name: unnamed; blockState: STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: BASE_BACKUP LABEL 'pg_basebackup base
backup' FAST NOWAIT
-- Mon May 23 17:54:31 CEST 2016 [l1abrnch->l1abrnch:3122/27282:GEO]
--INFO-- l1abrnch->l1abrnch (GEO-STDBY-DB / stop: 0): target/returned 0/0
(OK)
transaction log start point: 0/FA000028 on timeline 1
pg_basebackup: starting background WAL receiver
DEBUG: CommitTransaction
DEBUG: name: unnamed; blockState: STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 0/FA000000 TIMELINE
1
WARNING: skipping special file "./pg_hba.conf"
DEBUG: standby "pg_basebackup" has now caught up with primary
DEBUG: write 0/FA000000 flush 0/0 apply 0/0
DEBUG: removing transaction log backup history file
"0000000100000000000000F8.00000028.backup"
transaction log end point: 0/FA0000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
RETVAL=0
Output
Variant 1:
DEBUG: CommitTransaction
DEBUG: name: unnamed; blockState: STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: BASE_BACKUP LABEL 'pg_basebackup base
backup' FAST NOWAIT
-- Mon May 23 17:55:32 CEST 2016 [l1abrnch->l1abrnch:3122/28785:GEO]
--INFO-- l1abrnch->l1abrnch (GEO-STDBY-DB / stop: 0): target/returned 0/0
(OK)
transaction log start point: 0/FC000028 on timeline 1
pg_basebackup: starting background WAL receiver
DEBUG: CommitTransaction
DEBUG: name: unnamed; blockState: STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 0/FC000000 TIMELINE
1
WARNING: skipping special file "./pg_hba.conf"
DEBUG: standby "pg_basebackup" has now caught up with primary
DEBUG: write 0/FC000000 flush 0/0 apply 0/0
DEBUG: removing transaction log backup history file
"0000000100000000000000FA.00000028.backup"
transaction log end point: 0/FC0000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
RETVAL=1
2016-04-18 16:12 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com>:
Show quoted text
On 04/17/2016 12:13 PM, Andrej Vanek wrote:
Hello Adrian,
I tried to use -U without "su"- launched directly by root: same behaviour.
Finally I reverted my script to use standard backup (pg_start_backup;
rsync; pg_stop_backup)- this works- the only downside is possible
collisions with on-line backup/synchronizaiton of other two nodes on
master node...Back to the pg_basebackup issue: it is clear to me that this is an issue
of environment which launched pg_basebackup.
Possibly either some privileges or some kernel parameters/limits. Who
knows?
Summary: clusterlab's crm_mon launched a shell script starting
pg_basebackup which fails to do some its work (pg_basebackup: could not
wait for child process: No child processes)- probably due to some
failing system call.How can I report to clusterlabs: What system call fails in pg_basebackup?
All I can to do is point you at:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
Best Regards, Andrej
2016-04-17 1:09 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>>:Is the su - even necessary?
pg_basebackup is a Postgres client program you can specify the user
you want it to connect to using -U.Or do you need the script to run as postgres in order to get
permissions on wherever you are creating the backup directory?have to find out why pg_basebackup cannot fork when launched
from crm_mon.I assume crm_mon is this:
http://linux.die.net/man/8/crm_mon
from Pacemaker.
I do not use Pacemaker, but I am pretty sure that running what is a
monitoring program in daemon mode and then shelling out to another
program is not workable. The docs seem to bear this out:http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation
--
Adrian Klaver
adrian.klaver@aklaver.com