Unexpected trouble from pg_basebackup

Started by otheus uibkover 9 years ago4 messagesgeneral

otheus.uibk@gmail.com

over 9 years ago

I recently updated my systems from pg 9.1.8 to 9.5.3. A pg_dumpall was used
to migrate the data. Now I'm trying to re-establish replication between
master and slave. I'm getting stuck.

When I run pg_basebackup (via a script which worked flawlessly on 9.1.8,
AND via command line, ala "manual mode"), it creates the PGDATA directory
structure, some files, but soon gets stuck. It writes fewer than 40 MB of a
20GB database. Further, using tcpdump to watch packets on the wire, I
confirm no data is going across.

However, it's clear a connection has been made and the process was started.
On the master, I see the process table:

postgres: wal sender process pgsync A.B.C.D(42821) sending backup

Meanwhile, on the client, I see no output updates. Here's the command:

pg_basebackup -x -P -v -D $PGDATA -w

PGUSER, PGPASSWORD and PGHOST are set accordingly. The User is a user with
the REPLICATION attribute. Just in case, I dropped and re-created the user.

So that's question A.

Question B is related. In attempting to verify that the permissions and HBA
were set correctly, I attempted to do a pg_dump using the same PGUSER and
PGHOST. What I got surprised me:

$ pg_dump -s onyxeditor
pg_dump: [archiver (db)] query failed: ERROR: permission denied for
relation licence
pg_dump: [archiver (db)] query was: LOCK TABLE public.licence IN ACCESS
SHARE MODE

This is the replication user. Other schemas worked just fine, but not this
one. Is this expected behavior?

Kind regards,
Otheus

--
Otheus
otheus.uibk@gmail.com
otheus.shelling@uibk.ac.at

otheus uibk

otheus.uibk@gmail.com

over 9 years ago

In reply to: otheus uibk (#1)

Re: Unexpected trouble from pg_basebackup

After a 3 to 4 minute delay, pg_basebackup started doing it's thing and
finished within a few minutes. So now the question is: why the startup
delay?

Magnus Hagander

magnus@hagander.net

over 9 years ago

In reply to: otheus uibk (#2)

Re: Unexpected trouble from pg_basebackup

On Tue, Oct 4, 2016 at 10:42 PM, otheus uibk <otheus.uibk@gmail.com> wrote:

After a 3 to 4 minute delay, pg_basebackup started doing it's thing and
finished within a few minutes. So now the question is: why the startup
delay?

Sounds to me like it's doing a CHECKPOINT with spreading, which make it
take time. Try with "-c fast" and see if the problem goes away.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Magnus Hagander

magnus@hagander.net

over 9 years ago

In reply to: otheus uibk (#1)

Re: Unexpected trouble from pg_basebackup

On Wed, Oct 5, 2016 at 3:55 PM, otheus uibk <otheus.uibk@gmail.com> wrote:

On Tue, Oct 4, 2016 at 10:49 PM, Magnus Hagander <magnus@hagander.net>
wrote:

On Tue, Oct 4, 2016 at 10:42 PM, otheus uibk <otheus.uibk@gmail.com>
wrote:

After a 3 to 4 minute delay, pg_basebackup started doing it's thing and
finished within a few minutes. So now the question is: why the startup
delay?

Sounds to me like it's doing a CHECKPOINT with spreading, which make it
take time. Try with "-c fast" and see if the problem goes away.

Maybe not too far off.

2016-10-04 17:43:40.620 GMT 57eb90a0.6e07
402 00000 LOG: checkpoint complete: wrote 12799 buffers (1.0%); 0
transaction log file(s) added, 0 removed, 5 recycled; write=1389.348 s,
sync=0.033 s, total=1389.400 s; sync files=240, longest=0.003 s,
average=0.000 s; distance=92915 kB, estimate=129373 kB
2016-10-04 18:20:31.714 GMT 57eb90a0.6e07
403 00000 LOG: checkpoint starting: time
2016-10-04 18:39:19.870 GMT 57eb90a0.6e07
404 00000 LOG: checkpoint complete: wrote 10265 buffers (0.8%); 0
transaction log file(s) added, 0 removed, 4 recycled; write=1128.118 s,
sync=0.023 s, total=1128.155 s; sync files=190, longest=0.002 s,
average=0.000 s; distance=73419 kB, estimate=123778 kB
2016-10-04 19:07:12.647 GMT [unknown] pgsync 57f3fde0.52e3 2
00000 LOG: replication connection authorized: user=pgsync
2016-10-04 19:07:12.703 GMT 57eb90a0.6e07
405 00000 LOG: checkpoint starting: force wait
2016-10-04 19:20:32.879 GMT [unknown] pgsync 57f40100.5891 2
00000 LOG: replication connection authorized: user=pgsync
2016-10-04 19:23:05.249 GMT 57eb90a0.6e07
406 00000 LOG: checkpoint complete: wrote 8638 buffers (0.7%); 0
transaction log file(s) added, 0 removed, 5 recycled; write=952.514 s,
sync=0.016 s, total=952.546 s; sync files=238, longest=0.002 s,
average=0.000 s; distance=68257 kB, estimate=118226 kB
2016-10-04 19:23:05.249 GMT 57eb90a0.6e07
407 00000 LOG: checkpoint starting: force wait
2016-10-04 19:28:52.232 GMT 57eb90a0.6e07
408 00000 LOG: checkpoint complete: wrote 3102 buffers (0.2%); 0
transaction log file(s) added, 0 removed, 0 recycled; write=346.957 s,
sync=0.018 s, total=346.982 s; sync files=149, longest=0.002 s,
average=0.000 s; distance=36016 kB, estimate=110005 kB
2016-10-04 19:30:31.922 GMT [unknown] pgsync 57f40357.5c70 2
00000 LOG: replication connection authorized: user=pgsync
2016-10-04 19:30:31.932 GMT 57eb90a0.6e07
409 00000 LOG: checkpoint starting: force wait
2016-10-04 19:33:40.857 GMT 57eb90a0.6e07
410 00000 LOG: checkpoint complete: wrote 1763 buffers (0.1%); 0
transaction log file(s) added, 0 removed, 1 recycled; write=188.886 s,
sync=0.030 s, total=188.924 s; sync files=125, longest=0.004 s,
average=0.000 s; distance=13135 kB, estimate=100318 kB

OK, so what was happening is that the checkpoints were taking 5 to 15
minutes minutes, and aborted basebackups were triggering new checkpoints
which waited on the previous ones.

Is it possible the new environment is not high-performance enough??

possibly relevant configuration change:
checkpoint_timeout=1h

pg_basebackup will always send in a checkpoint. PostgreSQL will time that
to take approximately checkpoint_timeout * checkpoint_completion_target
time. So with checkpoint_timeout set to a very high value, it will by
default target something like 30 minutes before it even gets started. The
only reason it takes as *little* as it does is that your system is lightly
loaded.

You can do a fast checkpoint with the parameter I suggested before, or you
can tune your checkpoint_timeout to be something that gives you a more
reasonable time.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Import Notes

Reply to msg id not found: CALbQNd2nNzjaL4D2hNWkCaU=0CRJrShkmW_CSvdUKT=S7ZDi=Q@mail.gmail.com