No such file or directory in pg_replslot
I don't know if this applies only to pglogical or logical decoding in
general. This is on a 9.6.10 provider running pglogical 2.2.0. Subscriber
has same versions. We had a replication delay situation this morning,
which I think may have been due to a really long transaction but I've yet
to verify that.
I disabled and re-enabled replication and at one point, this created an
error on start_replication_slot that the pid was already active.
Somehow replication got wedged and now even though replication appears to
be working, strace shows these kinds of errors continually:
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F4000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F5000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F6000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F7000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F8000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F9000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FA000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FB000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FC000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FD000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-FE000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
Any suggestions? This is a showstopper for us.
Thank you,
Jeremy
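The file names in the strace output appear to encode a transaction id and a WAL position. The following is a minimal Python sketch for pulling those two values out of such a path, assuming the naming pattern visible in the trace (xid-<decimal xid>-lsn-<hex high>-<hex low>.snap). The helper name parse_snap_name is made up for illustration and is not part of PostgreSQL or pglogical.

```python
import re

# Pattern inferred from the strace output in this thread (an assumption):
#   xid-<decimal xid>-lsn-<hex high word>-<hex low word>.snap
SNAP_RE = re.compile(r"xid-(\d+)-lsn-([0-9A-F]+)-([0-9A-F]+)\.snap")


def parse_snap_name(path):
    """Return (xid, lsn) for a spill-file path, or None if it does not match."""
    m = SNAP_RE.search(path)
    if m is None:
        return None
    xid = int(m.group(1))
    # Combine the two hex words into a single 64-bit WAL position.
    lsn = (int(m.group(2), 16) << 32) | int(m.group(3), 16)
    return xid, lsn


# Two consecutive names copied from the trace.
names = [
    "pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F4000000.snap",
    "pg_replslot/pgl_foo_providerb97b25d_foo336ddc1/xid-1248981532-lsn-C940-F5000000.snap",
]
parsed = [parse_snap_name(n) for n in names]
step = parsed[1][1] - parsed[0][1]
print(parsed[0][0], hex(step), step // (1024 * 1024), "MiB")
# prints: 1248981532 0x1000000 16 MiB
```

All the names in the trace share one xid, and consecutive names are 16 MiB apart, which matches the default WAL segment size. That would be consistent with a single process probing one candidate file per WAL segment for the same transaction.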
On December 8, 2018 9:08:09 AM PST, Jeremy Finzel <finzelj@gmail.com> wrote:
> Any suggestions? This is a showstopper for us.
That doesn't indicate an error. You need to provide more details on what made you consider things wedged...
Andres
Thank you very much for the reply. We typically see no visible replication
delay over 5 minutes ever. Today we saw a delay of over 3 hours, and no
obvious increase in workload either on the provider or the subscriber. I
also did not see the LSN advancing whatsoever in terms of applying changes.
I first checked for long-running transactions on the master but there was
nothing too unusual except an ANALYZE which I promptly killed, but with no
improvement to the situation.
I found the messages above using strace after canceling the subscription
and finding that the process was taking extremely long to cancel. There
are 2.1 million files in pg_replslot which I don't think is normal? Any
ideas as to where I should be looking or what could cause this?
Thanks,
Jeremy
On Sat, Dec 8, 2018 at 1:21 PM Jeremy Finzel <finzelj@gmail.com> wrote:
> Any ideas as to where I should be looking or what could cause this?
I have very good news: after we waited it out for several hours, it resolved
itself. Thank you, your input steered us in the right direction!
Jeremy
On 12/8/18 8:21 PM, Jeremy Finzel wrote:
> There are 2.1 million files in pg_replslot which I don't think is
> normal? Any ideas as to where I should be looking or what could cause this?
Postgres spills changes to disk when you have a big transaction:
https://blog.anayrat.info/en/2018/03/10/logical-replication-internals/
You can monitor it with check_pgactivity's replication_slots service:
https://github.com/OPMDG/check_pgactivity/blob/master/check_pgactivity#L5664
(You have to use the master version; this feature has not been released yet.)
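For a quick look without any extra tooling, the spill files can also be counted directly on the provider. The sketch below is a rough stand-in, not what check_pgactivity does: it walks a pg_replslot directory and reports the number and total size of .snap files per slot and transaction id. The function names, the file-name pattern (taken from the strace output earlier in the thread) and the example path are assumptions.

```python
import os
import re
from collections import defaultdict

# File-name pattern assumed from the strace output in this thread.
SNAP_RE = re.compile(r"^xid-(\d+)-lsn-[0-9A-F]+-[0-9A-F]+\.snap$")


def spill_usage(replslot_dir):
    """Map (slot_name, xid) -> (file_count, total_bytes) for spill files."""
    usage = defaultdict(lambda: [0, 0])
    with os.scandir(replslot_dir) as slots:
        for slot in slots:
            if not slot.is_dir():
                continue
            with os.scandir(slot.path) as files:
                for f in files:
                    m = SNAP_RE.match(f.name)
                    if m is None:
                        continue  # skip non-spill files such as the slot state
                    entry = usage[(slot.name, int(m.group(1)))]
                    entry[0] += 1
                    entry[1] += f.stat().st_size
    return {key: tuple(val) for key, val in usage.items()}


def over_threshold(usage, max_files):
    """Return the (slot_name, xid) keys that have more than max_files files."""
    return sorted(key for key, (count, _) in usage.items() if count > max_files)


if __name__ == "__main__":
    # Hypothetical data directory; adjust to the local installation.
    path = "/var/lib/postgresql/9.6/main/pg_replslot"
    if os.path.isdir(path):
        for (slot, xid), (count, size) in sorted(spill_usage(path).items()):
            print(slot, xid, count, size)
```

os.scandir is used rather than os.listdir so that a directory holding millions of entries is streamed instead of being loaded into one list. The script needs read access to the data directory, so it would typically be run as the postgres OS user.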