logical replication snapshots
Hi everyone,
I'm playing with logical replication in postgres-10 and I got it to hang
on startup with
2018-07-23 13:15:06.577 CDT [18624] FATAL: the database system is starting up
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 on_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(1): 1 callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: exit(1)
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(-1): 0 callbacks to make
... forever ...
After poking around:
# find /var/lib/pgsql/10/data/pg_logical/snapshots -printf . | wc -c
12727975
It looks like ZFS simply can't handle this many files: `ls` etc. hang
forever.
So my question is, does it have to be that many? And if not, is there a
knob to keep the number down?
TIA
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Hi,
On 2018-07-23 15:23:04 -0500, Dimitri Maziuk wrote:
I'm playing with logical replication in postgres-10 and I got it to hang
on startup with
2018-07-23 13:15:06.577 CDT [18624] FATAL: the database system is starting up
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(1): 0 on_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(1): 1 callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: exit(1)
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2018-07-23 13:15:06.577 CDT [18624] DEBUG: proc_exit(-1): 0 callbacks to make
... forever ...
What precedes that "loop"?
Could you describe what exactly you did to get into that situation?
After poking around:
# find /var/lib/pgsql/10/data/pg_logical/snapshots -printf . | wc -c
12727975
Ugh, something absolutely clearly isn't right here.
It looks like ZFS simply can't handle this many files: `ls` etc. hang
forever.
ls probably works in general, it's just the sorting that makes it
hang. Should work with -f or such.
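For example, counting entries without triggering a sort (an editor's sketch against a throwaway directory; substitute the real /var/lib/pgsql/10/data/pg_logical/snapshots path on the affected host):

```shell
# Sketch: count files without sorting. ls sorts by default, which is
# what hangs on a 13-million-entry directory; -f disables the sort.
# Demonstrated on a throwaway dir so it is safe to run anywhere.
demo=$(mktemp -d)
touch "$demo"/a.snap "$demo"/b.snap "$demo"/c.snap

unsorted=$(ls -f "$demo" | grep -c '\.snap$')   # no sort, stays fast
viafind=$(find "$demo" -name '*.snap' | wc -l)  # find doesn't sort either
echo "$unsorted $viafind"

rm -r "$demo"
```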
Greetings,
Andres Freund
On 7/23/2018 6:51 PM, Andres Freund wrote:
What precedes that "loop"?
systemctl start postgresql-10
Could you describe what exactly you did to get into that situation?
Created a database, few publications, as per TFM, and a few
subscriptions on another host. Went on vacation for 3 weeks. The problem
host is centos 7.current w/ postgres rpms from PGDG repo. Postgres is on
port 5434 because there is a stock one on 5432.
ls probably works in general, it's just the sorting that makes it
hang. Should work with -f or such.
I don't think this means what you think this means.
There are *13 million* files in
/var/lib/pgsql/10/data/pg_logical/snapshots and postgres hangs forever
on start with
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
...
I think the two are related. ICBW and all that. What do I do to not get
it into that state again?
Thx
On 2018-Jul-24, Dimitri Maziuk wrote:
On 7/23/2018 6:51 PM, Andres Freund wrote:
Could you describe what exactly you did to get into that situation?
Created a database, few publications, as per TFM, and a few subscriptions on
another host. Went on vacation for 3 weeks. The problem host is centos
7.current w/ postgres rpms from PGDG repo. Postgres is on port 5434 because
there is a stock one on 5432.
Are you serious about us trying to diagnose a bug with this description?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
Are you serious about us trying to diagnose a bug with this description?
What do you want to know, exactly?
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
Are you serious about us trying to diagnose a bug with this description?
What do you want to know, exactly?
A recipe that we can follow and reproduce the issue.
Greetings,
Andres Freund
On 07/24/2018 01:43 PM, Andres Freund wrote:
On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
Are you serious about us trying to diagnose a bug with this description?
What do you want to know, exactly?
A recipe that we can follow and reproduce the issue.
The nutshell version is as I said: I pretty much followed the manual to
create a test publication/subscription setup and let it run for a few
weeks. I had to increase a few limits but otherwise everything's at the
default settings as shipped in the rpms.
I can send you postgres config files from both nodes and the entire
database dump -- it's all public domain. I can `zfs send` you the
snapshot of the entire /var/lib/pgsql/10 as is, too.
If you are asking for something other than those two extremes, please
let me know what it is and I'll be happy to try and provide it.
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
On 2018-07-24 14:02:26 -0500, Dimitri Maziuk wrote:
On 07/24/2018 01:43 PM, Andres Freund wrote:
On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
Are you serious about us trying to diagnose a bug with this description?
What do you want to know, exactly?
A recipe that we can follow and reproduce the issue.
The nutshell version is as I said: I pretty much followed the manual to
create a test publication/subscription setup and let it run for a few
weeks. I had to increase a few limits but otherwise everything's at the
default settings as shipped in the rpms.
Are you really expecting us to be able to reproduce the problem based on
the above description? Our test suites do set up plain replication
setups, and the problem doesn't occur there.
If you are asking for something other than those two extremes, please
let me know what it is and I'll be happy to try and provide it.
A command by command reproducer would be good.
I can send you postgres config files from both nodes and the entire
database dump -- it's all public domain. I can `zfs send` you the
snapshot of the entire /var/lib/pgsql/10 as is, too.
Without the genesis, that's probably not as helpful, unfortunately.
Greetings,
Andres Freund
On 7/25/2018 10:28 AM, Andres Freund wrote:
Are you really expecting us to be able to reproduce the problem based on
the above description? Our test suites do set up plain replication
setups, and the problem doesn't occur there.
I don't, by definition, have a reproducible case: it only happened once
so far.
If nobody knows what limits the number of files created in
$PGDATA/pg_logical/snapshots then we'll all have to wait until this
happens again.
(To somebody else as I'm obviously not turning logical replication back
on until I know it won't kill my server again.)
Dima
On 07/25/2018 08:40 AM, Dimitri Maziuk wrote:
On 7/25/2018 10:28 AM, Andres Freund wrote:
Are you really expecting us to be able to reproduce the problem based on
the above description? Our test suites do set up plain replication
setups, and the problem doesn't occur there.
I don't, by definition, have a reproducible case: it only happened once
so far.
Were you using pg_export_snapshot() by any chance?:
https://www.postgresql.org/docs/10/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION
Were there any relevant error messages in the log before the database hung?
If nobody knows what limits the number of files created in
$PGDATA/pg_logical/snapshots then we'll all have to wait until this
happens again.
(To somebody else as I'm obviously not turning logical replication back
on until I know it won't kill my server again.)
Given that it took 3 weeks to manifest itself before, I would say give
it a try and monitor $PGDATA/pg_logical/snapshots. That would help
provide information for getting at the source of the problem. You can
always disable the replication if it looks like it is running away.
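A minimal monitoring sketch along those lines (SNAPDIR and LIMIT are placeholders; it is demonstrated here against a temp directory so it can be tried anywhere, but in a cron job SNAPDIR would point at $PGDATA/pg_logical/snapshots):

```shell
# Sketch of a cron-able check: warn when the snapshots directory grows
# past a threshold. SNAPDIR and LIMIT below are demo placeholders --
# point SNAPDIR at $PGDATA/pg_logical/snapshots on the real host.
SNAPDIR=$(mktemp -d)      # stand-in for $PGDATA/pg_logical/snapshots
LIMIT=2                   # arbitrary demo threshold
touch "$SNAPDIR"/one.snap "$SNAPDIR"/two.snap "$SNAPDIR"/three.snap

count=$(find "$SNAPDIR" -name '*.snap' | wc -l)
if [ "$count" -gt "$LIMIT" ]; then
    echo "WARNING: $count .snap files in $SNAPDIR (limit $LIMIT)"
fi
rm -r "$SNAPDIR"
```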
Dima
--
Adrian Klaver
adrian.klaver@aklaver.com
On 07/25/2018 11:16 AM, Adrian Klaver wrote:
Were you using pg_export_snapshot() by any chance?:
If it's not in chapter 31 of TFM then I have not touched it. The only
notable thing I changed from 31.9 quickstart is that replica identity is
full on all tables (the data's a mess).
Were there any relevant error messages in the log before the database
hung?
Do you seriously expect me to know what "relevant" is? ;) -- It's 7
days' worth of
2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT
2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408
2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions.
2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT
2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440
2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions.
2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT
2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478
2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions.
2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT
that ends with
2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT
2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up
2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up
... ad infinitum ...
Would "permission denied" be relevant?
(Not sure where that is coming from: I did initial sync as postgres then
altered it to replication user. Can't track it down now since the
database is dead.)
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
On 07/25/2018 11:10 AM, Dimitri Maziuk wrote:
On 07/25/2018 11:16 AM, Adrian Klaver wrote:
Where you using pg_export_snapshot() by any chance?:
If it's not in chapter 31 of TFM then I have not touched it. The only
notable thing I changed from 31.9 quickstart is that replica identity is
full on all tables (the data's a mess).
So no.
Were there any relevant error messages in the log before the database
hung?
Do you seriously expect me to know what "relevant" is? ;) -- It's 7
days' worth of
Generally speaking, relevancy increases the closer you get to the
fall-over event. So we are most interested in entries close to the point
where the database stopped/failed.
2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT
2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408
2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions.
2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT
2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440
2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions.
2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT
2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478
2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions.
2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT
that ends with
2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT
2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up
2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up
... ad infinitum ...
Would "permission denied" be relevant?
Logical decoding is something I am still learning. The "permission
denied" would to me be relevant only to the extent that it seems to be
provoking:
"LOG: logical decoding found consistent point at 19/E6942440"
"DETAIL: There are no running transactions."
Others with more experience in this area would need to fill in whether that
might account for the 13 million files in the ~snapshot/ directory.
(Not sure where that is coming from: I did initial sync as postgres then
altered it to replication user. Can't track it down now since the
database is dead.)
In your previous post you said:
"(To somebody else as I'm obviously not turning logical replication back
on until I know it won't kill my server again.) "
So what are you working off now?
--
Adrian Klaver
adrian.klaver@aklaver.com
On 07/25/2018 02:31 PM, Adrian Klaver wrote:
In your previous post you said:
"(To somebody else as I'm obviously not turning logical replication back
on until I know it won't kill my server again.)"
So what are you working off now?
I have that database rebuilt with no publications defined, and I have
the original $PGDATA on a ZFS volume.
I have not looked at the subscriber host yet, I've more urgent fires to
douse first.
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
On 07/25/2018 11:10 AM, Dimitri Maziuk wrote:
On 07/25/2018 11:16 AM, Adrian Klaver wrote:
Where you using pg_export_snapshot() by any chance?:
If it's not in chapter 31 of TFM then I have not touched it. The only
notable thing I changed from 31.9 quickstart is that replica identity is
full on all tables (the data's a mess).
Were there any relevant error messages in the log before the database hung?
Do you seriously expect me to know what "relevant" is? ;) -- It's 7
days' worth of
2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT
2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408
2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions.
2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT
2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440
2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions.
2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT
2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478
2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions.
2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT
A little slow, but I realized that the above looks like the initial sync
for a table. Do these errors stretch all the way back to your
original sync?
that ends with
2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT
2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up
2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up
... ad infinitum ...
Would "permission denied" be relevant?
(Not sure where that is coming from: I did initial sync as postgres then
altered it to replication user. Can't track it down now since the
database is dead.)
--
Adrian Klaver
adrian.klaver@aklaver.com
On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote:
On 07/25/2018 11:10 AM, Dimitri Maziuk wrote:
On 07/25/2018 11:16 AM, Adrian Klaver wrote:
Where you using pg_export_snapshot() by any chance?:
If it's not in chapter 31 of TFM then I have not touched it. The only
notable thing I changed from 31.9 quickstart is that replica identity is
full on all tables (the data's a mess).
So no.
Were there any relevant error messages in the log before the database
hung?
Do you seriously expect me to know what "relevant" is? ;) -- It's 7
days' worth of
Generally speaking, relevancy increases the closer you get to the
fall-over event. So we are most interested in entries close to the point
where the database stopped/failed.
2018-07-23 09:43:12.851 CDT [7975] STATEMENT: COPY macromolecules."Entity_common_name" TO STDOUT
2018-07-23 09:43:12.925 CDT [7982] LOG: logical decoding found consistent point at 19/E6942408
2018-07-23 09:43:12.925 CDT [7982] DETAIL: There are no running transactions.
2018-07-23 09:43:12.935 CDT [7982] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:12.935 CDT [7982] STATEMENT: COPY macromolecules."Assembly" TO STDOUT
2018-07-23 09:43:13.034 CDT [7987] LOG: logical decoding found consistent point at 19/E6942440
2018-07-23 09:43:13.034 CDT [7987] DETAIL: There are no running transactions.
2018-07-23 09:43:13.044 CDT [7987] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.044 CDT [7987] STATEMENT: COPY macromolecules."Spectral_transition" TO STDOUT
2018-07-23 09:43:13.135 CDT [7990] LOG: logical decoding found consistent point at 19/E6942478
2018-07-23 09:43:13.135 CDT [7990] DETAIL: There are no running transactions.
2018-07-23 09:43:13.145 CDT [7990] ERROR: permission denied for schema macromolecules
2018-07-23 09:43:13.145 CDT [7990] STATEMENT: COPY macromolecules."Assembly_db_link" TO STDOUT
that ends with
2018-07-23 09:45:40.280 CDT [6032] LOG: database system was interrupted; last known up at 2018-07-06 17:04:28 CDT
2018-07-23 09:45:40.281 CDT [6033] FATAL: the database system is starting up
2018-07-23 09:45:40.282 CDT [6034] FATAL: the database system is starting up
... ad infinitum ...
Would "permission denied" be relevant?
Logical decoding is something I am still learning. The "permission denied"
would to me be relevant only to the extent that it seems to be provoking:
Yes, it looks related. Looks like logical rep was trying to perform the
initial sync of those tables, and couldn't due to permission errors.
"LOG: logical decoding found consistent point at 19/E6942440"
"DETAIL: There are no running transactions."
Others with more experience in this area would need to fill in whether that
might account for the 13 million files in the ~snapshot/ directory.
That indicates there are some error-handling issues to be resolved. Petr?
Greetings,
Andres Freund
On 07/25/2018 07:57 PM, Andres Freund wrote:
On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote:
Would "permission denied" be relevant?
Logical decoding is something I am still learning. The "permission denied"
would to me be relevant only to the extent that it seems to be provoking:
Yes, it looks related. Looks like logical rep was trying to perform the
initial sync of those tables, and couldn't due to permission errors.
I'm not sure what happened, I remember the initial sync of that
particular schema failing on one table only, but looking at it now, all
tables are empty on the subscriber.
"LOG: logical decoding found consistent point at 19/E6942440"
"DETAIL: There are no running transactions."
Others with more experience in this area would need to fill in whether that
might account for the 13 million files in the ~snapshot/ directory.
That indicates there are some error-handling issues to be resolved. Petr?
It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)?
Would it be safe to delete every file in ~snapshot/ older than some
X-number of minutes? (And what a good X might be?)
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
On 07/26/2018 10:54 AM, Dimitri Maziuk wrote:
On 07/25/2018 07:57 PM, Andres Freund wrote:
On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote:
Would "permission denied" be relevant?
Logical decoding is something I am still learning. The "permission denied"
would to me be relevant only to the extent that it seems to be provoking:
Yes, it looks related. Looks like logical rep was trying to perform the
initial sync of those tables, and couldn't due to permission errors.
I'm not sure what happened, I remember the initial sync of that
particular schema failing on one table only, but looking at it now, all
tables are empty on the subscriber.
To me that indicates all the syncs failed.
"LOG: logical decoding found consistent point at 19/E6942440"
"DETAIL: There are no running transactions."
Others with more experience in this area would need to fill in whether that
might account for the 13 million files in the ~snapshot/ directory.
That indicates there are some error-handling issues to be resolved. Petr?
It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)?
That is getting deeper into the code than I have been, so I don't know.
FYI the code is below and the part that triggered the error is around
line 762 (copy_table(Relation rel)).
Would it be safe to delete every file in ~snapshot/ older than some
X-number of minutes? (And what a good X might be?)
Again I don't know the answer to this. Are you trying this on a test
setup or production one?
--
Adrian Klaver
adrian.klaver@aklaver.com
On 2018-07-26 12:54:19 -0500, Dimitri Maziuk wrote:
"LOG: logical decoding found consistent point at 19/E6942440"
"DETAIL: There are no running transactions."
Others with more experience in this area would need to fill in whether that
might account for the 13 million files in the ~snapshot/ directory.
That indicates there are some error-handling issues to be resolved. Petr?
It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)?
No, it doesn't. The snapshots are internal to logical replication.
Would it be safe to delete every file in ~snapshot/ older than some
X-number of minutes? (And what a good X might be?)
You shouldn't manually delete them. But an explicit 'checkpoint;' (and
all automatically triggered ones), should do so for all that are
older. Is there a chance that you have some old leftover replication
slots around? Because that'd prevent cleanup.
What does SELECT * FROM pg_replication_slots; return?
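A sketch of that check and cleanup (port 5434 is the instance from this thread, 'stale_slot_name' is a placeholder, and the commands are guarded so they only run if a server is actually reachable):

```shell
# Sketch: inspect replication slots; an inactive slot with an old
# restart_lsn pins WAL and prevents snapshot cleanup. Port 5434 is the
# instance from this thread; 'stale_slot_name' is a placeholder for a
# slot you have confirmed is dead.
LIST_SLOTS="SELECT slot_name, slot_type, active, restart_lsn
            FROM pg_replication_slots;"
DROP_SLOT="SELECT pg_drop_replication_slot('stale_slot_name');"

if pg_isready -q -p 5434 2>/dev/null; then
    psql -p 5434 -c "$LIST_SLOTS"
    # Only after confirming a slot is genuinely dead:
    # psql -p 5434 -c "$DROP_SLOT"
    # A checkpoint then lets older .snap files be removed:
    # psql -p 5434 -c "CHECKPOINT;"
fi
```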
Greetings,
Andres Freund
On 07/26/2018 02:54 PM, Adrian Klaver wrote:
On 07/26/2018 10:54 AM, Dimitri Maziuk wrote:
I'm not sure what happened, I remember the initial sync of that
particular schema failing on one table only, but looking at it now, all
tables are empty on the subscriber.
To me that indicates all the syncs failed.
Yeah, well... the error message said one table failed and I went off to
find out why (a co-worker added a column behind everyone's back) and
never checked 'count(*)' on the other tables.
... deleting files in ~snapshots
Again I don't know the answer to this. Are you trying this on a test
setup or production one?
I could fire up another instance on a different port off the now-broken
$PGDATA easily enough and test. However if whatever uses those files
needs to start from the very first file and "replay" them in sequence,
that won't work.
The files are named like 19_E6942440.snap which presumably corresponds
to "LOG: logical decoding found consistent point at 19/E6942440" and
they seem to get progressively larger. That suggests that maybe just one
(the newest one) could be good enough...
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
On 07/26/2018 02:06 PM, Dimitri Maziuk wrote:
On 07/26/2018 02:54 PM, Adrian Klaver wrote:
On 07/26/2018 10:54 AM, Dimitri Maziuk wrote:
I'm not sure what happened, I remember the initial sync of that
particular schema failing on one table only, but looking at it now, all
tables are empty on the subscriber.
To me that indicates all the syncs failed.
Yeah, well... the error message said one table failed and I went off to
find out why (a co-worker added a column behind everyone's back) and
never checked 'count(*)' on the other tables.
The thing that has me somewhat confused is:
ERROR: permission denied for schema macromolecules
I would have thought the replication user could access that.
Is there something special about that schema?
Did the user role you set up for replication have the REPLICATION
attribute or was it a superuser?
... deleting files in ~snapshots
Again I don't know the answer to this. Are you trying this on a test
setup or production one?I could fire up another instance on a different port off the now-broken
$PGDATA easily enough and test. However if whatever uses those files
needs to start from the very first file and "replay" them in sequence,
that won't work.
If I remember correctly this is because the server chokes on the 13
million files in the ~/pg_logical/snapshots directory.
I wonder what happens if you change (on a test setup):
wal_level = logical
to
wal_level = replica
The files are named like 19_E6942440.snap which presumably corresponds
to "LOG: logical decoding found consistent point at 19/E6942440" and
they seem to get progressively larger. That suggests that maybe just one
(the newest one) could be good enough...
--
Adrian Klaver
adrian.klaver@aklaver.com