Errors with schema migration and logical replication — expected?
Hi, first time poster.
I just ran into a rather messy problem when doing a schema migration with
logical replication. I'm not entirely sure what went wrong, why, or how to
prevent it in the future. The migration I ran was pretty simple (though
auto-generated by Django):
BEGIN;ALTER TABLE "search_docketentry" ADD COLUMN
"pacer_sequence_number" smallint NULL;ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" DROP DEFAULT;ALTER TABLE
"search_docketentry" ADD COLUMN "recap_sequence_number" varchar(50)
DEFAULT '' NOT NULL;ALTER TABLE "search_docketentry" ALTER COLUMN
"recap_sequence_number" DROP DEFAULT;ALTER TABLE "search_docketentry"
ALTER COLUMN "entry_number" DROP NOT NULL;ALTER TABLE
"search_recapdocument" ALTER COLUMN "document_number" SET DEFAULT
'';ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number"
DROP DEFAULT;ALTER TABLE "search_docketentry" DROP CONSTRAINT
"search_docketentry_docket_id_12fd448b9aa007ca_uniq";CREATE INDEX
"search_docketentry_recap_sequence_number_1c82e51988e2d89f_idx" ON
"search_docketentry" ("recap_sequence_number", "entry_number");CREATE
INDEX "search_docketentry_eb19fcf7" ON "search_docketentry"
("pacer_sequence_number");CREATE INDEX "search_docketentry_bff4d47b"
ON "search_docketentry" ("recap_sequence_number");CREATE INDEX
"search_docketentry_recap_sequence_number_d700f0391e8213a_like" ON
"search_docketentry" ("recap_sequence_number" varchar_pattern_ops);
COMMIT;
BEGIN;ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" TYPE integer;
COMMIT;
And after running this migration, I started getting this error on the
subscriber:
2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply
worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column
"recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains
(48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07
04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical
replication worker for subscription 18390 (PID 13373) exited with exit
code 1
So, my migration created a new column with a null constraint and somehow
the subscriber got data that violated that. I don't know how that's
possible since this was a new column and it was never nullable.
I applied the above migration simultaneously on my publisher and subscriber
thinking that postgresql was smart enough to do the right thing. I think
the subscriber finished first (it has less traffic).
The docs hint that postgresql might be smart enough to not worry about the
order you do migrations:
*Logical replication is robust when schema definitions change in a live
database:* When the schema is changed on the publisher and replicated data
starts arriving at the subscriber but does not fit into the table schema,
replication will error until the schema is updated.
And it even hints that doing a migration on the subscriber first is a good
thing in some cases:
In many cases, intermittent errors can be avoided by applying additive
schema changes to the subscriber first.
But I'm now supremely skeptical that doing anything at the subscriber first
is a good idea. Are the docs wrong? Does the above error make sense? Is the
process for schema migrations documented somewhere beyond the above?
I have lots of questions because I thought this would have gone smoother
than it did.
As for the fix: I made the column nullable on the subscriber and I'm
waiting for it to catch up. Once it does I'll re-sync its schema with the
publisher. Anybody interested in following along with all this (or finding
this later and having questions) can follow the issue here:
https://github.com/freelawproject/courtlistener/issues/919
Thank you for the lovely database! I hope this is helpful.
Mike
On 12/8/18 11:26 PM, Mike Lissner wrote:
Hi, first time poster.
I just ran into a rather messy problem when doing a schema migration
with logical replication. I'm not entirely sure what went wrong, why, or
how to prevent it in the future. The migration I ran was pretty simple
(though auto-generated by Django):
And after running this migration, I started getting this error on the
subscriber:|2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply
worker for subscription "replicasubscription" has started 2018-12-09
05:59:45 UTC::@:[13373]:ERROR: null value in column
"recap_sequence_number" violates not-null constraint 2018-12-09 05:59:45
UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07
04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214,
null, null). 2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process:
logical replication worker for subscription 18390 (PID 13373) exited
with exit code 1|So, my migration created a new column with a null constraint and somehow
the subscriber got data that violated that. I don't know how that's
possible since this was a new column and it was never nullable.
The above seems to be the crux of the problem, how did NULL get into the
column data?
The DDL migration did what it was supposed to do.
I applied the above migration simultaneously on my publisher and
subscriber thinking that postgresql was smart enough to do the right
thing. I think the subscriber finished first (it has less traffic).
And it did as far as the schema changes where concerned.
The docs hint that postgresql might be smart enough to not worry about
the order you do migrations:/Logical replication is robust when schema definitions change in a
live database:/ When the schema is changed on the publisher and
replicated data starts arriving at the subscriber but does not fit into
the table schema, replication will error until the schema is updated.And it even hints that doing a migration on the subscriber first is a
good thing in some cases:In many cases, intermittent errors can be avoided by applying
additive schema changes to the subscriber first.
But I'm now supremely skeptical that doing anything at the subscriber
first is a good idea. Are the docs wrong? Does the above error make
sense? Is the process for schema migrations documented somewhere beyond
the above?I have lots of questions because I thought this would have gone smoother
than it did.As for the fix: I made the column nullable on the subscriber and I'm
waiting for it to catch up. Once it does I'll re-sync its schema with
the publisher. Anybody interested in following along with all this (or
finding this later and having questions) can follow the issue here:https://github.com/freelawproject/courtlistener/issues/919
Thank you for the lovely database! I hope this is helpful.
Mike
--
Adrian Klaver
adrian.klaver@aklaver.com
The above seems to be the crux of the problem, how did NULL get into the
column data?
I agree. My queries are generated by Django (so I never write SQL myself),
but:
- the column has always been NOT NULL for its entire lifetime
- we don't send *any* SQL commands to the replica yet, so that's not a
factor (for now it's just a live backup)
- the publisher now has a NOT NULL constraint on that column. I never had
to clear out null values to put it in place. I assume that if that column
ever had a null value and I tried to run a DDL to add a null constraint,
the DDL would have failed, right?
Something feels wrong here, the more I think about it.
On 12/9/18 8:03 AM, Mike Lissner wrote:
The above seems to be the crux of the problem, how did NULL get into
the
column data?I agree. My queries are generated by Django (so I never write SQL
myself), but:- the column has always been NOT NULL for its entire lifetime
The lifetime being since the migration did this?:
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;
Also does the column recap_sequence_number appear in any other tables.
Just wondering if the error was on another table?
- we don't send *any* SQL commands to the replica yet, so that's not a
factor (for now it's just a live backup)
- the publisher now has a NOT NULL constraint on that column. I never
had to clear out null values to put it in place. I assume that if that
This part confuses me. You seem to imply that the column existed before
the migration and you just added a NOT NULL constraint. The migration
shows the column being created with a NOT NULL constraint.
column ever had a null value and I tried to run a DDL to add a null
constraint, the DDL would have failed, right?Something feels wrong here, the more I think about it.
A start would be to figure out what generated?:
failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)
--
Adrian Klaver
adrian.klaver@aklaver.com
On Sun, Dec 9, 2018 at 8:43 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
On 12/9/18 8:03 AM, Mike Lissner wrote:
The above seems to be the crux of the problem, how did NULL get into
the
column data?I agree. My queries are generated by Django (so I never write SQL
myself), but:- the column has always been NOT NULL for its entire lifetime
The lifetime being since the migration did this?:
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;
"Lifetime" meaning that there was never a time when this column allowed
nulls.
Also does the column recap_sequence_number appear in any other tables.
Just wondering if the error was on another table?
Good idea, but no. This column only exists in one table.
- we don't send *any* SQL commands to the replica yet, so that's not a
factor (for now it's just a live backup)
- the publisher now has a NOT NULL constraint on that column. I never
had to clear out null values to put it in place. I assume that if thatThis part confuses me. You seem to imply that the column existed before
the migration and you just added a NOT NULL constraint. The migration
shows the column being created with a NOT NULL constraint.
Sorry, what I mean is that if *somehow* the master had null values in that
column at some point, which I don't know how would even be possible because
it only came into existence with the command above — if somehow that
happened, I'd know, because I wouldn't have been *able* to add a NULL
constraint without first fixing the data in that column, which I never did.
My contention is that for all these reasons, there should *never* have been
a null value in that column on master.
column ever had a null value and I tried to run a DDL to add a null
constraint, the DDL would have failed, right?Something feels wrong here, the more I think about it.
A start would be to figure out what generated?:
failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)
Yes, I completely agree. I can't think of any way that that should have
ever been created.
Show quoted text
--
Adrian Klaver
adrian.klaver@aklaver.com
On 12/9/18 10:47 AM, Mike Lissner wrote:
My contention is that for all these reasons, there should *never* have
been a null value in that column on master.column ever had a null value and I tried to run a DDL to add a null
constraint, the DDL would have failed, right?Something feels wrong here, the more I think about it.
A start would be to figure out what generated?:
failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)Yes, I completely agree. I can't think of any way that that should have
ever been created.
1) Using psql have you verified that NOT NULL is set on that column on
the publisher?
2) And that the row that failed in the subscriber is in the publisher table.
3) That there are no NULL values in the publisher column?
Whatever the answers to 1), 2) and 3) are the next question is:
4) Do you want/need recap_sequence_number to be NOT NULL.
a) If not then you could leave things as they are.
b) If so then you:
1) Have to figure out what is sending NULL values to the column.
Maybe a model that has null=True set when it shouldn't be?
A Form/ModelForm that is allowing None/Null?
Some code that is operating outside the ORM e.g. doing a
direct query using from django.db import connection.
2) Clean up the NULL values in the column in the subscriber
and/or publisher.
--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
--
Adrian Klaver
adrian.klaver@aklaver.com
On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
1) Using psql have you verified that NOT NULL is set on that column on
the publisher?
Yes, on the publisher and the subscriber. That was my first step when I saw
the log lines about this.
2) And that the row that failed in the subscriber is in the publisher table.
Yep, it's there (though it doesn't show a null for that column, and I don't
know how it ever could have).
3) That there are no NULL values in the publisher column?
This on the publisher:
select * from search_docketentry where recap_sequence_number is null;
returns zero rows, so yeah, no nulls in there (which makes sense since
they're not allowed).
Whatever the answers to 1), 2) and 3) are the next question is:
4) Do you want/need recap_sequence_number to be NOT NULL.
Yes, and indeed that's how it always has been.
a) If not then you could leave things as they are.
Well, I was able to fix this by briefly allowing nulls on the subscriber,
letting it catch up with the publisher, setting all nulls to empty strings
(a Django convention), and then disallowing nulls again. After letting it
catch up, there were 118 nulls on the subscriber in this column:
https://github.com/freelawproject/courtlistener/issues/919#issuecomment-445520185
That shouldn't be possible since nulls were never allowed in this column on
the publisher.
b) If so then you:
1) Have to figure out what is sending NULL values to the column.
Maybe a model that has null=True set when it shouldn't be?
Nope, never had that. I'm 100% certain.
A Form/ModelForm that is allowing None/Null?
Even if that was the case, the error wouldn't have shown up on the
subscriber since that null would have never been allowed in the publisher.
But anyway, I don't use any forms with this column.
Some code that is operating outside the ORM e.g. doing a
direct query using from django.db import connection.
That's an idea, but like I said, nothing sends SQL to the subscriber (not
even read requests), and this shouldn't have been possible in the publisher
due to the NOT NULL constraint that has *always* been on that column.
2) Clean up the NULL values in the column in the subscriber
and/or publisher.
There were only NULL values in the subscriber, never in the publisher.
Something is amiss here.
I appreciate all the responses. I'm scared to say so, but I think this is a
bug in logical replication. Somehow a null value appeared at the subscriber
that was never in the publisher.
I also still have this question/suggestion from my first email:
Is the process for schema migrations documented somewhere beyond the
above?
Thank you again,
Mike
Reupping this since it was over the weekend and looks like a bug in logical
replication. My problems are solved, but some very weird things happened
when doing a schema migration.
On Sun, Dec 9, 2018 at 5:48 PM Mike Lissner <mlissner@michaeljaylissner.com>
wrote:
Show quoted text
On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:1) Using psql have you verified that NOT NULL is set on that column on
the publisher?Yes, on the publisher and the subscriber. That was my first step when I
saw the log lines about this.2) And that the row that failed in the subscriber is in the publisher
table.
Yep, it's there (though it doesn't show a null for that column, and I
don't know how it ever could have).3) That there are no NULL values in the publisher column?
This on the publisher:
select * from search_docketentry where recap_sequence_number is null;
returns zero rows, so yeah, no nulls in there (which makes sense since
they're not allowed).Whatever the answers to 1), 2) and 3) are the next question is:
4) Do you want/need recap_sequence_number to be NOT NULL.
Yes, and indeed that's how it always has been.
a) If not then you could leave things as they are.
Well, I was able to fix this by briefly allowing nulls on the subscriber,
letting it catch up with the publisher, setting all nulls to empty strings
(a Django convention), and then disallowing nulls again. After letting it
catch up, there were 118 nulls on the subscriber in this column:https://github.com/freelawproject/courtlistener/issues/919#issuecomment-445520185
That shouldn't be possible since nulls were never allowed in this column
on the publisher.b) If so then you:
1) Have to figure out what is sending NULL values to the column.
Maybe a model that has null=True set when it shouldn't be?
Nope, never had that. I'm 100% certain.
A Form/ModelForm that is allowing None/Null?
Even if that was the case, the error wouldn't have shown up on the
subscriber since that null would have never been allowed in the publisher.
But anyway, I don't use any forms with this column.Some code that is operating outside the ORM e.g. doing a
direct query using from django.db import connection.That's an idea, but like I said, nothing sends SQL to the subscriber (not
even read requests), and this shouldn't have been possible in the publisher
due to the NOT NULL constraint that has *always* been on that column.2) Clean up the NULL values in the column in the subscriber
and/or publisher.
There were only NULL values in the subscriber, never in the publisher.
Something is amiss here.I appreciate all the responses. I'm scared to say so, but I think this is
a bug in logical replication. Somehow a null value appeared at the
subscriber that was never in the publisher.I also still have this question/suggestion from my first email:
Is the process for schema migrations documented somewhere beyond the
above?
Thank you again,
Mike
On 12/11/18 2:21 PM, Mike Lissner wrote:
Reupping this since it was over the weekend and looks like a bug in
logical replication. My problems are solved, but some very weird things
happened when doing a schema migration.On Sun, Dec 9, 2018 at 5:48 PM Mike Lissner
<mlissner@michaeljaylissner.com <mailto:mlissner@michaeljaylissner.com>>
wrote:On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:1) Using psql have you verified that NOT NULL is set on that
column on
the publisher?Yes, on the publisher and the subscriber. That was my first step
when I saw the log lines about this.2) And that the row that failed in the subscriber is in the
publisher table.Yep, it's there (though it doesn't show a null for that column, and
I don't know how it ever could have).3) That there are no NULL values in the publisher column?
This on the publisher:
select * from search_docketentry where recap_sequence_number is null;
returns zero rows, so yeah, no nulls in there (which makes sense
since they're not allowed).Whatever the answers to 1), 2) and 3) are the next question is:
4) Do you want/need recap_sequence_number to be NOT NULL.
Yes, and indeed that's how it always has been.
a) If not then you could leave things as they are.
Well, I was able to fix this by briefly allowing nulls on the
subscriber, letting it catch up with the publisher, setting all
nulls to empty strings (a Django convention), and then disallowing
nulls again. After letting it catch up, there were 118 nulls on the
subscriber in this column:
So recap_sequence_number is not actually a number, it is a code?
I appreciate all the responses. I'm scared to say so, but I think
this is a bug in logical replication. Somehow a null value appeared
at the subscriber that was never in the publisher.I also still have this question/suggestion from my first email:
Is the process for schema migrations documented somewhere beyond
the above?
Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is Django
I am assuming some combination of migrate, makemigrations and/or sqlmigrate.
Thank you again,
Mike
--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
Well, I was able to fix this by briefly allowing nulls on the
subscriber, letting it catch up with the publisher, setting all
nulls to empty strings (a Django convention), and then disallowing
nulls again. After letting it catch up, there were 118 nulls on the
subscriber in this column:So recap_sequence_number is not actually a number, it is a code?
It has sequential values, but they're not necessarily numbers.
I appreciate all the responses. I'm scared to say so, but I think
this is a bug in logical replication. Somehow a null value appeared
at the subscriber that was never in the publisher.I also still have this question/suggestion from my first email:
Is the process for schema migrations documented somewhere beyond
the above?
Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is Django
I am assuming some combination of migrate, makemigrations and/or
sqlmigrate.
Pretty simple/standard, I think:
- Changed the code.
- Generated the migration using manage.py makemigration
- Generated the SQL using sqlmigrate
- Ran the migration using manage.py migrate on the master and using psql
on the replica
Mike
On 12/12/18 12:15 AM, Mike Lissner wrote:
On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:Well, I was able to fix this by briefly allowing nulls on the
subscriber, letting it catch up with the publisher, setting all
nulls to empty strings (a Django convention), and thendisallowing
nulls again. After letting it catch up, there were 118 nulls
on the
subscriber in this column:
So recap_sequence_number is not actually a number, it is a code?
It has sequential values, but they're not necessarily numbers.
I appreciate all the responses. I'm scared to say so, but I think
this is a bug in logical replication. Somehow a null valueappeared
at the subscriber that was never in the publisher.
I also still have this question/suggestion from my first email:
> Is the process for schema migrations documented somewhere
beyond
the above?
Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is
Django
I am assuming some combination of migrate, makemigrations and/or
sqlmigrate.Pretty simple/standard, I think:
- Changed the code.
- Generated the migration using manage.py makemigration
- Generated the SQL using sqlmigrate
- Ran the migration using manage.py migrate on the master and using
psql on the replica
The only thing I can think of involves this sequence on the subscriber:
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;
and then this:
https://www.postgresql.org/docs/11/logical-replication-subscription.html
"Columns of a table are also matched by name. A different order of
columns in the target table is allowed, but the column types have to
match. The target table can have additional columns not provided by the
published table. Those will be filled with their default values."
https://www.postgresql.org/docs/10/sql-createtable.html
"If there is no default for a column, then the default is null."
So the subscriber finished the migration first, as alluded to in an
earlier post. There is no data for recap_sequence_number coming from the
provider so Postgres place holds the data with NULL until such time as
the migration on the provider finishes and actual data for
recap_sequence_number starts flowing.
Going forward, options I see:
1) Making sure there is a DEFAULT other then NULL for a NOT NULL column.
2) Stop the replication and do the migration scripts on both provider
and subscriber until they both complete and then start replication again.
Mike
--
Adrian Klaver
adrian.klaver@aklaver.com
This sounds *very* plausible. So I think there are a few takeaways:
1. Should the docs mention that additive changes with NOT NULL constraints
are bad?
2. Is there a way this could work without completely breaking replication?
For example, should Postgresql realize replication can't work in this
instance and then stop it until schemas are back in sync, like it does with
other incompatible schema changes? That'd be better than failing in this
way and is what I'd expect to happen.
3. Are there other edge cases like this that aren't well documented that we
can expect to creep up on us? If so, should we try to spell out exactly
*which* additive changes *are* OK?
This feels like a major "gotcha" to me, and I'm trying to avoid those. I
feel like the docs are pretty lacking here and that others will find
themselves in similarly bad positions.
Better schema migration docs would surely help, too.
Mike
On Wed, Dec 12, 2018 at 7:11 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
Show quoted text
On 12/12/18 12:15 AM, Mike Lissner wrote:
On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:Well, I was able to fix this by briefly allowing nulls on the
subscriber, letting it catch up with the publisher, settingall
nulls to empty strings (a Django convention), and then
disallowing
nulls again. After letting it catch up, there were 118 nulls
on the
subscriber in this column:
So recap_sequence_number is not actually a number, it is a code?
It has sequential values, but they're not necessarily numbers.
I appreciate all the responses. I'm scared to say so, but I
think
this is a bug in logical replication. Somehow a null value
appeared
at the subscriber that was never in the publisher.
I also still have this question/suggestion from my first
email:
Is the process for schema migrations documented somewhere
beyond
the above?
Not that I know of. It might help, if possible, to detail the steps
in
the migration. Also what program you used to do it. Given that is
Django
I am assuming some combination of migrate, makemigrations and/or
sqlmigrate.Pretty simple/standard, I think:
- Changed the code.
- Generated the migration using manage.py makemigration
- Generated the SQL using sqlmigrate
- Ran the migration using manage.py migrate on the master and using
psql on the replicaThe only thing I can think of involves this sequence on the subscriber:
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;and then this:
https://www.postgresql.org/docs/11/logical-replication-subscription.html
"Columns of a table are also matched by name. A different order of
columns in the target table is allowed, but the column types have to
match. The target table can have additional columns not provided by the
published table. Those will be filled with their default values."https://www.postgresql.org/docs/10/sql-createtable.html
"If there is no default for a column, then the default is null."
So the subscriber finished the migration first, as alluded to in an
earlier post. There is no data for recap_sequence_number coming from the
provider so Postgres place holds the data with NULL until such time as
the migration on the provider finishes and actual data for
recap_sequence_number starts flowing.Going forward, options I see:
1) Making sure there is a DEFAULT other then NULL for a NOT NULL column.
2) Stop the replication and do the migration scripts on both provider
and subscriber until they both complete and then start replication again.Mike
--
Adrian Klaver
adrian.klaver@aklaver.com
On 12/12/18 3:19 PM, Mike Lissner wrote:
This sounds *very* plausible. So I think there are a few takeaways:
1. Should the docs mention that additive changes with NOT NULL
constraints are bad?
It's not the NOT NULL it's the lack of a DEFAULT. In general a column
with a NOT NULL and no DEFAULT is going to to bite you sooner or later:)
At this point I have gathered enough of those bite marks to just make it
my policy to always provide a DEFAULT for a NOT NULL column.
2. Is there a way this could work without completely breaking
replication? For example, should Postgresql realize replication can't
work in this instance and then stop it until schemas are back in sync,
like it does with other incompatible schema changes? That'd be better
than failing in this way and is what I'd expect to happen.
Not sure as there is no requirement that a column has a specified
DEFAULT. This is unlike PK and FK constraint violations where the
relationship is spelled out. Trying to parse all the possible ways a
user could get into trouble would require something on the order of an
AI and I don't see that happening anytime soon.
3. Are there other edge cases like this that aren't well documented that
we can expect to creep up on us? If so, should we try to spell out
exactly *which* additive changes *are* OK?
Not that I know of. By their nature edge cases are rare and often are
dealt with in the moment and not pushed out to everybody. The only
solution I know of is pretesting your schema change/replication setup on
a dev installation.
This feels like a major "gotcha" to me, and I'm trying to avoid those. I
feel like the docs are pretty lacking here and that others will find
themselves in similarly bad positions.
Logical replication in core(not the pglogical extension) appeared for
the first time in version 10. On the crawl/walk/run spectrum it is
moving from crawl to walk. The docs will take some time to be more
complete. Just for the record my previous post was sketching out a
possible scenario not an ironclad answer. If you think the answer is
plausible and a 'gotcha' I would file a bug:
https://www.postgresql.org/account/login/?next=/account/submitbug/
Better schema migration docs would surely help, too.
Mike
--
Adrian Klaver
adrian.klaver@aklaver.com
Thanks Adrian for all the help. I filed this as bug #15549. I hope this all
helps get logical replication into the "Running" stage.
On Wed, Dec 12, 2018 at 5:06 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
Show quoted text
On 12/12/18 3:19 PM, Mike Lissner wrote:
This sounds *very* plausible. So I think there are a few takeaways:
1. Should the docs mention that additive changes with NOT NULL
constraints are bad?It's not the NOT NULL it's the lack of a DEFAULT. In general a column
with a NOT NULL and no DEFAULT is going to to bite you sooner or later:)
At this point I have gathered enough of those bite marks to just make it
my policy to always provide a DEFAULT for a NOT NULL column.2. Is there a way this could work without completely breaking
replication? For example, should Postgresql realize replication can't
work in this instance and then stop it until schemas are back in sync,
like it does with other incompatible schema changes? That'd be better
than failing in this way and is what I'd expect to happen.Not sure as there is no requirement that a column has a specified
DEFAULT. This is unlike PK and FK constraint violations where the
relationship is spelled out. Trying to parse all the possible ways a
user could get into trouble would require something on the order of an
AI and I don't see that happening anytime soon.3. Are there other edge cases like this that aren't well documented that
we can expect to creep up on us? If so, should we try to spell out
exactly *which* additive changes *are* OK?Not that I know of. By their nature edge cases are rare and often are
dealt with in the moment and not pushed out to everybody. The only
solution I know of is pretesting your schema change/replication setup on
a dev installation.This feels like a major "gotcha" to me, and I'm trying to avoid those. I
feel like the docs are pretty lacking here and that others will find
themselves in similarly bad positions.Logical replication in core(not the pglogical extension) appeared for
the first time in version 10. On the crawl/walk/run spectrum it is
moving from crawl to walk. The docs will take some time to be more
complete. Just for the record my previous post was sketching out a
possible scenario not an ironclad answer. If you think the answer is
plausible and a 'gotcha' I would file a bug:https://www.postgresql.org/account/login/?next=/account/submitbug/
Better schema migration docs would surely help, too.
Mike
--
Adrian Klaver
adrian.klaver@aklaver.com