Errors with schema migration and logical replication — expected?

Started by Mike Lissnerover 7 years ago14 messagesgeneral
Jump to latest
#1Mike Lissner
mlissner@michaeljaylissner.com

Hi, first time poster.

I just ran into a rather messy problem when doing a schema migration with
logical replication. I'm not entirely sure what went wrong, why, or how to
prevent it in the future. The migration I ran was pretty simple (though
auto-generated by Django):

BEGIN;ALTER TABLE "search_docketentry" ADD COLUMN
"pacer_sequence_number" smallint NULL;ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" DROP DEFAULT;ALTER TABLE
"search_docketentry" ADD COLUMN "recap_sequence_number" varchar(50)
DEFAULT '' NOT NULL;ALTER TABLE "search_docketentry" ALTER COLUMN
"recap_sequence_number" DROP DEFAULT;ALTER TABLE "search_docketentry"
ALTER COLUMN "entry_number" DROP NOT NULL;ALTER TABLE
"search_recapdocument" ALTER COLUMN "document_number" SET DEFAULT
'';ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number"
DROP DEFAULT;ALTER TABLE "search_docketentry" DROP CONSTRAINT
"search_docketentry_docket_id_12fd448b9aa007ca_uniq";CREATE INDEX
"search_docketentry_recap_sequence_number_1c82e51988e2d89f_idx" ON
"search_docketentry" ("recap_sequence_number", "entry_number");CREATE
INDEX "search_docketentry_eb19fcf7" ON "search_docketentry"
("pacer_sequence_number");CREATE INDEX "search_docketentry_bff4d47b"
ON "search_docketentry" ("recap_sequence_number");CREATE INDEX
"search_docketentry_recap_sequence_number_d700f0391e8213a_like" ON
"search_docketentry" ("recap_sequence_number" varchar_pattern_ops);
COMMIT;
BEGIN;ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" TYPE integer;
COMMIT;

And after running this migration, I started getting this error on the
subscriber:

2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply
worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column
"recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains
(48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07
04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical
replication worker for subscription 18390 (PID 13373) exited with exit
code 1

So, my migration created a new column with a null constraint and somehow
the subscriber got data that violated that. I don't know how that's
possible since this was a new column and it was never nullable.

I applied the above migration simultaneously on my publisher and subscriber
thinking that postgresql was smart enough to do the right thing. I think
the subscriber finished first (it has less traffic).

The docs hint that postgresql might be smart enough to not worry about the
order you do migrations:

*Logical replication is robust when schema definitions change in a live

database:* When the schema is changed on the publisher and replicated data
starts arriving at the subscriber but does not fit into the table schema,
replication will error until the schema is updated.

And it even hints that doing a migration on the subscriber first is a good
thing in some cases:

In many cases, intermittent errors can be avoided by applying additive

schema changes to the subscriber first.

But I'm now supremely skeptical that doing anything at the subscriber first
is a good idea. Are the docs wrong? Does the above error make sense? Is the
process for schema migrations documented somewhere beyond the above?

I have lots of questions because I thought this would have gone smoother
than it did.

As for the fix: I made the column nullable on the subscriber and I'm
waiting for it to catch up. Once it does I'll re-sync its schema with the
publisher. Anybody interested in following along with all this (or finding
this later and having questions) can follow the issue here:

https://github.com/freelawproject/courtlistener/issues/919

Thank you for the lovely database! I hope this is helpful.

Mike

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Mike Lissner (#1)
Re: Errors with schema migration and logical replication — expected?

On 12/8/18 11:26 PM, Mike Lissner wrote:

Hi, first time poster.

I just ran into a rather messy problem when doing a schema migration
with logical replication. I'm not entirely sure what went wrong, why, or
how to prevent it in the future. The migration I ran was pretty simple
(though auto-generated by Django):

And after running this migration, I started getting this error on the
subscriber:

|2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply
worker for subscription "replicasubscription" has started 2018-12-09
05:59:45 UTC::@:[13373]:ERROR: null value in column
"recap_sequence_number" violates not-null constraint 2018-12-09 05:59:45
UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07
04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214,
null, null). 2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process:
logical replication worker for subscription 18390 (PID 13373) exited
with exit code 1|

So, my migration created a new column with a null constraint and somehow
the subscriber got data that violated that. I don't know how that's
possible since this was a new column and it was never nullable.

The above seems to be the crux of the problem, how did NULL get into the
column data?

The DDL migration did what it was supposed to do.

I applied the above migration simultaneously on my publisher and
subscriber thinking that postgresql was smart enough to do the right
thing. I think the subscriber finished first (it has less traffic).

And it did as far as the schema changes where concerned.

The docs hint that postgresql might be smart enough to not worry about
the order you do migrations:

/Logical replication is robust when schema definitions change in a

live database:/ When the schema is changed on the publisher and
replicated data starts arriving at the subscriber but does not fit into
the table schema, replication will error until the schema is updated.

And it even hints that doing a migration on the subscriber first is a
good thing in some cases:

In many cases, intermittent errors can be avoided by applying

additive schema changes to the subscriber first.

But I'm now supremely skeptical that doing anything at the subscriber
first is a good idea. Are the docs wrong? Does the above error make
sense? Is the process for schema migrations documented somewhere beyond
the above?

I have lots of questions because I thought this would have gone smoother
than it did.

As for the fix: I made the column nullable on the subscriber and I'm
waiting for it to catch up. Once it does I'll re-sync its schema with
the publisher. Anybody interested in following along with all this (or
finding this later and having questions) can follow the issue here:

https://github.com/freelawproject/courtlistener/issues/919

Thank you for the lovely database! I hope this is helpful.

Mike

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Mike Lissner
mlissner@michaeljaylissner.com
In reply to: Adrian Klaver (#2)
Re: Errors with schema migration and logical replication — expected?

The above seems to be the crux of the problem, how did NULL get into the
column data?

I agree. My queries are generated by Django (so I never write SQL myself),
but:

- the column has always been NOT NULL for its entire lifetime
- we don't send *any* SQL commands to the replica yet, so that's not a
factor (for now it's just a live backup)
- the publisher now has a NOT NULL constraint on that column. I never had
to clear out null values to put it in place. I assume that if that column
ever had a null value and I tried to run a DDL to add a null constraint,
the DDL would have failed, right?

Something feels wrong here, the more I think about it.

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Mike Lissner (#3)
Re: Errors with schema migration and logical replication — expected?

On 12/9/18 8:03 AM, Mike Lissner wrote:

The above seems to be the crux of the problem, how did NULL get into
the
column data?

I agree. My queries are generated by Django (so I never write SQL
myself), but:

 - the column has always been NOT NULL for its entire lifetime

The lifetime being since the migration did this?:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

Also does the column recap_sequence_number appear in any other tables.
Just wondering if the error was on another table?

 - we don't send *any* SQL commands to the replica yet, so that's not a
factor (for now it's just a live backup)
 - the publisher now has a NOT NULL constraint on that column. I never
had to clear out null values to put it in place. I assume that if that

This part confuses me. You seem to imply that the column existed before
the migration and you just added a NOT NULL constraint. The migration
shows the column being created with a NOT NULL constraint.

column ever had a null value and I tried to run a DDL to add a null
constraint, the DDL would have failed, right?

Something feels wrong here, the more I think about it.

A start would be to figure out what generated?:

failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Mike Lissner
mlissner@michaeljaylissner.com
In reply to: Adrian Klaver (#4)
Re: Errors with schema migration and logical replication — expected?

On Sun, Dec 9, 2018 at 8:43 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 12/9/18 8:03 AM, Mike Lissner wrote:

The above seems to be the crux of the problem, how did NULL get into
the
column data?

I agree. My queries are generated by Django (so I never write SQL
myself), but:

- the column has always been NOT NULL for its entire lifetime

The lifetime being since the migration did this?:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

"Lifetime" meaning that there was never a time when this column allowed
nulls.

Also does the column recap_sequence_number appear in any other tables.
Just wondering if the error was on another table?

Good idea, but no. This column only exists in one table.

- we don't send *any* SQL commands to the replica yet, so that's not a
factor (for now it's just a live backup)
- the publisher now has a NOT NULL constraint on that column. I never
had to clear out null values to put it in place. I assume that if that

This part confuses me. You seem to imply that the column existed before
the migration and you just added a NOT NULL constraint. The migration
shows the column being created with a NOT NULL constraint.

Sorry, what I mean is that if *somehow* the master had null values in that
column at some point, which I don't know how would even be possible because
it only came into existence with the command above — if somehow that
happened, I'd know, because I wouldn't have been *able* to add a NULL
constraint without first fixing the data in that column, which I never did.

My contention is that for all these reasons, there should *never* have been
a null value in that column on master.

column ever had a null value and I tried to run a DDL to add a null
constraint, the DDL would have failed, right?

Something feels wrong here, the more I think about it.

A start would be to figure out what generated?:

failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)

Yes, I completely agree. I can't think of any way that that should have
ever been created.

Show quoted text

--
Adrian Klaver
adrian.klaver@aklaver.com

#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Mike Lissner (#5)
Re: Errors with schema migration and logical replication — expected?

On 12/9/18 10:47 AM, Mike Lissner wrote:

My contention is that for all these reasons, there should *never* have
been a null value in that column on master.

column ever had a null value and I tried to run a DDL to add a null
constraint, the DDL would have failed, right?

Something feels wrong here, the more I think about it.

A start would be to figure out what generated?:

failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)

Yes, I completely agree. I can't think of any way that that should have
ever been created.

1) Using psql have you verified that NOT NULL is set on that column on
the publisher?

2) And that the row that failed in the subscriber is in the publisher table.

3) That there are no NULL values in the publisher column?

Whatever the answers to 1), 2) and 3) are the next question is:

4) Do you want/need recap_sequence_number to be NOT NULL.

a) If not then you could leave things as they are.

b) If so then you:

1) Have to figure out what is sending NULL values to the column.

Maybe a model that has null=True set when it shouldn't be?

A Form/ModelForm that is allowing None/Null?

Some code that is operating outside the ORM e.g. doing a
direct query using from django.db import connection.

2) Clean up the NULL values in the column in the subscriber
and/or publisher.

--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

--
Adrian Klaver
adrian.klaver@aklaver.com

#7Mike Lissner
mlissner@michaeljaylissner.com
In reply to: Adrian Klaver (#6)
Re: Errors with schema migration and logical replication — expected?

On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

1) Using psql have you verified that NOT NULL is set on that column on
the publisher?

Yes, on the publisher and the subscriber. That was my first step when I saw
the log lines about this.

2) And that the row that failed in the subscriber is in the publisher table.

Yep, it's there (though it doesn't show a null for that column, and I don't
know how it ever could have).

3) That there are no NULL values in the publisher column?

This on the publisher:

select * from search_docketentry where recap_sequence_number is null;

returns zero rows, so yeah, no nulls in there (which makes sense since
they're not allowed).

Whatever the answers to 1), 2) and 3) are the next question is:

4) Do you want/need recap_sequence_number to be NOT NULL.

Yes, and indeed that's how it always has been.

a) If not then you could leave things as they are.

Well, I was able to fix this by briefly allowing nulls on the subscriber,
letting it catch up with the publisher, setting all nulls to empty strings
(a Django convention), and then disallowing nulls again. After letting it
catch up, there were 118 nulls on the subscriber in this column:

https://github.com/freelawproject/courtlistener/issues/919#issuecomment-445520185

That shouldn't be possible since nulls were never allowed in this column on
the publisher.

b) If so then you:

1) Have to figure out what is sending NULL values to the column.

Maybe a model that has null=True set when it shouldn't be?

Nope, never had that. I'm 100% certain.

A Form/ModelForm that is allowing None/Null?

Even if that was the case, the error wouldn't have shown up on the
subscriber since that null would have never been allowed in the publisher.
But anyway, I don't use any forms with this column.

Some code that is operating outside the ORM e.g. doing a
direct query using from django.db import connection.

That's an idea, but like I said, nothing sends SQL to the subscriber (not
even read requests), and this shouldn't have been possible in the publisher
due to the NOT NULL constraint that has *always* been on that column.

2) Clean up the NULL values in the column in the subscriber

and/or publisher.

There were only NULL values in the subscriber, never in the publisher.
Something is amiss here.

I appreciate all the responses. I'm scared to say so, but I think this is a
bug in logical replication. Somehow a null value appeared at the subscriber
that was never in the publisher.

I also still have this question/suggestion from my first email:

Is the process for schema migrations documented somewhere beyond the

above?

Thank you again,

Mike

#8Mike Lissner
mlissner@michaeljaylissner.com
In reply to: Mike Lissner (#7)
Re: Errors with schema migration and logical replication — expected?

Reupping this since it was over the weekend and looks like a bug in logical
replication. My problems are solved, but some very weird things happened
when doing a schema migration.

On Sun, Dec 9, 2018 at 5:48 PM Mike Lissner <mlissner@michaeljaylissner.com>
wrote:

Show quoted text

On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

1) Using psql have you verified that NOT NULL is set on that column on
the publisher?

Yes, on the publisher and the subscriber. That was my first step when I
saw the log lines about this.

2) And that the row that failed in the subscriber is in the publisher

table.

Yep, it's there (though it doesn't show a null for that column, and I
don't know how it ever could have).

3) That there are no NULL values in the publisher column?

This on the publisher:

select * from search_docketentry where recap_sequence_number is null;

returns zero rows, so yeah, no nulls in there (which makes sense since
they're not allowed).

Whatever the answers to 1), 2) and 3) are the next question is:

4) Do you want/need recap_sequence_number to be NOT NULL.

Yes, and indeed that's how it always has been.

a) If not then you could leave things as they are.

Well, I was able to fix this by briefly allowing nulls on the subscriber,
letting it catch up with the publisher, setting all nulls to empty strings
(a Django convention), and then disallowing nulls again. After letting it
catch up, there were 118 nulls on the subscriber in this column:

https://github.com/freelawproject/courtlistener/issues/919#issuecomment-445520185

That shouldn't be possible since nulls were never allowed in this column
on the publisher.

b) If so then you:

1) Have to figure out what is sending NULL values to the column.

Maybe a model that has null=True set when it shouldn't be?

Nope, never had that. I'm 100% certain.

A Form/ModelForm that is allowing None/Null?

Even if that was the case, the error wouldn't have shown up on the
subscriber since that null would have never been allowed in the publisher.
But anyway, I don't use any forms with this column.

Some code that is operating outside the ORM e.g. doing a
direct query using from django.db import connection.

That's an idea, but like I said, nothing sends SQL to the subscriber (not
even read requests), and this shouldn't have been possible in the publisher
due to the NOT NULL constraint that has *always* been on that column.

2) Clean up the NULL values in the column in the subscriber

and/or publisher.

There were only NULL values in the subscriber, never in the publisher.
Something is amiss here.

I appreciate all the responses. I'm scared to say so, but I think this is
a bug in logical replication. Somehow a null value appeared at the
subscriber that was never in the publisher.

I also still have this question/suggestion from my first email:

Is the process for schema migrations documented somewhere beyond the

above?

Thank you again,

Mike

#9Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Mike Lissner (#8)
Re: Errors with schema migration and logical replication — expected?

On 12/11/18 2:21 PM, Mike Lissner wrote:

Reupping this since it was over the weekend and looks like a bug in
logical replication. My problems are solved, but some very weird things
happened when doing a schema migration.

On Sun, Dec 9, 2018 at 5:48 PM Mike Lissner
<mlissner@michaeljaylissner.com <mailto:mlissner@michaeljaylissner.com>>
wrote:

On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

1) Using psql have you verified that NOT NULL is set on that
column on
the publisher?

Yes, on the publisher and the subscriber. That was my first step
when I saw the log lines about this.

2) And that the row that failed in the subscriber is in the
publisher table.

Yep, it's there (though it doesn't show a null for that column, and
I don't know how it ever could have).

3) That there are no NULL values in the publisher column?

This on the publisher:

select * from search_docketentry where recap_sequence_number is null;

returns zero rows, so yeah, no nulls in there (which makes sense
since they're not allowed).

Whatever the answers to 1), 2) and 3) are the next question is:

4) Do you want/need recap_sequence_number to be NOT NULL.

Yes, and indeed that's how it always has been.

a) If not then you could leave things as they are.

Well, I was able to fix this by briefly allowing nulls on the
subscriber, letting it catch up with the publisher, setting all
nulls to empty strings (a Django convention), and then disallowing
nulls again. After letting it catch up, there were 118 nulls on the
subscriber in this column:

So recap_sequence_number is not actually a number, it is a code?

I appreciate all the responses. I'm scared to say so, but I think
this is a bug in logical replication. Somehow a null value appeared
at the subscriber that was never in the publisher.

I also still have this question/suggestion from my first email:

Is the process for schema migrations documented somewhere beyond

the above?

Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is Django
I am assuming some combination of migrate, makemigrations and/or sqlmigrate.

Thank you again,

Mike

--
Adrian Klaver
adrian.klaver@aklaver.com

#10Mike Lissner
mlissner@michaeljaylissner.com
In reply to: Adrian Klaver (#9)
Re: Errors with schema migration and logical replication — expected?

On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Well, I was able to fix this by briefly allowing nulls on the
subscriber, letting it catch up with the publisher, setting all
nulls to empty strings (a Django convention), and then disallowing
nulls again. After letting it catch up, there were 118 nulls on the
subscriber in this column:

So recap_sequence_number is not actually a number, it is a code?

It has sequential values, but they're not necessarily numbers.

I appreciate all the responses. I'm scared to say so, but I think
this is a bug in logical replication. Somehow a null value appeared
at the subscriber that was never in the publisher.

I also still have this question/suggestion from my first email:

Is the process for schema migrations documented somewhere beyond

the above?

Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is Django
I am assuming some combination of migrate, makemigrations and/or
sqlmigrate.

Pretty simple/standard, I think:
- Changed the code.
- Generated the migration using manage.py makemigration
- Generated the SQL using sqlmigrate
- Ran the migration using manage.py migrate on the master and using psql
on the replica

Mike

#11Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Mike Lissner (#10)
Re: Errors with schema migration and logical replication — expected?

On 12/12/18 12:15 AM, Mike Lissner wrote:

On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

     Well, I was able to fix this by briefly allowing nulls on the
     subscriber, letting it catch up with the publisher, setting all
     nulls to empty strings (a Django convention), and then

disallowing

     nulls again. After letting it catch up, there were 118 nulls

on the

     subscriber in this column:

So recap_sequence_number is not actually a number, it is a code?

It has sequential values, but they're not necessarily numbers.

     I appreciate all the responses. I'm scared to say so, but I think
     this is a bug in logical replication. Somehow a null value

appeared

     at the subscriber that was never in the publisher.

     I also still have this question/suggestion from my first email:

      > Is the process for schema migrations documented somewhere

beyond

     the above?

Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is
Django
I am assuming some combination of migrate, makemigrations and/or
sqlmigrate.

Pretty simple/standard, I think:
 - Changed the code.
 - Generated the migration using manage.py makemigration
 - Generated the SQL using sqlmigrate
 - Ran the migration using manage.py migrate on the master and using
psql on the replica

The only thing I can think of involves this sequence on the subscriber:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

and then this:

https://www.postgresql.org/docs/11/logical-replication-subscription.html

"Columns of a table are also matched by name. A different order of
columns in the target table is allowed, but the column types have to
match. The target table can have additional columns not provided by the
published table. Those will be filled with their default values."

https://www.postgresql.org/docs/10/sql-createtable.html

"If there is no default for a column, then the default is null."

So the subscriber finished the migration first, as alluded to in an
earlier post. There is no data for recap_sequence_number coming from the
provider so Postgres place holds the data with NULL until such time as
the migration on the provider finishes and actual data for
recap_sequence_number starts flowing.

Going forward, options I see:

1) Making sure there is a DEFAULT other then NULL for a NOT NULL column.

2) Stop the replication and do the migration scripts on both provider
and subscriber until they both complete and then start replication again.

Mike

--
Adrian Klaver
adrian.klaver@aklaver.com

#12Mike Lissner
mlissner@michaeljaylissner.com
In reply to: Adrian Klaver (#11)
Re: Errors with schema migration and logical replication — expected?

This sounds *very* plausible. So I think there are a few takeaways:

1. Should the docs mention that additive changes with NOT NULL constraints
are bad?

2. Is there a way this could work without completely breaking replication?
For example, should Postgresql realize replication can't work in this
instance and then stop it until schemas are back in sync, like it does with
other incompatible schema changes? That'd be better than failing in this
way and is what I'd expect to happen.

3. Are there other edge cases like this that aren't well documented that we
can expect to creep up on us? If so, should we try to spell out exactly
*which* additive changes *are* OK?

This feels like a major "gotcha" to me, and I'm trying to avoid those. I
feel like the docs are pretty lacking here and that others will find
themselves in similarly bad positions.

Better schema migration docs would surely help, too.

Mike

On Wed, Dec 12, 2018 at 7:11 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 12/12/18 12:15 AM, Mike Lissner wrote:

On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

Well, I was able to fix this by briefly allowing nulls on the
subscriber, letting it catch up with the publisher, setting

all

nulls to empty strings (a Django convention), and then

disallowing

nulls again. After letting it catch up, there were 118 nulls

on the

subscriber in this column:

So recap_sequence_number is not actually a number, it is a code?

It has sequential values, but they're not necessarily numbers.

I appreciate all the responses. I'm scared to say so, but I

think

this is a bug in logical replication. Somehow a null value

appeared

at the subscriber that was never in the publisher.

I also still have this question/suggestion from my first

email:

Is the process for schema migrations documented somewhere

beyond

the above?

Not that I know of. It might help, if possible, to detail the steps

in

the migration. Also what program you used to do it. Given that is
Django
I am assuming some combination of migrate, makemigrations and/or
sqlmigrate.

Pretty simple/standard, I think:
- Changed the code.
- Generated the migration using manage.py makemigration
- Generated the SQL using sqlmigrate
- Ran the migration using manage.py migrate on the master and using
psql on the replica

The only thing I can think of involves this sequence on the subscriber:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

and then this:

https://www.postgresql.org/docs/11/logical-replication-subscription.html

"Columns of a table are also matched by name. A different order of
columns in the target table is allowed, but the column types have to
match. The target table can have additional columns not provided by the
published table. Those will be filled with their default values."

https://www.postgresql.org/docs/10/sql-createtable.html

"If there is no default for a column, then the default is null."

So the subscriber finished the migration first, as alluded to in an
earlier post. There is no data for recap_sequence_number coming from the
provider so Postgres place holds the data with NULL until such time as
the migration on the provider finishes and actual data for
recap_sequence_number starts flowing.

Going forward, options I see:

1) Making sure there is a DEFAULT other then NULL for a NOT NULL column.

2) Stop the replication and do the migration scripts on both provider
and subscriber until they both complete and then start replication again.

Mike

--
Adrian Klaver
adrian.klaver@aklaver.com

#13Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Mike Lissner (#12)
Re: Errors with schema migration and logical replication — expected?

On 12/12/18 3:19 PM, Mike Lissner wrote:

This sounds *very* plausible. So I think there are a few takeaways:

1. Should the docs mention that additive changes with NOT NULL
constraints are bad?

It's not the NOT NULL it's the lack of a DEFAULT. In general a column
with a NOT NULL and no DEFAULT is going to to bite you sooner or later:)
At this point I have gathered enough of those bite marks to just make it
my policy to always provide a DEFAULT for a NOT NULL column.

2. Is there a way this could work without completely breaking
replication? For example, should Postgresql realize replication can't
work in this instance and then stop it until schemas are back in sync,
like it does with other incompatible schema changes? That'd be better
than failing in this way and is what I'd expect to happen.

Not sure as there is no requirement that a column has a specified
DEFAULT. This is unlike PK and FK constraint violations where the
relationship is spelled out. Trying to parse all the possible ways a
user could get into trouble would require something on the order of an
AI and I don't see that happening anytime soon.

3. Are there other edge cases like this that aren't well documented that
we can expect to creep up on us? If so, should we try to spell out
exactly *which* additive changes *are* OK?

Not that I know of. By their nature edge cases are rare and often are
dealt with in the moment and not pushed out to everybody. The only
solution I know of is pretesting your schema change/replication setup on
a dev installation.

This feels like a major "gotcha" to me, and I'm trying to avoid those. I
feel like the docs are pretty lacking here and that others will find
themselves in similarly bad positions.

Logical replication in core(not the pglogical extension) appeared for
the first time in version 10. On the crawl/walk/run spectrum it is
moving from crawl to walk. The docs will take some time to be more
complete. Just for the record my previous post was sketching out a
possible scenario not an ironclad answer. If you think the answer is
plausible and a 'gotcha' I would file a bug:

https://www.postgresql.org/account/login/?next=/account/submitbug/

Better schema migration docs would surely help, too.

Mike

--
Adrian Klaver
adrian.klaver@aklaver.com

#14Mike Lissner
mlissner@michaeljaylissner.com
In reply to: Adrian Klaver (#13)
Re: Errors with schema migration and logical replication — expected?

Thanks Adrian for all the help. I filed this as bug #15549. I hope this all
helps get logical replication into the "Running" stage.

On Wed, Dec 12, 2018 at 5:06 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 12/12/18 3:19 PM, Mike Lissner wrote:

This sounds *very* plausible. So I think there are a few takeaways:

1. Should the docs mention that additive changes with NOT NULL
constraints are bad?

It's not the NOT NULL it's the lack of a DEFAULT. In general a column
with a NOT NULL and no DEFAULT is going to to bite you sooner or later:)
At this point I have gathered enough of those bite marks to just make it
my policy to always provide a DEFAULT for a NOT NULL column.

2. Is there a way this could work without completely breaking
replication? For example, should Postgresql realize replication can't
work in this instance and then stop it until schemas are back in sync,
like it does with other incompatible schema changes? That'd be better
than failing in this way and is what I'd expect to happen.

Not sure as there is no requirement that a column has a specified
DEFAULT. This is unlike PK and FK constraint violations where the
relationship is spelled out. Trying to parse all the possible ways a
user could get into trouble would require something on the order of an
AI and I don't see that happening anytime soon.

3. Are there other edge cases like this that aren't well documented that
we can expect to creep up on us? If so, should we try to spell out
exactly *which* additive changes *are* OK?

Not that I know of. By their nature edge cases are rare and often are
dealt with in the moment and not pushed out to everybody. The only
solution I know of is pretesting your schema change/replication setup on
a dev installation.

This feels like a major "gotcha" to me, and I'm trying to avoid those. I
feel like the docs are pretty lacking here and that others will find
themselves in similarly bad positions.

Logical replication in core(not the pglogical extension) appeared for
the first time in version 10. On the crawl/walk/run spectrum it is
moving from crawl to walk. The docs will take some time to be more
complete. Just for the record my previous post was sketching out a
possible scenario not an ironclad answer. If you think the answer is
plausible and a 'gotcha' I would file a bug:

https://www.postgresql.org/account/login/?next=/account/submitbug/

Better schema migration docs would surely help, too.

Mike

--
Adrian Klaver
adrian.klaver@aklaver.com