Unsynchronized parallel dumps from 13.3 replica produced by pg_dump

Started by Chris Williamsover 4 years ago4 messagesgeneral

cswilliams@gmail.com

over 4 years ago

Hi,

We have a script that runs a pg_dump off of an RDS PG13.3 replica several
times per day. We then load this dump using pg_restore into another
postgres RDS db in another AWS account, scrub some of the data, and then
take a snapshot of it. We then use these snapshots for testing and staging
servers. This script has been running fine for years without any issues.
However, a little less than a month ago, we changed the pg_dump command in
the script to use the "-j" flag to speed up our dumps. It was my
understanding that starting with postgres v10, this option should now be
supported on replicas. Since we recently upgraded to Postgres 13.3 from
9.6, we thought it would be a good idea to start using this feature to
speed up our pg_dumps. In order to use parallel dumps, we made two changes
to our script: 1) We swapped to this command: pg_dump -Fd -j9 -Z0 -x
-hreplica -Uuser -f dump_dir For reference, the previous command (which
never had any problems) was: pg_dump -Fc -Z0 -x -hreplica -Uuser > dump 2)
We added code to resize our replica db to a larger hardware size before
starting the pg_dump so we can utilize more parallelism. After resizing the
db, the script waits for our replica to go back to the "available" state
before running the pg_dump command. When the pg_dump finishes, we then
resize the db back down to a t3.medium for cost reasons. When we first
rolled this out, the speed improvement was fantastic, it brought our
pg_dump down from 30 minutes to 5. However, within a week or so of rolling
this out, we've noticed that sometimes the dumps produced by our pg_dump
command were not consistent/synchronized. Sometimes when we would restore
the backup using pg_restore to our intermediate db it would throw duplicate
key errors or other types of constraint errors. Sometimes, even when the
pg_restore fully restored without errors, we noticed that some of the
sequences were wrong (we saw some sequences with a value less than the max
id in the column, so new inserts would start failing on the table). The
errors were never consistent either. It appears to me that the dump
produced by our pg_dump is not actually synchronized. I checked the
postgres logs on the server and don't see anything noteworthy. We also are
not getting any errors from our pg_dump command. Everything we are using is
version 13.3 (all of the dbs involved and the server running the script is
using v13.3 of pg_dump/pg_restore commands) I was curious if anyone had any
thoughts on what could be causing this. From my reading of the pg_dump
manual, it seems like it should always be producing a consistent dump, but
please correct me if I'm misunderstanding. I've also opened a ticket with
AWS's support to see if they have any ideas. For now, we've had to go back
to using the single threaded pg_dump (which is disappointing because it's
25 minutes slower). We were really hoping to take advantage of the
parallelism.

Any ideas would be much appreciated.

Thanks,
Chris

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Chris Williams (#1)

Re: Unsynchronized parallel dumps from 13.3 replica produced by pg_dump

Chris Williams <cswilliams@gmail.com> writes:

We have a script that runs a pg_dump off of an RDS PG13.3 replica several
times per day. We then load this dump using pg_restore into another
postgres RDS db in another AWS account, scrub some of the data, and then
take a snapshot of it.

Hmm ... I'm fairly sure that RDS Postgres is not Postgres at this level
of detail. The info I've been able to find about their replication
mechanism talks about things like "eventually consistent reads", which
is not something community Postgres deals in.

In particular, what I'd expect from the community code is that a replica
could see a sequence as being *ahead* of the value that you might expect
from looking at related tables; but never behind. (Also, that statement
is true regardless of whether you are doing parallel dump.) And
non-sequence tables should always be consistent, period.

So I'm suspicious that this is an RDS-specific effect, and thus that
you should consult Amazon support first. If they say "no, it's Postgres
all the way down", then we need to look closer.

regards, tom lane

Chris Williams

cswilliams@gmail.com

over 4 years ago

In reply to: Tom Lane (#2)

Re: Unsynchronized parallel dumps from 13.3 replica produced by pg_dump

Thanks Tom. It's a strange one for sure. Hopefully AWS support will shed
some light on it. I will clarify too that this is the regular RDS Postgres
version and not their other Aurora Postgres service. I suspect the Aurora
Postgres probably differs from the community version by quite a bit, but
I'm unsure how much their regular Postgres offering differs, if at all.

Thanks,
Chris

On Mon, Oct 18, 2021 at 8:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Chris Williams <cswilliams@gmail.com> writes:

We have a script that runs a pg_dump off of an RDS PG13.3 replica several
times per day. We then load this dump using pg_restore into another
postgres RDS db in another AWS account, scrub some of the data, and then
take a snapshot of it.

Hmm ... I'm fairly sure that RDS Postgres is not Postgres at this level
of detail. The info I've been able to find about their replication
mechanism talks about things like "eventually consistent reads", which
is not something community Postgres deals in.

In particular, what I'd expect from the community code is that a replica
could see a sequence as being *ahead* of the value that you might expect
from looking at related tables; but never behind. (Also, that statement
is true regardless of whether you are doing parallel dump.) And
non-sequence tables should always be consistent, period.

So I'm suspicious that this is an RDS-specific effect, and thus that
you should consult Amazon support first. If they say "no, it's Postgres
all the way down", then we need to look closer.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Chris Williams (#3)

Re: Unsynchronized parallel dumps from 13.3 replica produced by pg_dump

Chris Williams <cswilliams@gmail.com> writes:

Thanks Tom. It's a strange one for sure. Hopefully AWS support will shed
some light on it. I will clarify too that this is the regular RDS Postgres
version and not their other Aurora Postgres service. I suspect the Aurora
Postgres probably differs from the community version by quite a bit, but
I'm unsure how much their regular Postgres offering differs, if at all.

Yeah, Aurora is definitely a different beast at the storage level.
I'm not entirely sure about RDS.

regards, tom lane