Potential G2-item cycles under serializable isolation
Hello everyone!
First off, I'm sorry for *gestures vaguely* all of this. Second, I think I may
have found a serializability violation in Postgres 12.3, involving
anti-dependency cycles around row insertion.
For background, my name is Kyle Kingsbury, and I test distributed database
safety properties (https://jepsen.io). I started looking at Stolon + PostgreSQL
this week, encountered this behavior, and managed to narrow it down to a single
Postgres node without Stolon at all. Normally I test with a variety of faults
(network partitions, crashes, etc.), but this behavior occurs in healthy
processes without any faults.
This test uses the Jepsen testing library (https://github.com/jepsen-io/jepsen)
and the Elle isolation checker (https://github.com/jepsen-io/elle). If you're
wondering "why would you ever do transactions like this", the Elle paper might
provide some helpful background: https://arxiv.org/abs/2003.10554. We install
Postgres 12.3-1.pgdg100+1 on a Debian 10 node, using the official Postgres
repository at http://apt.postgresql.org/pub/repos/apt/. Each client uses its own
JDBC connection, on a single thread, in a single JVM process. We use the JDBC
postgres driver (org.postgresql/postgresql 42.2.12). The JDK is 1.8.0_40-b25.
Logically, the test performs randomly generated transactions over a set of lists
identified by integer keys. Each operation is either a read, which returns the
current value of the list for a given key, or an append, which adds a unique
element to the end of the list for a given key. In Postgres, we store these
objects in tables like so:
create table if not exists txn0 (id int not null primary key,
sk int not null,
val text)
id is the key, and val stores the comma-separated list elements. sk is a secondary key,
which is unused here. We create three tables like this (txn0, txn1, txn2).
Records are striped across tables by hashing their key.
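To make the striping concrete, here's roughly the idea in Python (the exact hash function is my illustration, not necessarily what Jepsen uses):

```python
# Illustrative sketch of striping keys across tables; the actual hash
# Jepsen uses may differ, this just shows the mapping.
TABLE_COUNT = 3

def table_for_key(key):
    """Map an integer key to one of txn0, txn1, txn2."""
    return "txn" + str(hash(key) % TABLE_COUNT)
```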
We set the session transaction isolation level immediately after opening every
connection:
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE
... and request SERIALIZABLE for each JDBC transaction as well.
Our reads are of the form:
select (val) from txn0 where id = $1
And our writes are of the form:
insert into txn1 as t (id, sk, val) values ($1, $2, $3) on conflict (id) do
update set val = CONCAT(t.val, ',', $4) where t.id = $5
where $1 and $5 are the key, and $2, $3, and $4 are the element we'd like to
append to the list.
You can try these transactions for yourself using Jepsen f47eb25. You'll need a
Jepsen environment; see
https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment for details.
cd jepsen/stolon
lein run test-all -w append --nemesis none --max-writes-per-key 8 --node n1
--just-postgres --concurrency 50 -r 1000
Which typically produces, after about a minute, anomalies like the following:
G2-item #1
Let:
T1 = {:type :ok, :f :txn, :value [[:r 7 [1]] [:append 12 1]], :time 95024280,
:process 5, :index 50}
T2 = {:type :ok, :f :txn, :value [[:append 7 2] [:r 14 nil] [:append 14 1]
[:r 12 nil]], :time 98700211, :process 6, :index 70}
Then:
- T1 < T2, because T1 did not observe T2's append of 2 to 7.
- However, T2 < T1, because T2 observed the initial (nil) state of 12, which
T1 created by appending 1: a contradiction!
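In case it's useful, here's a minimal sketch of that inference (my reconstruction of the idea, not Elle's actual code). An rw (anti-dependency) edge A -> B means A read a state that B later overwrote, so A must precede B in any serial order; a cycle means no serial order exists:

```python
# Sketch of the cycle check Elle performs over inferred dependency
# edges (my reconstruction, not Elle's actual implementation).

def find_cycle(edges):
    """Return True if the directed graph given as (from, to) pairs
    contains a cycle, using depth-first search."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True          # back edge: cycle found
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(n) for n in graph[node]):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(dfs(n) for n in graph)

# T1 -rw-> T2: T1's read of key 7 missed T2's append of 2.
# T2 -rw-> T1: T2's read of key 12 missed T1's append of 1.
edges = [("T1", "T2"), ("T2", "T1")]
print(find_cycle(edges))  # True: two rw edges in a cycle => G2-item
```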
A dependency graph of this anomaly is attached to this email: lines marked `rw`
indicate read-write anti-dependencies between specific operations in each
transaction. Because there are multiple rw edges in this graph, it suggests the
presence of G2-item. It is also possible, of course, that worse anomalies
happened (e.g. aborted read) which caused us to incorrectly infer this causal
graph, but I suspect this is not the case.
You can find a full copy of this particular test run, including a history of
every transaction, Postgres logs, and a pcap file containing all client-server
interactions, at
http://jepsen.io.s3.amazonaws.com/analyses/postgresql-12.3/20200531T215019.000-0400.zip
If you'd like to look at the test code, see
https://github.com/jepsen-io/jepsen/tree/f47eb25ab32529a7b66f1dfdd3b5ef2fc84ed778/stolon/src/jepsen.
Specifically, setup code is here:
... and the workload responsible for constructing and submitting transactions is
here:
These anomalies appear limited to G2-item: I haven't seen G-single (read skew),
cyclic information flow, aborted reads, dirty writes, etc. It also looks as if
every anomaly involves a *nil* read, which suggests (and I know the bug report
guidelines say not to speculate, but experience suggests this might be helpful)
that there is something special about row insertion. In TiDB, for instance, we
found that G2-item anomalies with `select ... for update` were linked to the fact
that TiDB's lock manager couldn't lock keys which hadn't been created yet
(https://jepsen.io/analyses/tidb-2.1.7#select-for-update). I don't know anything
about Postgres' internals, but I hope this is of some use!
Is this... known behavior? Unexpected? Are there configuration flags or client
settings I should double-check? I know this is all a bit much, so I'm happy to
answer any questions you might have. :-)
Sincerely,
--Kyle
Hi Kyle,
On Sun, May 31, 2020 at 7:25 PM Kyle Kingsbury <aphyr@jepsen.io> wrote:
Our reads are of the form:
select (val) from txn0 where id = $1
And our writes are of the form:
insert into txn1 as t (id, sk, val) values ($1, $2, $3) on conflict (id) do
update set val = CONCAT(t.val, ',', $4) where t.id = $5
ON CONFLICT DO UPDATE has some novel behaviors in READ COMMITTED mode,
but they're not supposed to affect SERIALIZABLE mode.
where $1 and $5 are the key, and $2, $3, and $4 are the element we'd like to
append to the list.
You can try these transactions for yourself using Jepsen f47eb25. You'll need a
Jepsen environment; see
https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment for details.
cd jepsen/stolon
lein run test-all -w append --nemesis none --max-writes-per-key 8 --node n1
--just-postgres --concurrency 50 -r 1000
We generally like to produce tests for SSI, ON CONFLICT DO UPDATE, and
anything else involving concurrent behavior using something called
isolation tester:
https://github.com/postgres/postgres/tree/master/src/test/isolation
We may end up writing an isolation test for the issue you reported as
part of an eventual fix. You might find it helpful to review some of
the existing tests.
Which typically produces, after about a minute, anomalies like the following:
G2-item #1
Let:
T1 = {:type :ok, :f :txn, :value [[:r 7 [1]] [:append 12 1]], :time 95024280,
:process 5, :index 50}
T2 = {:type :ok, :f :txn, :value [[:append 7 2] [:r 14 nil] [:append 14 1]
[:r 12 nil]], :time 98700211, :process 6, :index 70}
Then:
- T1 < T2, because T1 did not observe T2's append of 2 to 7.
- However, T2 < T1, because T2 observed the initial (nil) state of 12, which
T1 created by appending 1: a contradiction!
Could you test Postgres 9.5? It would be nice to determine if this is
a new issue, or a regression.
Thanks
--
Peter Geoghegan
On 5/31/20 11:04 PM, Peter Geoghegan wrote:
We generally like to produce tests for SSI, ON CONFLICT DO UPDATE, and
anything else involving concurrent behavior using something called isolation
tester: https://github.com/postgres/postgres/tree/master/src/test/isolation We
may end up writing an isolation test for the issue you reported as part of an
eventual fix. You might find it helpful to review some of the existing tests.
Ah, wonderful! I don't exactly know how to plug Elle's history analysis into
this, but I think it... should be possible to write down some special cases
based on the histories I've seen.
Could you test Postgres 9.5? It would be nice to determine if this is
a new issue, or a regression.
I'll look into that tomorrow morning! :)
I, uh, backed off to running these tests at read committed (which, uh... should
be SI, right?) and found what appear to be scads of SI violations, including
read skew and even *internal* consistency anomalies. Read-only transactions
can... apparently... see changing values of a record? Here's a single
transaction which read key 21, got [1], then read key 21 again, and saw [1 2 3]:
[[:r 21 [1]] [:r 20 [1 2]] [:r 20 [1 2]] [:r 21 [1 2 3]]]
See
http://jepsen.io.s3.amazonaws.com/analyses/postgresql-12.3/20200531T223558.000-0400.zip
-- jepsen.log from 22:36:09,907 to 22:36:09,909:
2020-05-31 22:36:09,907{GMT} INFO [jepsen worker 6] jepsen.util: 6
:invoke :txn [[:r 21 nil] [:r 20 nil] [:r 20 nil] [:r 21 nil]]
...
2020-05-31 22:36:09,909{GMT} INFO [jepsen worker 6] jepsen.util: 6
:ok :txn [[:r 21 [1]] [:r 20 [1 2]] [:r 20 [1 2]] [:r 21 [1 2 3]]]
You can fire up wireshark and point it at the pcap file in n1/ to
double-check--try `tcp.stream eq 4`. The BEGIN statement for this transaction is
at 22:36:09.908115. There are a bunch more anomalies called out in analysis.edn,
if it's helpful.
This looks so weird that I assume I've *got* to be doing it wrong, but trawling
through the source code and pcap trace, I can't see where the mistake is. Maybe
I'll have fresher eyes in the morning. :)
Sincerely,
--Kyle
On Sun, May 31, 2020 at 8:37 PM Kyle Kingsbury <aphyr@jepsen.io> wrote:
This looks so weird that I assume I've *got* to be doing it wrong, but trawling
through the source code and pcap trace, I can't see where the mistake is. Maybe
I'll have fresher eyes in the morning. :)
READ COMMITTED starts each command within a transaction with its own
snapshot, much like Oracle:
https://www.postgresql.org/docs/devel/transaction-iso.html
There cannot be serialization errors with READ COMMITTED mode, and in
general it is a lot more permissive. Probably to the point where it
isn't sensible to test with Jepsen at all.
It would make sense for you to test REPEATABLE READ isolation level
separately, though. It implements snapshot isolation without the added
overhead of the mechanisms that prevent (or are supposed to prevent)
serialization anomalies.
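A toy model of the visible difference (illustrative only, not how Postgres actually implements snapshots): under READ COMMITTED each statement sees the latest committed state, while under REPEATABLE READ the whole transaction reuses the snapshot taken at its first statement.

```python
# Toy model of snapshot timing; illustrative, not Postgres internals.

class Database:
    def __init__(self):
        self.committed = {}   # key -> list value

    def commit(self, key, value):
        self.committed[key] = value

class Transaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation
        self.snapshot = None

    def read(self, key):
        if self.isolation == "read committed":
            # fresh snapshot for every statement
            return self.db.committed.get(key)
        if self.snapshot is None:
            self.snapshot = dict(self.db.committed)  # taken once
        return self.snapshot.get(key)

db = Database()
db.commit(21, [1])

rc = Transaction(db, "read committed")
rr = Transaction(db, "repeatable read")
first_rc, first_rr = rc.read(21), rr.read(21)

db.commit(21, [1, 2, 3])  # a concurrent writer commits

print(rc.read(21))  # [1, 2, 3] -- changed within the transaction
print(rr.read(21))  # [1] -- stable snapshot
```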
--
Peter Geoghegan
Oh! Gosh, yes, that DOES explain it. Somehow I had it in my head that both
RU and RC mapped to SI, and RR & Serializable mapped to SSI. That's the
case in YugabyteDB, but not here!
I'll also see about getting a version of these tests that doesn't involve
ON CONFLICT, in case that's relevant.
--Kyle
On Sun, May 31, 2020, 23:56 Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jun 1, 2020 at 4:05 PM Kyle Kingsbury <aphyr@jepsen.io> wrote:
I'll also see about getting a version of these tests that doesn't involve ON CONFLICT, in case that's relevant.
That should be interesting. I'm wondering if the read of the old
value in the UPDATE case is not done with appropriate predicate locks,
so we miss a graph edge?
Hi Kyle,
On Sun, May 31, 2020 at 9:05 PM Kyle Kingsbury <aphyr@jepsen.io> wrote:
Oh! Gosh, yes, that DOES explain it. Somehow I had it in my head that both RU and RC mapped to SI, and RR & Serializable mapped to SSI. That's the case in YugabyteDB, but not here!
It's confusing because the standard only requires that the isolation
levels avoid certain read phenomena, but implementations are free to
go above and beyond. For example, you can ask Postgres for READ
UNCOMMITTED, but you'll get READ COMMITTED. (So RC, RR, and SI each
provide distinct behavior.)
I'll also see about getting a version of these tests that doesn't involve ON CONFLICT, in case that's relevant.
That would be great. It could easily be relevant.
Thanks
--
Peter Geoghegan
On 6/1/20 12:20 AM, Peter Geoghegan wrote:
It's confusing because the standard only requires that the isolation
levels avoid certain read phenomena, but implementations are free to go
above and beyond. For example, you can ask Postgres for READ UNCOMMITTED,
but you'll get READ COMMITTED. (So RC, RR, and SI each provide distinct
behavior.)
Right, right. I was thinking "Oh, repeatable read is incomparable with snapshot,
so it must be that read committed is snapshot, and repeatable is serializable."
This way around, Postgres "repeatable read" actually gives you behavior that
violates repeatable read! But I understand the pragmatic rationale of "we need 3
levels, and this is the closest mapping we could get to the ANSI SQL names". :)
--Kyle
On Sun, May 31, 2020 at 9:29 PM Kyle Kingsbury <aphyr@jepsen.io> wrote:
Right, right. I was thinking "Oh, repeatable read is incomparable with snapshot, so it must be that read committed is snapshot, and repeatable is serializable."
Right.
We used to call snapshot isolation (i.e., the behavior we now provide
under RR mode) SERIALIZABLE, which was wrong (still is in Oracle).
This was how Postgres worked before the SSI feature was added back in
2011. SSI became the new SERIALIZABLE at that time. Ordinary snapshot
isolation was "demoted" to being called RR mode.
--
Peter Geoghegan
OK! So I've designed a variant of this test which doesn't use ON CONFLICT.
Instead, we do a homebrew sort of upsert: we try to update the row in place by
primary key; if we see zero records updated, we insert a new row, and if *that*
fails due to the primary key conflict, we try the update again, under the theory
that since we now know a copy of the row exists, we should be able to update it.
This isn't bulletproof; even under SERIALIZABLE, Postgres will allow a
transaction to fail to update any rows, then fail to insert due to a primary key
conflict--but because the primary key conflict forces a transaction abort (or
rolling back to a savepoint, which negates the conflict), I think it's still
serializable. The isolation docs explain this, I think.
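In rough form, the control flow looks like this (a Python/sqlite3 sketch just to make the logic concrete; the actual test drives Postgres over JDBC, and these names are mine):

```python
# Sketch of the homebrew upsert: update -> insert -> retry update on
# primary-key conflict. sqlite3 stands in for Postgres here purely to
# illustrate the control flow; this is not the Jepsen test's code.
import sqlite3

def append_element(conn, table, key, element):
    """Append element to the row's comma-separated val, creating
    the row if it doesn't exist yet."""
    cur = conn.execute(
        "UPDATE {} SET val = val || ',' || ? WHERE id = ?".format(table),
        (str(element), key))
    if cur.rowcount > 0:
        return  # the row existed; in-place update succeeded
    try:
        conn.execute(
            "INSERT INTO {} (id, sk, val) VALUES (?, ?, ?)".format(table),
            (key, key, str(element)))
    except sqlite3.IntegrityError:
        # Someone inserted concurrently; the row must exist now.
        conn.execute(
            "UPDATE {} SET val = val || ',' || ? WHERE id = ?".format(table),
            (str(element), key))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn0 (id INTEGER PRIMARY KEY, sk INTEGER, val TEXT)")
append_element(conn, "txn0", 7, 1)  # falls through to the INSERT
append_element(conn, "txn0", 7, 2)  # first UPDATE succeeds
print(conn.execute("SELECT val FROM txn0 WHERE id = 7").fetchone()[0])
# -> 1,2
```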
Unfortunately, I'm still seeing tons of G2-item cycles. Whatever this is, it's
not related to ON CONFLICT.
As a side note, I can confirm that Postgres repeatable read is definitely weaker
than repeatable read--at least as formalized by Adya et al. This makes sense if
you understand repeatable read to mean SI (after all, the SI paper says they're
incomparable!), but the Postgres docs seem to imply Postgres is strictly
*stronger* than the ANSI SQL spec, and I'm not convinced that's entirely true:
https://www.postgresql.org/docs/12/transaction-iso.html
The table also shows that PostgreSQL's Repeatable Read implementation does
not allow phantom reads. Stricter behavior is permitted by the SQL standard...
This is a stronger guarantee than is required by the SQL standard for this
isolation level...
I get the sense that the Postgres docs have already diverged from the ANSI SQL
standard a bit, since SQL 92 only defines three anomalies (P1, P2, P3), and
Postgres defines a fourth: "serialization anomaly".
This results in a sort of weird situation: on the one hand, it's true: we don't
(I think) observe P1 or P2 under Postgres Repeatable Read. On the other hand,
SQL 92 says that the difference between repeatable read and serializable is
*exactly* the prohibition of P3 ("phantom"). Even though all our operations are
performed by primary key, we can observe a distinct difference between Postgres
repeatable read and Postgres serializable.
I can see two ways to reconcile this--one being that Postgres chose the anomaly
interpretation of the SQL spec, and the result is... maybe internally
inconsistent? Or perhaps one of the operations in this workload actually *is* a
predicate operation--maybe by dint of relying on a uniqueness constraint?
I'm surprised the transaction isolation docs don't say something like "Postgres
repeatable read level means snapshot isolation, which is incomparable with SQL
repeatable read." Then it'd be obvious that Postgres repeatable read exhibits
G2-item! Weirdly, the term "snapshot isolation" doesn't show up on the page at all!
--Kyle
On Tue, Jun 2, 2020 at 9:19 AM Kyle Kingsbury <aphyr@jepsen.io> wrote:
OK! So I've designed a variant of this test which doesn't use ON CONFLICT.
Instead, we do a homebrew sort of upsert: we try to update the row in place by
primary key; if we see zero records updated, we insert a new row, and if *that*
fails due to the primary key conflict, we try the update again, under the theory
that since we now know a copy of the row exists, we should be able to update it.
Thanks, but I think that this link is wrong, since you're still using
ON CONFLICT. Correct me if I'm wrong, I believe that you intended to
link to this:
Unfortunately, I'm still seeing tons of G2-item cycles. Whatever this is, it's
not related to ON CONFLICT.
Good to have that confirmed. Obviously we'll need to do more analysis
of the exact circumstances of the anomaly. That might take a while.
I get the sense that the Postgres docs have already diverged from the ANSI SQL
standard a bit, since SQL 92 only defines three anomalies (P1, P2, P3), and
Postgres defines a fourth: "serialization anomaly".
I can see two ways to reconcile this--one being that Postgres chose the anomaly
interpretation of the SQL spec, and the result is... maybe internally
inconsistent? Or perhaps one of the operations in this workload actually *is* a
predicate operation--maybe by dint of relying on a uniqueness constraint?
You might find that "A Critique of ANSI SQL Isolation Levels" provides
useful background information:
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
One section in particular may be of interest:
"ANSI SQL intended to define REPEATABLE READ isolation to exclude all
anomalies except Phantom. The anomaly definition of Table 1 does not
achieve this goal, but the locking definition of Table 2 does. ANSI’s
choice of the term Repeatable Read is doubly unfortunate: (1)
repeatable reads do not give repeatable results, and (2) the industry
had already used the term to mean exactly that: repeatable reads mean
serializable in several products. We recommend that another term be
found for this."
--
Peter Geoghegan
On 6/2/20 12:50 PM, Peter Geoghegan wrote:
On Tue, Jun 2, 2020 at 9:19 AM Kyle Kingsbury <aphyr@jepsen.io> wrote:
OK! So I've designed a variant of this test which doesn't use ON CONFLICT.
Instead, we do a homebrew sort of upsert: we try to update the row in place by
primary key; if we see zero records updated, we insert a new row, and if *that*
fails due to the primary key conflict, we try the update again, under the theory
that since we now know a copy of the row exists, we should be able to update it.
Thanks, but I think that this link is wrong, since you're still using
ON CONFLICT. Correct me if I'm wrong, I believe that you intended to
link to this:
Whoops, yes, that's correct. :-)
Good to have that confirmed. Obviously we'll need to do more analysis
of the exact circumstances of the anomaly. That might take a while.
No worries! Is it still important that I check this behavior with 9.x as well?
You might find that "A Critique of ANSI SQL Isolation Levels" provides
useful background information:
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
This is one of my favorite papers, and precisely why I asked!
One section in particular may be of interest:
"ANSI SQL intended to define REPEATABLE READ isolation to exclude all
anomalies except Phantom. The anomaly definition of Table 1 does not
achieve this goal, but the locking definition of Table 2 does. ANSI’s
choice of the term Repeatable Read is doubly unfortunate: (1)
repeatable reads do not give repeatable results, and (2) the industry
had already used the term to mean exactly that: repeatable reads mean
serializable in several products. We recommend that another term be
found for this."
So... just to confirm, Postgres *did* go along with the anomaly interpretation,
rather than the strict interpretation? It's just weird cuz, like... the Postgres
docs act like SI is stronger than RR, but Berenson et al are pretty clear that's
not how they see it!
--Kyle
On Tue, Jun 2, 2020 at 9:58 AM Kyle Kingsbury <aphyr@jepsen.io> wrote:
No worries! Is it still important that I check this behavior with 9.x as well?
I asked about 9.5 because I think that it's possible (though not
particularly likely) that some of the B-Tree indexing work that went
into Postgres 12 is a factor (predicate locks can be taken against
individual leaf pages, and the way that that works changed slightly).
SSI was verified using extensive stress tests during its initial
development (by Dan Ports), so it's not inconceivable that there was
some kind of subtle regression since that time. That's just a guess,
but it would be nice to eliminate it as a theory.
I'd be surprised if your existing test cases needed any adjustment. My
guess is that this won't take long.
So... just to confirm, Postgres *did* go along with the anomaly interpretation,
rather than the strict interpretation? It's just weird cuz, like... the Postgres
docs act like SI is stronger than RR, but Berenson et al are pretty clear that's
not how they see it!
I wasn't involved in the decision making process that led to that, and
it's possible that those that were weren't even aware of the paper. It
was necessary to shoehorn SSI/true serializability into the existing
isolation levels for compatibility reasons, and those were always
based on the anomaly interpretation.
--
Peter Geoghegan
On Wed, Jun 3, 2020 at 4:19 AM Kyle Kingsbury <aphyr@jepsen.io> wrote:
I'm looking into this, but just by the way, you said:
; OK, so first worrying thing: why can this throw duplicate key errors if
; it's executed with "if not exists"?
(try
(j/execute! conn
[(str "create table if not exists " (table-name i)
....
That's (unfortunately) a known problem under concurrency. It'd be
very nice to fix that, but it's an independent problem relating to DDL
(not just tables, and not just IF EXISTS DDL, but anything modifying
catalogues can race in this way and miss out on "nice" error messages
or the IF EXISTS no-op). Here's a good short summary:
/messages/by-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com
On Tue, Jun 2, 2020 at 10:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
I'd be surprised if your existing test cases needed any adjustment. My
guess is that this won't take long.
You said it takes about a minute in your opening e-mail; how
consistent is this? I note from the Postgres logs you provided that
Postgres starts accepting connections at 2020-05-31 18:50:27.580, and
shows its last log message at 2020-05-31 18:51:29.781 PDT. So it's
suspiciously close to *exactly* one minute. Note that
autovacuum_naptime has as its default '1min'. Your workload probably
generates a lot of index bloat, which may tend to cause autovacuum to
want to delete whole B-Tree leaf pages, which impacts predicate
locking.
Could you check what happens when you reduce autovacuum_naptime to
(say) '5s' in postgresql.conf? Does that change make the G2-item cycle
issue manifest itself earlier? And can you discern any pattern like
that yourself?
It seems kind of inconvenient to run Jepsen -- I suppose I could use
Docker or something like that, but I don't have experience with it.
What do you think is the simplest workflow for somebody that just
wants to recreate your result on a Debian system?
--
Peter Geoghegan
Kyle Kingsbury wrote:
SQL 92 says that the difference between repeatable read and serializable is
*exactly* the prohibition of P3 ("phantom").
You're probably referring to Table-9 in SQL-92, showing
that P3 can happen under Repeatable Read and cannot happen
under Serializable.
But it doesn't say that the *only* difference between RR and Serializable
is avoiding P3. When defining P1, P2, P3, it doesn't explicitly say
that these are the only anomalies that can arise from concurrency.
The PG doc refers to the other cases as "serialization anomalies".
Compared to the manual, https://wiki.postgresql.org/wiki/Serializable
has more in-depth explanations on the specifics of PG serializable
implementation.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 6/3/20 7:51 AM, Daniel Verite wrote:
But it doesn't say that the *only* difference between RR and
Serializable is avoiding P3. When defining P1, P2, P3, it doesn't
explicitly say that these are the only anomalies that can arise from
concurrency.
Ah, yes, now I understand. I've been working with Berenson/Adya too long--I've
taken their conclusions for granted. You're right: the spec comes awfully close,
but doesn't say these are the *only* phenomena possible:
The following phenomena are possible: [description of P1, P2, and P3]
The isolation levels are different with respect to phenomena P1, P2,
and P3.
Berenson et al., in "A Critique of ANSI SQL Isolation Levels", identified this
as a key challenge in interpreting the spec:
The isolation levels are defined by the phenomena they are forbidden
to experience.
But also:
The prominence of the table compared to this extra proviso leads to
a common misconception that disallowing the three phenomena implies
serializability.
Which led Berenson et al. to claim that the SQL spec's definitions are a.)
ambiguous and b.) incomplete. They argue that the "strict" (or, in Adya, the
"anomaly") interpretation is incorrect, and construct a "broad" (Adya:
"preventative") interpretation in terms of a new phenomenon (P0: dirty writes),
and more general forms of P1, P2, and P3. The bulk of section 3 demonstrates
that the anomaly interpretations lead to weird results, and that what the ANSI
spec *intended* was the preventative interpretations.
Strict interpretations A1, A2, and A3 have unintended weaknesses.
The correct interpretations are the Broad ones. We assume in what follows
that ANSI meant to define P1, P2, and P3.
For a more concise overview of this, see Adya's thesis, section 2.3. Adya goes
on to show that the preventative interpretation forbids some serializable
histories, and redefines the ANSI levels again in terms of generalized,
MVCC-friendly phenomena G0, G1, and G2. When I say Postgres violates repeatable
read, I mean in the sense that it allows G2-item, which is prevented by Adya
PL-2.99: repeatable read. See Adya 3.2.4 for his formulation of repeatable read,
which *does* differ from serializability only in terms of predicate-related
phenomena.
http://pmg.csail.mit.edu/papers/adya-phd.pdf
For a compact version of this argument and formalism, see Adya, Liskov, &
O'Neil's "Generalized Isolation Level Definitions":
http://bnrg.cs.berkeley.edu/~adj/cs262/papers/icde00.pdf.
That's why I was asking a few days ago whether Postgres had adopted the anomaly
interpretation: since Postgres implements SI, it made sense that y'all would
have followed Berenson et al. in using the preventative interpretation, or Adya
in using generalized definitions. But... in the Postgres docs, and your comments
in this thread, it seems like y'all are going with the anomaly interpretation
instead. That'd explain why there's this extra "serialization anomaly"
difference between RR and serializable. That'd also explain why the Postgres
docs imply RR is strictly weaker than SI, even though Berenson et al. and Adya
both say RR and SI are incomparable.
Is... that the right way to understand things?
--Kyle
On 6/2/20 7:17 PM, Peter Geoghegan wrote:
On Tue, Jun 2, 2020 at 10:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
I'd be surprised if your existing test cases needed any adjustment. My
guess is that this won't take long.
You said it takes about a minute in your opening e-mail; how
consistent is this? I note from the Postgres logs you provided that
Postgres starts accepting connections at 2020-05-31 18:50:27.580, and
shows its last log message at 2020-05-31 18:51:29.781 PDT. So it's
suspiciously close to *exactly* one minute.
I set the test duration to 60 seconds for those runs, but it'll break in as
little as 10. That's less of a sure thing though. :)
Note that
autovacuum_naptime has as its default '1min'. Your workload probably
generates a lot of index bloat, which may tend to cause autovacuum to
want to delete whole B-Tree leaf pages, which impacts predicate
locking.
With the default (debian) postgresql.conf, which has autovacuum_naptime
commented out (default 1min), I see anomalies at (just picking a random recent
test) 8.16 seconds, 9.76 seconds, and 19.6 seconds. Another run: 28.0 seconds,
32.3 seconds.
Could you check what happens when you reduce autovacuum_naptime to
(say) '5s' in postgresql.conf? Does that change make the G2-item cycle
issue manifest itself earlier? And can you discern any pattern like
that yourself?
It doesn't look like setting autovacuum_naptime makes a difference.
It seems kind of inconvenient to run Jepsen -- I suppose I could use
Docker or something like that, but I don't have experience with it.
What do you think is the simplest workflow for somebody that just
wants to recreate your result on a Debian system?
I am really sorry about that--I know it's not convenient. Jepsen's built for
testing whole distributed systems, and is probably a bit overkill for testing a
single Postgres process. I don't have any experience with Docker, but I think
Docker Compose might be a good option for a single-node system? I apologize--I
*just* started writing this test against Debian Buster a few days ago, and the
existing AWS Marketplace and Docker Compose environments are still on Stretch,
so on top of setting up a Jepsen environment you also gotta do a Debian upgrade.
:'-O
I'll see about writing a version of the test that doesn't use any of the
automation, so you can point it at a local postgres instance. Then all you'll
need is lein and a jdk.
--Kyle
On 6/2/20 7:13 PM, Thomas Munro wrote:
That's (unfortunately) a known problem under concurrency. It'd be
very nice to fix that, but it's an independent problem relating to DDL
(not just tables, and not just IF EXISTS DDL, but anything modifying
catalogues can race in this way and miss out on "nice" error messages
or the IF EXISTS no-op). Here's a good short summary:
Ah, yes, this does explain it, thank you! I was a bit concerned, because I know
Postgres has a reputation for having transactional DDL. I guess this part of the
API isn't. :)
--Kyle
On 6/3/20 10:20 AM, Kyle Kingsbury wrote:
On 6/2/20 7:17 PM, Peter Geoghegan wrote:
It seems kind of inconvenient to run Jepsen -- I suppose I could use
Docker or something like that, but I don't have experience with it. What do
you think is the simplest workflow for somebody that just wants to recreate
your result on a Debian system?
I'll see about writing a version of the test that doesn't use any of
the automation, so you can point it at a local postgres instance.
Then all you'll need is lein and a jdk.
OK, I think we're all set. With Jepsen 0ec25ec3, you should be able to run:
cd stolon;
lein run test-all -w append --max-writes-per-key 4 --concurrency 50 -r 500
--isolation serializable --time-limit 60 --nemesis none --existing-postgres
--node localhost --no-ssh --postgres-user jepsen --postgres-password pw
... and it'll connect to an already-running instance of Postgres on localhost
(or whatever you want to connect to) using the provided username and password.
It expects a DB with the same name as the user. You'll need a JDK (1.8 or
higher), and Leiningen (https://leiningen.org/), which should be pretty easy to
install. :)
Could you test Postgres 9.5? It would be nice to determine if this
is a new issue, or a regression.
I can also confirm that Postgres 9.5.22, 10.13, and 11.8 all exhibit the same
G2-item anomalies as 12.3. It looks like this issue has been here a while!
(Either that, or... my transactions are malformed?).
Is there additional debugging data I can get you that'd be helpful? Or if you'd
like, I can hop in an IRC or whatever kind of chat/video session you'd like to
help you get these Jepsen tests running.
--Kyle