Read Uncommitted
I present a patch to allow READ UNCOMMITTED that is simple, useful and
efficient. This was previously thought to have no useful definition within
PostgreSQL, though I have identified a use case for diagnostics and
recovery that merits adding a short patch to implement it.
My docs for this are copied here:
In <productname>PostgreSQL</productname>'s <acronym>MVCC</acronym>
architecture, readers are not blocked by writers, so in general
you should have no need for this transaction isolation level.
In general, read uncommitted will return inconsistent results and
wrong answers. If you look at the changes made by a transaction
while it continues to make changes then you may get partial results
from queries, or you may miss index entries that haven't yet been
written. However, if you are reading transactions that are paused
at the end of their execution for whatever reason then you can
see a consistent result.
The main use case for this transaction isolation level is for
investigating or recovering data. Examples of this would be when
inspecting the writes made by a locked or hanging transaction, when
you are running queries on a standby node that is currently paused,
such as when a standby node has halted at a recovery target with
<literal>recovery_target_inclusive = false</literal> or when you
need to inspect changes made by an in-doubt prepared transaction to
decide whether to commit or abort that transaction.
In <productname>PostgreSQL</productname> read uncommitted mode gives
a consistent snapshot of the currently running transactions at the
time the snapshot was taken. Transactions starting after that time
will not be visible, even though they are not yet committed.
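To make the intended workflow concrete, here is a sketch for the in-doubt
prepared transaction case, assuming the proposed patch is applied; the
table name and GID are hypothetical:

    -- Find the in-doubt prepared transaction.
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

    -- Inspect its uncommitted writes under the proposed isolation level.
    BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SELECT * FROM accounts ORDER BY id;   -- hypothetical table
    COMMIT;

    -- Having seen the data, resolve the transaction either way:
    COMMIT PREPARED 'xact_42';            -- or: ROLLBACK PREPARED 'xact_42';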
This is a new and surprising thought, so please review the attached patch.
Please notice that almost all of the infrastructure already exists to
support this, so this patch does very little. It avoids additional locking
on main execution paths and as far as I am aware, does not break anything.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
Attachment: read_uncommitted.v1.patch
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index f8c9655111..ab73463b53 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -238,7 +238,7 @@
Not possible
</entry>
<entry>
- Allowed, but not in PG
+ Allowed, but not generally useful
</entry>
<entry>
Possible
@@ -268,11 +268,10 @@
<para>
In <productname>PostgreSQL</productname>, you can request any of
- the four standard transaction isolation levels, but internally only
- three distinct isolation levels are implemented, i.e. PostgreSQL's
- Read Uncommitted mode behaves like Read Committed. This is because
- it is the only sensible way to map the standard isolation levels to
- PostgreSQL's multiversion concurrency control architecture.
+ the four standard transaction isolation levels, but Read Uncommitted
+ is mostly irrelevant because there are few use cases where this
+ isolation level does anything useful for applications in comparison
+ with PostgreSQL's multiversion concurrency control architecture.
</para>
<para>
@@ -784,6 +783,54 @@ ERROR: could not serialize access due to read/write dependencies among transact
</itemizedlist>
</para>
</sect2>
+
+ <sect2 id="xact-read-uncommitted">
+ <title>Read Uncommitted Isolation Level</title>
+
+ <indexterm>
+ <primary>transaction isolation level</primary>
+ <secondary>read uncommitted</secondary>
+ </indexterm>
+
+ <indexterm>
+ <primary>read uncommitted</primary>
+ </indexterm>
+
+ <para>
+ In <productname>PostgreSQL</productname>'s <acronym>MVCC</acronym>
+ architecture, readers are not blocked by writers, so in general
+ you should have no need for this transaction isolation level.
+ </para>
+
+ <para>
+ In general, read uncommitted will return inconsistent results and
+ wrong answers. If you look at the changes made by a transaction
+ while it continues to make changes then you may get partial results
+ from queries, or you may miss index entries that haven't yet been
+ written. However, if you are reading transactions that are paused
+ at the end of their execution for whatever reason then you can
+ see a consistent result.
+ </para>
+
+ <para>
+ The main use case for this transaction isolation level is for
+ investigating or recovering data. Examples of this would be when
+ inspecting the writes made by a locked or hanging transaction, when
+ you are running queries on a standby node that is currently paused,
+ such as when a standby node has halted at a recovery target with
+ <literal>recovery_target_inclusive = false</literal> or when you
+ need to inspect changes made by an in-doubt prepared transaction to
+ decide whether to commit or abort that transaction.
+ </para>
+
+ <para>
+ In <productname>PostgreSQL</productname> read uncommitted mode gives
+ a consistent snapshot of the currently running transactions at the
+ time the snapshot was taken. Transactions starting after that time
+ will not be visible, even though they are not yet committed.
+ </para>
+ </sect2>
+
</sect1>
<sect1 id="explicit-locking">
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3e3646716f..d669a83418 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1058,7 +1058,7 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
HeapTupleHeaderGetRawXmin(tuple));
- else
+ else if (XactIsoLevel != XACT_READ_UNCOMMITTED)
{
/* it must have aborted or crashed */
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
@@ -1103,6 +1103,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
}
if (XidInMVCCSnapshot(xmax, snapshot))
return true;
+ if (XactIsoLevel == XACT_READ_UNCOMMITTED)
+ return false;
if (TransactionIdDidCommit(xmax))
return false; /* updating transaction committed */
/* it must have aborted or crashed */
@@ -1122,6 +1124,9 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
return true;
+ if (XactIsoLevel == XACT_READ_UNCOMMITTED)
+ return false;
+
if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
{
/* it must have aborted or crashed */
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 13bcbe77de..2335be5306 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1743,10 +1743,45 @@ GetSnapshotData(Snapshot snapshot)
RecentXmin = xmin;
snapshot->xmin = xmin;
- snapshot->xmax = xmax;
- snapshot->xcnt = count;
- snapshot->subxcnt = subcount;
- snapshot->suboverflowed = suboverflowed;
+ if (XactIsoLevel == XACT_READ_UNCOMMITTED)
+ {
+ /*
+ * In XACT_READ_UNCOMMITTED we want a consistent snapshot, just
+ * one that can see data written by transactions currently in
+ * progress. So any transactions started AFTER this point will
+ * still be invisible to us. We don't use the normal latest
+ * Committed+1 because that misses many currently executing xids.
+ * This is safe since we read the value atomically, so we
+ * don't need XidGenLock.
+ *
+ * This is a useful definition of a consistent snapshot when
+ * we want to see the effects of unresolved 2PC transactions
+ * or when recovery has paused. In other cases, transactions
+ * might continue to write and so the results might still be
+ * inconsistent in many cases; caveat emptor.
+ */
+ snapshot->xmax = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+
+ /*
+ * We still need to calculate xmin correctly, so we respect the
+ * normal limits for cleaning up as we scan. This is needed in
+ * recovery in case we want to keep using this snapshot after
+ * the standby is promoted.
+ *
+ * Other values must be zeroed otherwise the snapshot wouldn't
+ * be able to see the uncommitted transactions.
+ */
+ snapshot->xcnt = 0;
+ snapshot->subxcnt = 0;
+ snapshot->suboverflowed = false;
+ }
+ else
+ {
+ snapshot->xmax = xmax;
+ snapshot->xcnt = count;
+ snapshot->subxcnt = subcount;
+ snapshot->suboverflowed = suboverflowed;
+ }
snapshot->curcid = GetCurrentCommandId(false);
On 18.12.2019 13:01, Simon Riggs wrote:
I present a patch to allow READ UNCOMMITTED that is simple, useful and
efficient. This was previously thought to have no useful definition
within PostgreSQL, though I have identified a use case for diagnostics
and recovery that merits adding a short patch to implement it. [...]
As far as I understand, with the "read uncommitted" policy we can see two
versions of the same tuple if it was updated by two transactions, both of
which started before us and committed during table traversal by the
transaction with the "read uncommitted" policy. Certainly "read uncommitted"
means that we are ready to get inconsistent results, but is it really
acceptable to see multiple versions of the same tuple?
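For illustration, one interleaving that could produce such duplication
(a sketch, assuming the proposed patch; table t is hypothetical):

    -- reader: BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    -- reader: SELECT * FROM t;   -- seq scan begins, returns the old
    --                            -- version of row id = 1 from an early page
    -- writer: UPDATE t SET v = v + 1 WHERE id = 1;
    --                            -- new version lands on a later page
    -- reader: -- the scan reaches that later page; the uncommitted new
    --         -- version is visible too, so the same logical row is
    --         -- returned twice in one result set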
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Simon Riggs <simon@2ndquadrant.com> writes:
I present a patch to allow READ UNCOMMITTED that is simple, useful and
efficient.
Won't this break entirely the moment you try to read a tuple containing
toasted-out-of-line values? There's no guarantee that the toast-table
entries haven't been vacuumed away.
I suspect it can also be broken by cases involving, eg, dropped columns.
There are a lot of assumptions in the system that no one will ever try
to read dead tuples.
The fact that you can construct a use-case in which it's good for
something doesn't make it safe in general :-(
regards, tom lane
On Wed, 18 Dec 2019 at 12:11, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
As far as I understand, with the "read uncommitted" policy we can see two
versions of the same tuple if it was updated by two transactions, both of
which started before us and committed during table traversal by the
transaction with the "read uncommitted" policy. Certainly "read uncommitted"
means that we are ready to get inconsistent results, but is it really
acceptable to see multiple versions of the same tuple?
"In general, read uncommitted will return inconsistent results and
wrong answers. If you look at the changes made by a transaction
while it continues to make changes then you may get partial results
from queries, or you may miss index entries that haven't yet been
written. However, if you are reading transactions that are paused
at the end of their execution for whatever reason then you can
see a consistent result."
I think I already covered your concerns in my suggested docs for this
feature.
I'm not suggesting it for general use.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
On Wed, 18 Dec 2019 at 14:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
I present a patch to allow READ UNCOMMITTED that is simple, useful and
efficient.

Won't this break entirely the moment you try to read a tuple containing
toasted-out-of-line values? There's no guarantee that the toast-table
entries haven't been vacuumed away.

I suspect it can also be broken by cases involving, eg, dropped columns.
There are a lot of assumptions in the system that no one will ever try
to read dead tuples.
This was my first concern when I thought about it, but I realised that by
taking a snapshot and then calculating xmin normally, this problem would go
away.
So this won't happen with the proposed patch.
The fact that you can construct a use-case in which it's good for
something doesn't make it safe in general :-(
I agree that safety is a concern, but I don't see any safety issues in the
patch as proposed.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
This was my first concern when I thought about it, but I realised that by taking a snapshot and then calculating xmin normally, this problem would go away.
Why? As soon as a transaction aborts, the TOAST rows can be vacuumed
away, but the READ UNCOMMITTED transaction might've already seen the
main tuple. This is not even a particularly tight race, necessarily,
since for example the table might be scanned, feeding tuples into a
tuplesort, and then the detoating might happen further up in the query
tree after the sort has completed.
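As a timeline, the race might look like this (a sketch with a
hypothetical schema; assumes the proposed patch):

    -- writer: BEGIN;
    -- writer: INSERT INTO doc VALUES (1, repeat('x', 100000)); -- toasted
    -- reader: BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    -- reader: SELECT document FROM doc ORDER BY id;
    --         -- the scan feeds tuples into a tuplesort; 'document' is
    --         -- still just a toast pointer at this point
    -- writer: ROLLBACK;   -- the toast rows become vacuumable immediately
    --         (vacuum or pruning removes the aborted toast rows)
    -- reader: -- the sort completes; detoasting the pointer now finds the
    --         -- toast rows gone, or worse, a reused toast OID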
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:

This was my first concern when I thought about it, but I realised that
by taking a snapshot and then calculating xmin normally, this problem
would go away.

Why? As soon as a transaction aborts...
So this is the same discussion as elsewhere about potentially aborted
transactions...
AFAIK, the worst that happens in that case is that the reading transaction
will end with an ERROR, similar to a serializable error.
And that won't happen in the use cases I've explicitly described this as
being useful for, which is where the writing transactions have completed
executing.
I'm not claiming any useful behavior outside of those specific use cases;
this is not some magic discovery that goes faster - the user has absolutely
no reason to run this whatsoever unless they want to see uncommitted data.
There is a very explicit warning advising against using it for anything
else.
Just consider this part of the recovery toolkit.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
On Wed, Dec 18, 2019 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
So this is the same discussion as elsewhere about potentially aborted transactions...
Yep.
AFAIK, the worst that happens in that case is that the reading transaction will end with an ERROR, similar to a serializable error.
I'm not convinced of that. There's a big difference between a
serializable error, which is an error that is expected to be
user-facing and was designed with that in mind, and just failing a
bunch of random sanity checks all over the backend. If those sanity
checks happen to be less than comprehensive, which I suspect is
likely, there will probably be scenarios where you can crash a backend
and cause a system-wide restart. And you can probably also return just
plain wrong answers to queries in some scenarios.
Just consider this part of the recovery toolkit.
I agree that it would be useful to have a recovery toolkit for reading
uncommitted data, but I think a lot more thought needs to be given to
how such a thing should be designed. If you just add something called
READ UNCOMMITTED, people are going to expect it to have *way* saner
semantics than this will. They'll use it routinely, not just as a
last-ditch mechanism to recover otherwise-lost data. And I'm
reasonably confident that will not work out well.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndquadrant.com> writes:
So this is the same discussion as elsewhere about potentially aborted
transactions...
AFAIK, the worst that happens in that case is that the reading transaction
will end with an ERROR, similar to a serializable error.
No, the worst case is transactions trying to read invalid data, resulting
in either crashes or exploitable security breaches (in the usual vein of
what can go wrong if you can get the C code to follow an invalid pointer).
This seems possible, for example, if you can get a transaction to read
uncommitted data that was written according to some other rowtype than
what the reading transaction thinks the table rowtype is. Casting my eyes
through AlterTableGetLockLevel(), it looks like all the easy ways to break
it like that are safe (for now) because they require AccessExclusiveLock.
But I am quite afraid that we'd introduce security holes by future
reductions of required lock levels --- or else that this feature would be
the sole reason why we couldn't reduce the lock level for some DDL
operation. I'm doubtful that its use-case is worth that.
And that won't happen in the use cases I've explicitly described this as
being useful for, which is where the writing transactions have completed
executing.
My concerns, at least, are not about whether this has any interesting
use-cases. They're about whether the feature can be abused to cause
security problems. I think the odds are fair that that'd be true
even today, and higher that it'd become true sometime in the future.
regards, tom lane
On 12/18/19 10:06 AM, Simon Riggs wrote:
So this is the same discussion as elsewhere about potentially aborted
transactions... AFAIK, the worst that happens in that case is that
the reading transaction will end with an ERROR, similar to a
serializable error.

And that won't happen in the use cases I've explicitly described this
as being useful for, which is where the writing transactions have
completed executing.

I'm not claiming any useful behavior outside of those specific use
cases; this is not some magic discovery that goes faster - the user
has absolutely no reason to run this whatsoever unless they want to
see uncommitted data. There is a very explicit warning advising
against using it for anything else.

Just consider this part of the recovery toolkit.
In that case, don't call it "read uncommitted". Call it some other
thing entirely. Users coming from other databases may request
"read uncommitted" isolation expecting something that works.
Currently, that gets promoted to "read committed" and works. After
your change, that simply breaks and gives them an error.
I was about to write something about security and stability problems,
but Robert and Tom did elsewhere, so I'll just echo their concerns.
Looking at the regression tests, I'm surprised read uncommitted gets
so little test coverage. There's a test in src/test/isolation but
nothing at all in src/test/regression covering this isolation level.
The one in src/test/isolation doesn't look very comprehensive. I'd
at least expect a test that verifies you don't get a syntax error
when you request READ UNCOMMITTED isolation from SQL.
--
Mark Dilger
On Wed, 18 Dec 2019 at 18:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
So this is the same discussion as elsewhere about potentially aborted
transactions...
AFAIK, the worst that happens in that case is that the reading transaction
will end with an ERROR, similar to a serializable error.
No, the worst case is transactions trying to read invalid data, resulting
in either crashes or exploitable security breaches (in the usual vein of
what can go wrong if you can get the C code to follow an invalid pointer).
Yes, but we're not following any pointers as a result of this. The output
is just rows.
This seems possible, for example, if you can get a transaction to read
uncommitted data that was written according to some other rowtype than
what the reading transaction thinks the table rowtype is. Casting my eyes
through AlterTableGetLockLevel(), it looks like all the easy ways to break
it like that are safe (for now) because they require AccessExclusiveLock.
But I am quite afraid that we'd introduce security holes by future
reductions of required lock levels --- or else that this feature would be
the sole reason why we couldn't reduce the lock level for some DDL
operation. I'm doubtful that its use-case is worth that.
I think we can limit it to Read Only transactions without any limitation on
the proposed use cases.
But I'll think some more about that, just in case.
And that won't happen in the use cases I've explicitly described this as
being useful for, which is where the writing transactions have completed
executing.

My concerns, at least, are not about whether this has any interesting
use-cases. They're about whether the feature can be abused to cause
security problems. I think the odds are fair that that'd be true
even today, and higher that it'd become true sometime in the future.
I share your concerns. We have no need or reason to make a quick decision
on this patch.
I submit this patch only as a useful tool for recovering data.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
On 18/12/2019 20:46, Mark Dilger wrote:
On 12/18/19 10:06 AM, Simon Riggs wrote:
Just consider this part of the recovery toolkit.
In that case, don't call it "read uncommitted". Call it some other
thing entirely. Users coming from other databases may request
"read uncommitted" isolation expecting something that works.
Currently, that gets promoted to "read committed" and works. After
your change, that simply breaks and gives them an error.
I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
it shouldn't be just a recovery tool. For a recovery tool, I think a
set-returning function as part of contrib/pageinspect, for example,
would be more appropriate. Then it could also try to be more defensive
against corrupt pages, and be superuser-only.
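As a point of comparison, pageinspect can already show raw tuples,
including uncommitted and dead ones, one page at a time (the functions
here are real; 'doc' is a hypothetical table):

    CREATE EXTENSION pageinspect;
    SELECT lp, t_xmin, t_xmax, t_ctid, t_infomask
    FROM heap_page_items(get_raw_page('doc', 0));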
- Heikki
Greetings,
* Robert Haas (robertmhaas@gmail.com) wrote:
On Wed, Dec 18, 2019 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
Just consider this part of the recovery toolkit.
I agree that it would be useful to have a recovery toolkit for reading
uncommitted data, but I think a lot more thought needs to be given to
how such a thing should be designed. If you just add something called
READ UNCOMMITTED, people are going to expect it to have *way* saner
semantics than this will. They'll use it routinely, not just as a
last-ditch mechanism to recover otherwise-lost data. And I'm
reasonably confident that will not work out well.
+1.
Thanks,
Stephen
Many will want to use it to do aggregation, e.g. a much more efficient COUNT(*), because they want performance and don't care very much about transaction consistency. E.g. they want to compute SUM(sales) by salesperson, region for the past 5 years, and don't care very much if some concurrent transaction aborted in the middle of computing this result.
On 12/18/19, 2:35 PM, "Stephen Frost" <sfrost@snowman.net> wrote:
Greetings,
* Robert Haas (robertmhaas@gmail.com) wrote:
On Wed, Dec 18, 2019 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
Just consider this part of the recovery toolkit.
I agree that it would be useful to have a recovery toolkit for reading
uncommitted data, but I think a lot more thought needs to be given to
how such a thing should be designed. If you just add something called
READ UNCOMMITTED, people are going to expect it to have *way* saner
semantics than this will. They'll use it routinely, not just as a
last-ditch mechanism to recover otherwise-lost data. And I'm
reasonably confident that will not work out well.
+1.
Thanks,
Stephen
"Finnerty, Jim" <jfinnert@amazon.com> writes:
Many will want to use it to do aggregation, e.g. a much more efficient COUNT(*), because they want performance and don't care very much about transaction consistency. E.g. they want to compute SUM(sales) by salesperson, region for the past 5 years, and don't care very much if some concurrent transaction aborted in the middle of computing this result.
It's fairly questionable whether there's any real advantage to be gained
by READ UNCOMMITTED in that sort of scenario --- almost all the tuples
you'd be looking at would be hinted as committed-good, ordinarily, so that
bypassing the relevant checks isn't going to save much. But I take your
point that people would *think* that READ UNCOMMITTED could be used that
way, if they come from some other DBMS. So this reinforces Mark's point
that if we provide something like this, it shouldn't be called READ
UNCOMMITTED. That should be reserved for something that has reasonably
consistent, standards-compliant behavior.
regards, tom lane
On Wed, Dec 18, 2019 at 2:29 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
it shouldn't be just a recovery tool. For a recovery tool, I think a
set-returning function as part of contrib/pageinspect, for example,
would be more appropriate. Then it could also try to be more defensive
against corrupt pages, and be superuser-only.
+1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Over in [1], I became concerned that, although postgres supports
Read Uncommitted transaction isolation (by way of Read Committed
mode), there was very little test coverage for it:
On 12/18/19 10:46 AM, Mark Dilger wrote:
Looking at the regression tests, I'm surprised read uncommitted gets
so little test coverage. There's a test in src/test/isolation but
nothing at all in src/test/regression covering this isolation level.

The one in src/test/isolation doesn't look very comprehensive. I'd
at least expect a test that verifies you don't get a syntax error
when you request READ UNCOMMITTED isolation from SQL.
The attached patch set adds a modicum of test coverage for this.
Do others feel these tests are worth the small run time overhead
they add?
--
Mark Dilger
[1]: /messages/by-id/CANP8+j+mgWfcX9cTPsk7t+1kQCxgyGqHTR5R7suht7mCm_x_hA@mail.gmail.com
Attachment: 0001-regress.patch
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 1338b2b23e..873b81950b 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -690,6 +690,9 @@ select * from twoconstraints;
drop table twoconstraints;
-- check handling of self-conflicts at various isolation levels
create table selfconflict (f1 int primary key, f2 int);
+begin transaction isolation level read uncommitted;
+insert into selfconflict values (1,1), (1,2) on conflict do nothing;
+commit;
begin transaction isolation level read committed;
insert into selfconflict values (1,1), (1,2) on conflict do nothing;
commit;
@@ -699,6 +702,11 @@ commit;
begin transaction isolation level serializable;
insert into selfconflict values (3,1), (3,2) on conflict do nothing;
commit;
+begin transaction isolation level read uncommitted;
+insert into selfconflict values (4,1), (4,2) on conflict(f1) do update set f2 = 0;
+ERROR: ON CONFLICT DO UPDATE command cannot affect row a second time
+HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
+commit;
begin transaction isolation level read committed;
insert into selfconflict values (4,1), (4,2) on conflict(f1) do update set f2 = 0;
ERROR: ON CONFLICT DO UPDATE command cannot affect row a second time
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index 43691cd335..2f68398503 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -421,6 +421,10 @@ drop table twoconstraints;
create table selfconflict (f1 int primary key, f2 int);
+begin transaction isolation level read uncommitted;
+insert into selfconflict values (1,1), (1,2) on conflict do nothing;
+commit;
+
begin transaction isolation level read committed;
insert into selfconflict values (1,1), (1,2) on conflict do nothing;
commit;
@@ -433,6 +437,10 @@ begin transaction isolation level serializable;
insert into selfconflict values (3,1), (3,2) on conflict do nothing;
commit;
+begin transaction isolation level read uncommitted;
+insert into selfconflict values (4,1), (4,2) on conflict(f1) do update set f2 = 0;
+commit;
+
begin transaction isolation level read committed;
insert into selfconflict values (4,1), (4,2) on conflict(f1) do update set f2 = 0;
commit;
Attachment: 0002-isolation.patch
diff --git a/src/test/isolation/expected/toast-isolation.out b/src/test/isolation/expected/toast-isolation.out
new file mode 100644
index 0000000000..5db1e281f8
--- /dev/null
+++ b/src/test/isolation/expected/toast-isolation.out
@@ -0,0 +1,42 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1_insert s2_select s1_update s2_select s1_delete s2_select s1_commit s1_vacuum s2_select s1_reinsert s2_select s1_nullify s2_select s1_vacuum s2_select s2_commit
+step s1_insert: INSERT INTO doc VALUES (3, repeat('3',30000));
+step s2_select: SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC;
+id count sum
+
+1 1 10000
+2 1 20000
+step s1_update: UPDATE doc SET id = 4, document = repeat('4',40000);
+step s2_select: SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC;
+id count sum
+
+1 1 10000
+2 1 20000
+step s1_delete: DELETE FROM doc;
+step s2_select: SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC;
+id count sum
+
+1 1 10000
+2 1 20000
+step s1_commit: COMMIT;
+step s1_vacuum: VACUUM;
+step s2_select: SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC;
+id count sum
+
+step s1_reinsert: INSERT INTO doc VALUES (5, repeat('5',50000));
+step s2_select: SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC;
+id count sum
+
+5 1 50000
+step s1_nullify: UPDATE doc SET document = NULL;
+step s2_select: SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC;
+id count sum
+
+5 1
+step s1_vacuum: VACUUM;
+step s2_select: SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC;
+id count sum
+
+5 1
+step s2_commit: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index a2fa19230d..59528b7efd 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -88,3 +88,4 @@ test: plpgsql-toast
test: truncate-conflict
test: serializable-parallel
test: serializable-parallel-2
+test: toast-isolation
diff --git a/src/test/isolation/specs/toast-isolation.spec b/src/test/isolation/specs/toast-isolation.spec
new file mode 100644
index 0000000000..d6a538f9eb
--- /dev/null
+++ b/src/test/isolation/specs/toast-isolation.spec
@@ -0,0 +1,41 @@
+# READ UNCOMMITTED isolation level is allowed per SQL spec to expose updates in
+# one uncommitted transaction to a reader in another transaction, though the
+# spec does not require such exposure. PostgreSQL implements this isolation
+# level by using READ COMMITTED, which disallows such exposure. As such, we
+# should not see the updates in session s1 when querying from session s2.
+# However, if we ever change the implementation to have a distinct READ
+# UNCOMMITTED mode, it will still be necessary that the toast table holding the
+# documents and the main table holding the IDs work together rationally. Check
+# that here, as a defence against half-baked future implementations of READ
+# UNCOMMITTED isolation.
+
+setup
+{
+ CREATE TABLE doc (id INTEGER, document TEXT);
+ ALTER TABLE doc ALTER document SET STORAGE external;
+ INSERT INTO doc VALUES (1, repeat('1',10000));
+ INSERT INTO doc VALUES (2, repeat('2',20000));
+}
+
+teardown
+{
+ DROP TABLE doc;
+}
+
+session "s1"
+setup { BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; }
+step "s1_insert" { INSERT INTO doc VALUES (3, repeat('3',30000)); }
+step "s1_update" { UPDATE doc SET id = 4, document = repeat('4',40000); }
+step "s1_delete" { DELETE FROM doc; }
+step "s1_commit" { COMMIT; }
+step "s1_vacuum" { VACUUM; }
+step "s1_reinsert" { INSERT INTO doc VALUES (5, repeat('5',50000)); }
+step "s1_nullify" { UPDATE doc SET document = NULL; }
+
+session "s2"
+setup { BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; }
+step "s2_select" { SELECT id, COUNT(*), SUM(length(document)) FROM doc GROUP BY id ORDER BY id ASC; }
+step "s2_commit" { COMMIT; }
+
+permutation "s1_insert" "s2_select" "s1_update" "s2_select" "s1_delete" "s2_select" "s1_commit" "s1_vacuum" "s2_select" "s1_reinsert" "s2_select" "s1_nullify" "s2_select" "s1_vacuum" "s2_select" "s2_commit"
+
Mark Dilger <hornschnorter@gmail.com> writes:
The one in src/test/isolation doesn't look very comprehensive. I'd
at least expect a test that verifies you don't get a syntax error
when you request READ UNCOMMITTED isolation from SQL.
The attached patch set adds a modicum of test coverage for this.
Do others feel these tests are worth the small run time overhead
they add?
No. As you pointed out yourself, READ UNCOMMITTED is the same as READ
COMMITTED, so there's hardly any point in testing its semantic behavior.
One or two tests that check that it is accepted by the grammar seem
like plenty (and even there, what's there to break? If bison starts
failing us to that extent, we've got bigger problems.)
Obviously, if we made it behave differently from READ COMMITTED, then
it would need testing ... but the nature and extent of such testing
would depend a lot on what we did to it, so I'm not eager to try to
predict the need in advance.
regards, tom lane
On 12/18/19 2:29 PM, Heikki Linnakangas wrote:
On 18/12/2019 20:46, Mark Dilger wrote:
On 12/18/19 10:06 AM, Simon Riggs wrote:
Just consider this part of the recovery toolkit.
In that case, don't call it "read uncommitted". Call it some other
thing entirely. Users coming from other databases may request
"read uncommitted" isolation expecting something that works.
Currently, that gets promoted to "read committed" and works. After
your change, that simply breaks and gives them an error.

I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
it shouldn't be just a recovery tool. For a recovery tool, I think a
set-returning function as part of contrib/pageinspect, for example,
would be more appropriate. Then it could also try to be more defensive
against corrupt pages, and be superuser-only.
+1.
--
-David
david@pgmasters.net
On Wed, 18 Dec 2019 at 20:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Finnerty, Jim" <jfinnert@amazon.com> writes:
Many will want to use it to do aggregation, e.g. a much more efficient
COUNT(*), because they want performance and don't care very much about
transaction consistency. E.g. they want to compute SUM(sales) by
salesperson, region for the past 5 years, and don't care very much if some
concurrent transaction aborted in the middle of computing this result.

It's fairly questionable whether there's any real advantage to be gained
by READ UNCOMMITTED in that sort of scenario --- almost all the tuples
you'd be looking at would be hinted as committed-good, ordinarily, so that
bypassing the relevant checks isn't going to save much.
Agreed; this was not intended to give any kind of backdoor benefit and I
don't see any, just tears.
But I take your
point that people would *think* that READ UNCOMMITTED could be used that
way, if they come from some other DBMS. So this reinforces Mark's point
that if we provide something like this, it shouldn't be called READ
UNCOMMITTED.
Seems like general agreement on that point from others on this thread.
That should be reserved for something that has reasonably
consistent, standards-compliant behavior.
Since we're discussing it, exactly what standard are we talking about here?
I'm not saying I care about that, just to complete the discussion.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
On Wed, 18 Dec 2019 at 19:29, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 18/12/2019 20:46, Mark Dilger wrote:
On 12/18/19 10:06 AM, Simon Riggs wrote:
Just consider this part of the recovery toolkit.
In that case, don't call it "read uncommitted". Call it some other
thing entirely. Users coming from other databases may request
"read uncommitted" isolation expecting something that works.
Currently, that gets promoted to "read committed" and works. After
your change, that simply breaks and gives them an error.

I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
it shouldn't be just a recovery tool. For a recovery tool, I think a
set-returning function as part of contrib/pageinspect, for example,
would be more appropriate. Then it could also try to be more defensive
against corrupt pages, and be superuser-only.
So the consensus is for a more-specifically named facility.
I was aiming for something that would allow general SELECTs to run with a
snapshot that can see uncommitted xacts, so making it a SRF wouldn't really
allow that.
Not really sure where to go with the UI for this.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
Hi,
On 2019-12-18 18:06:21 +0000, Simon Riggs wrote:
On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:

This was my first concern when I thought about it, but I realised that
by taking a snapshot and then calculating xmin normally, this problem
would go away.

Why? As soon as a transaction aborts...
So this is the same discussion as elsewhere about potentially aborted
transactions...
AFAIK, the worst that happens in that case is that the reading transaction
will end with an ERROR, similar to a serializable error.
I don't think that's all that can happen. E.g. the toast identifier
might have been reused, and there might be a different datum in
there. Which then means we'll end up calling operators on data that's
potentially for a different datatype - it's trivial to cause crashes
that way. And, albeit harder, possible to do more than that.
I think there's plenty other problems too, not just toast. There's
e.g. some parts of the system that access catalogs using a normal
snapshot - which might not actually be consistent, because we have
various places where we have to increment the command counter multiple
times as part of a larger catalog manipulation.
Greetings,
Andres Freund
On 12/18/19 2:17 PM, Tom Lane wrote:
Mark Dilger <hornschnorter@gmail.com> writes:
The one in src/test/isolation doesn't look very comprehensive. I'd
at least expect a test that verifies you don't get a syntax error
when you request READ UNCOMMITTED isolation from SQL.

The attached patch set adds a modicum of test coverage for this.
Do others feel these tests are worth the small run time overhead
they add?

No. As you pointed out yourself, READ UNCOMMITTED is the same as READ
COMMITTED, so there's hardly any point in testing its semantic behavior.
One or two tests that check that it is accepted by the grammar seem
like plenty (and even there, what's there to break? If bison starts
failing us to that extent, we've got bigger problems.)
The lack of testing in the current system is so complete that if you
go into gram.y and remove READ UNCOMMITTED from the grammar, not one
test in check-world fails.
Somebody doing something like what Simon is suggesting might refactor
the code in a way that unintentionally breaks this isolation level, and
we'd not know about it until users started complaining.
The attached patch is pretty cheap. Perhaps you'll like it better?
--
Mark Dilger
Attachment: 0003-grammar.patch
diff --git a/src/test/regress/expected/transactions.out b/src/test/regress/expected/transactions.out
index 1b03310029..2ab223cf8e 100644
--- a/src/test/regress/expected/transactions.out
+++ b/src/test/regress/expected/transactions.out
@@ -40,6 +40,13 @@ SELECT * FROM aggtest;
42 | 324.78
(4 rows)
+-- explicitly stating isolation level
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
+SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
+SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ROLLBACK;
-- Read-only tests
CREATE TABLE writetest (a int);
CREATE TEMPORARY TABLE temptest (a int);
diff --git a/src/test/regress/sql/transactions.sql b/src/test/regress/sql/transactions.sql
index bf1016489d..6335bd52c0 100644
--- a/src/test/regress/sql/transactions.sql
+++ b/src/test/regress/sql/transactions.sql
@@ -33,6 +33,13 @@ SELECT oid FROM pg_class WHERE relname = 'disappear';
-- should have members again
SELECT * FROM aggtest;
+-- explicitly stating isolation level
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
+SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
+SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ROLLBACK;
-- Read-only tests
On Thu, 19 Dec 2019 at 02:22, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-12-18 18:06:21 +0000, Simon Riggs wrote:
On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com>
wrote:This was my first concern when I thought about it, but I realised
that
by taking a snapshot and then calculating xmin normally, this problem
would
go away.
Why? As soon as a transaction aborts...
So this is the same discussion as elsewhere about potentially aborted
transactions...
AFAIK, the worst that happens in that case is that the readingtransaction
will end with an ERROR, similar to a serializable error.
I don't think that's all that can happen. E.g. the toast identifier
might have been reused, and there might be a different datum in
there. Which then means we'll end up calling operators on data that's
potentially for a different datatype - it's trivial to cause crashes
that way. And, albeit harder, possible to do more than that.
On the patch as proposed this wouldn't be possible because a toast row
can't be vacuumed and then reused while holding back xmin, at least as I
understand it.
I think there's plenty other problems too, not just toast. There's
e.g. some parts of the system that access catalogs using a normal
snapshot - which might not actually be consistent, because we have
various places where we have to increment the command counter multiple
times as part of a larger catalog manipulation.
It seems possible that catalog access would be the thing that makes this
difficult. Cache invalidations wouldn't yet have been fired, so that would
lead to rather weird errors, and as you say, potential issues from data
type changes which would not be acceptable in a facility available to
non-superusers.
We could limit that to xacts that don't do DDL, which is a very small % of
xacts, but then those xacts are more likely to be ones you'd want to
recover or investigate.
So I now withdraw this patch as submitted and won't be resubmitting.
Thanks everyone for your input.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
On 2019-12-18 16:14, Simon Riggs wrote:
On Wed, 18 Dec 2019 at 12:11, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

As far as I understand, with the "read uncommitted" policy we can see
two versions of the same tuple if it was updated by two transactions,
both of which started before us and committed during table traversal
by the transaction with the "read uncommitted" policy. Certainly
"read uncommitted" means that we are ready to get inconsistent
results, but is it really acceptable to see multiple versions of the
same tuple?

"In general, read uncommitted will return inconsistent results and
wrong answers. If you look at the changes made by a transaction
while it continues to make changes then you may get partial results
from queries, or you may miss index entries that haven't yet been
written. However, if you are reading transactions that are paused
at the end of their execution for whatever reason then you can
see a consistent result."

I think I already covered your concerns in my suggested docs for this
feature.
Independent of the technical concerns, I don't think the SQL standard
allows the READ UNCOMMITTED level to behave in a way that violates the
logical requirements of the defined database schema. So if we wanted to
add this, we should probably name it something else.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thursday, 19 December 2019 at 00:13 +0000, Simon Riggs wrote:

So the consensus is for a more-specifically named facility.

I was aiming for something that would allow general SELECTs to run
with a snapshot that can see uncommitted xacts, so making it a SRF
wouldn't really allow that.

There's pg_dirtyread() [1], which has been around for a while and
implements an SRF for debugging access to data that has disappeared
under normal circumstances. It's nice not to have to worry about
anything when you are faced with that kind of problem, and for such
use cases I think an SRF serves quite well.
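For reference, usage looks roughly like this (per the pg_dirtyread
README; the table and its column list here are hypothetical, and the
caller must supply the row type):

    CREATE EXTENSION pg_dirtyread;
    SELECT * FROM pg_dirtyread('foo') AS t(bar bigint, baz text);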
[1]: https://github.com/df7cb/pg_dirtyread
Bernd
On Thu, 19 Dec 2019 at 12:42, Bernd Helmle <mailings@oopsware.de> wrote:
On Thursday, 19 December 2019 at 00:13 +0000, Simon Riggs wrote:

So the consensus is for a more-specifically named facility.

I was aiming for something that would allow general SELECTs to run
with a snapshot that can see uncommitted xacts, so making it a SRF
wouldn't really allow that.

There's pg_dirtyread() [1], which has been around for a while and
implements an SRF for debugging access to data that has disappeared
under normal circumstances. It's nice not to have to worry about
anything when you are faced with that kind of problem, and for such
use cases I think an SRF serves quite well.
As an example of an SRF for debugging purposes, sure, but then we already
had the example of pageinspect, which I wrote BTW, so wasn't unfamiliar
with the thought.
Note that pg_dirtyread has got nothing to do with the use cases I
described.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise
On 12/19/19 1:50 AM, Simon Riggs wrote:
It seems possible that catalog access would be the thing that makes this
difficult. Cache invalidations wouldn't yet have been fired, so that
would lead to rather weird errors, and as you say, potential issues from
data type changes which would not be acceptable in a facility available
to non-superusers.

We could limit that to xacts that don't do DDL, which is a very small %
of xacts, but then those xacts are more likely to be ones you'd want to
recover or investigate.

So I now withdraw this patch as submitted and won't be resubmitting.
Oh, I'm sorry to hear that. I thought this feature sounded useful, and
we were working out what its limitations were. What I gathered from
the discussion so far was:
- It should be called something other than READ UNCOMMITTED
- It should only be available to superusers, at least for the initial
implementation
- Extra care might be required to lock catalogs to avoid unsafe
operations that could lead to backends crashing or security
vulnerabilities
- Toast tables need to be handled with care
For the record, in case we revisit this idea in the future, what were
the obstacles that killed this patch?
Tom's point on that third item:
"But I am quite afraid that we'd introduce security holes by future
reductions of required lock levels --- or else that this feature would be
the sole reason why we couldn't reduce the lock level for some DDL
operation. I'm doubtful that its use-case is worth that."
Anybody running SET TRANSACTION ISOLATION LEVEL RECOVERY might
have to get ExclusiveLock on most of the catalog tables. But that
would only be done if somebody starts a transaction using this
isolation level, which is not normal, so it shouldn't be a problem
under normal situations. If the lock level reduction that Tom
mentions was implemented, there would be no problem, as long as the
lock level you reduce to still blocks against ExclusiveLock, which
surely it must. If the transaction running in RECOVERY level isolation
can't get the locks, then it blocks and doesn't help the administrator
who is trying to use this feature, but that is no worse than the
present situation where the feature is entirely absent. When no
catalog changes are in flight, the administrator gets the locks and
can continue inspecting the in-process changes of other transactions.
Robert's point on that fourth item:
As soon as a transaction aborts, the TOAST rows can be vacuumed
away, but the READ UNCOMMITTED transaction might've already seen the
main tuple. This is not even a particularly tight race, necessarily,
since for example the table might be scanned, feeding tuples into a
tuplesort, and then the detoasting might happen further up in the query
tree after the sort has completed.
I don't know if this could be fixed without adding overhead to toast
processing for non-RECOVERY transactions, but perhaps it doesn't need
to be fixed at all. Perhaps you just accept that in RECOVERY mode you
can't see toast data, and instead get NULLs for all such rows. Now,
that could have security implications if somebody defines a policy
where NULL in a toast column means "allow" rather than "deny" for
some issue, but if this RECOVERY mode is limited to superusers, that
isn't such a big objection.
There may be a number of other gotchas still to be resolved, but
abandoning the patch at this stage strikes me as premature.
--
Mark Dilger
On 12/19/19 7:08 AM, Mark Dilger wrote:
and instead get NULLs for all such rows
To clarify, I mean the toasted column is null for rows
where the value was stored in the toast table rather
than stored inline. I'd prefer some special value
that means "this datum unavailable" so that it could
be distinguished from an actual null, but no such value
comes to mind.
--
Mark Dilger
Hi,
On 2019-12-19 09:50:44 +0000, Simon Riggs wrote:
On Thu, 19 Dec 2019 at 02:22, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-12-18 18:06:21 +0000, Simon Riggs wrote:
On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:

This was my first concern when I thought about it, but I realised that
by taking a snapshot and then calculating xmin normally, this problem
would go away.

Why? As soon as a transaction aborts...

So this is the same discussion as elsewhere about potentially aborted
transactions...

AFAIK, the worst that happens in that case is that the reading transaction
will end with an ERROR, similar to a serializable error.
I don't think that's all that can happen. E.g. the toast identifier
might have been reused, and there might be a different datum in
there. Which then means we'll end up calling operators on data that's
potentially for a different datatype - it's trivial to cause crashes
that way. And, albeit harder, possible to do more than that.

On the patch as proposed this wouldn't be possible because a toast row
can't be vacuumed and then reused while holding back xmin, at least as I
understand it.
Vacuum and pruning remove rows where xmin didn't commit, without testing
against the horizon. Which makes sense, because normally there's so far
no snapshot including them. Unless we were to weaken that logic -
which'd have bloat impacts - a snapshot wouldn't guarantee anything
about the non-removal of such tuples, unless I am missing something.
Greetings,
Andres Freund
Hi,
On 2019-12-19 07:08:06 -0800, Mark Dilger wrote:
As soon as a transaction aborts, the TOAST rows can be vacuumed
away, but the READ UNCOMMITTED transaction might've already seen the
main tuple. This is not even a particularly tight race, necessarily,
since for example the table might be scanned, feeding tuples into a
tuplesort, and then the detoasting might happen further up in the query
tree after the sort has completed.

I don't know if this could be fixed without adding overhead to toast
processing for non-RECOVERY transactions, but perhaps it doesn't need
to be fixed at all. Perhaps you just accept that in RECOVERY mode you
can't see toast data, and instead get NULLs for all such rows. Now,
that could have security implications if somebody defines a policy
where NULL in a toast column means "allow" rather than "deny" for
some issue, but if this RECOVERY mode is limited to superusers, that
isn't such a big objection.
I mean, that's just a small part of the issue. You can get *different*
data back for toast columns - incompatible with the datatype, leading to
crashes. You can get *different* data back for the same query, running
it twice, because data that was just inserted can get pruned away if the
inserting transaction aborted.
There may be a number of other gotchas still to be resolved, but
abandoning the patch at this stage strikes me as premature.
I think if we want this feature, you'd have to actually use a much
larger hammer, and change the snapshot logic to include information
about which aborted transactions are visible, and whose rows cannot be
removed. And then vacuuming/hot pruning need to be changed to respect
that. And note that'll affect *all* sessions, not just the one wanting
to use READ UNCOMMITTED.
Greetings,
Andres Freund
On Thu, 19 Dec 2019 at 23:36, Andres Freund <andres@anarazel.de> wrote:
Hi,
On the patch as proposed this wouldn't be possible because a toast row
can't be vacuumed and then reused while holding back xmin, at least as I
understand it.

Vacuum and pruning remove rows where xmin didn't commit, without testing
against the horizon. Which makes sense, because normally there's so far
no snapshot including them. Unless we were to weaken that logic -
which'd have bloat impacts - a snapshot wouldn't guarantee anything
about the non-removal of such tuples, unless I am missing something.
My understanding from reading the above is that Simon didn't propose to
make aborted txns visible, only in-progress uncommitted txns.
Vacuum only removes such rows if the xact is (a) explicitly aborted in clog
or (b) provably not still running. It checks RecentXmin and the running
xids arrays to handle xacts that went away after a crash. Per
TransactionIdIsInProgress() as used by HeapTupleSatisfiesVacuum(). I see
that it's not *quite* as simple as using the RecentXmin threshold, as xacts
newer than RecentXmin may also be seen as not in-progress if they're absent
in the shmem xact arrays and there's no overflow.
But that's OK so long as the only xacts that some sort of read-uncommitted
feature allows to become visible are ones that
satisfy TransactionIdIsInProgress(); they cannot have been vacuumed.
The bigger issue, and the one that IMO makes it impractical to spell this
as "READ UNCOMMITTED", is that an uncommitted txn might've changed the
catalogs so there is no one snapshot that is valid for interpreting all
possible tuples. It'd have to see only txns that have no catalog changes,
or be narrowed to see just *one specific txn* that had catalog changes.
That makes it iffy to spell it as "READ UNCOMMITTED" since we can't
actually make all uncommitted txns visible at once.
I think the suggestions for a SRF based approach might make sense. Perhaps
a few functions (a rough sketch of a possible SQL surface follows the list):
* a function to list all in-progress xids
* a function to list in-progress xids with/without catalog changes (if
possible, unsure if we know that until the commit record is written)
* a function (or procedure?) to execute a read-only SELECT or WITH query
within a faked-up snapshot for some in-progress xact and return a SETOF
RECORD with results. If the txn has catalog changes this would probably
have to coalesce each result field with non-builtin data types to text, or
do some fancy validation to compare the definition in the txn snapshot with
the latest normal snapshot used by the calling session. Ideally this
function could take an array of xids and would query with them all visible
unless there were catalog changes in any of them, then it'd ERROR.
* a function to generate the SQL text for an alias clause that maps the
RECORD returned by the above function, so you can semi-conveniently query
it. (I don't think we have a way for a C callable function to supply a
dynamic resultset type at plan-time to avoid the need for this, do we?
Perhaps if we use a procedure not a function?)
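A rough sketch of what that SQL surface might look like; none of these
functions exist today and every name here is illustrative only:

    -- list in-progress xids, flagging those with catalog changes
    SELECT xid, has_catalog_changes FROM pg_inprogress_xids();

    -- run a read-only query with one in-progress xact made visible;
    -- the caller supplies the column list for the RECORD result
    SELECT *
    FROM pg_query_with_xact_visible('1234'::xid, 'SELECT id, v FROM t')
      AS r(id integer, v text);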
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
Craig Ringer <craig@2ndquadrant.com> writes:
My understanding from reading the above is that Simon didn't propose to
make aborted txns visible, only in-progress uncommitted txns.
Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
at any moment, and there's no interlock that would prevent its generated
data from being removed out from under you at any moment after that.
So there's a race condition, and as Robert observed, the window isn't
necessarily small.
The bigger issue, and the one that IMO makes it impractical to spell this
as "READ UNCOMMITTED", is that an uncommitted txn might've changed the
catalogs so there is no one snapshot that is valid for interpreting all
possible tuples.
In theory that should be okay, because any such tuples would be in
tables you can't access due to the in-progress txn having taken
AccessExclusiveLock on tables it changes the rowtype of. But we keep
looking for ways to reduce the locking requirements for ALTER TABLE
actions --- and as I said upthread, it feels like this feature might
someday be the sole reason why we can't safely reduce lock strength
for some form of ALTER. I can't make a concrete argument for that
though ... maybe it's not really any worse than the situation just
after failure of any DDL-performing txn. But that would bear closer
study than I think it's had.
I think the suggestions for a SRF based approach might make sense.
Yeah, I'd rather do it that way than via a transaction isolation
level. The isolation-level approach seems like people will expect
stronger semantics than we could actually provide.
regards, tom lane
On Fri, 20 Dec 2019 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
My understanding from reading the above is that Simon didn't propose to
make aborted txns visible, only in-progress uncommitted txns.

Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
at any moment, and there's no interlock that would prevent its generated
data from being removed out from under you at any moment after that.
So there's a race condition, and as Robert observed, the window isn't
necessarily small.
Absolutely. Many of the same issues arise in the work on logical decoding
of in-progress xacts for optimistic logical decoding.
Unless such an interlock is added (with all the problems that entails,
again per the in-progress logical decoding thread) that limits this to:
* running in recovery when stopped at a recovery target; or
* peeking at the contents of individual prepared xacts that we can prevent
someone else concurrently aborting/committing
That'd actually cover the only things I'd personally actually want a
feature like this for anyway.
In any case, Simon's yanked the proposal. I'd like to have some way to peek
at the contents of individual uncommited xacts, but it's clearly not going
to be anything called READ UNCOMMITTED that applies to all uncommitted
xacts at once...
I think the suggestions for a SRF based approach might make sense.
Yeah, I'd rather do it that way than via a transaction isolation
level. The isolation-level approach seems like people will expect
stronger semantics than we could actually provide.
Yeah. Definitely not an isolation level.
It'll be interesting to see if this sparks any more narrowly scoped and
targeted ideas, anyway. Thanks for taking the time to think about it.
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise