table_tuple_lock's snapshot argument

Started by Heikki Linnakangas · about 1 year ago · 7 messages
#1 Heikki Linnakangas
hlinnaka@iki.fi

/*
 * Lock a tuple in the specified mode.
 *
 * Input parameters:
 *    relation: relation containing tuple (caller must hold suitable lock)
 *    tid: TID of tuple to lock
 *    snapshot: snapshot to use for visibility determinations
 *    cid: current command ID (used for visibility test, and stored into
 *        tuple's cmax if lock is successful)
 *    mode: lock mode desired
 *    wait_policy: what to do if tuple lock is not available
 *    flags:
 *        If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the update chain to
 *        also lock descendant tuples if lock modes don't conflict.
 *        If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain and lock
 *        latest version.
 *
 * Output parameters:
 *    *slot: contains the target tuple
 *    *tmfd: filled in failure cases (see below)
 *
 * Function result may be:
 *    TM_Ok: lock was successfully acquired
 *    TM_Invisible: lock failed because tuple was never visible to us
 *    TM_SelfModified: lock failed because tuple updated by self
 *    TM_Updated: lock failed because tuple updated by other xact
 *    TM_Deleted: lock failed because tuple deleted by other xact
 *    TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
 *
 * In the failure cases other than TM_Invisible and TM_Deleted, the routine
 * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax. See
 * comments for struct TM_FailureData for additional info.
 */
static inline TM_Result
table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
                 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
                 LockWaitPolicy wait_policy, uint8 flags,
                 TM_FailureData *tmfd)

What are the semantics of the 'snapshot' argument? In the heapam
implementation, it's not used for anything. What visibility checks might
the function do in a different implementation? I vaguely remember that
the idea was that the TID might not be sufficient to uniquely identify
the row version in something like zheap, which updates the row in place.
In that case, all the different row versions are represented by the same
TID, and the snapshot identifies the version.
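
To make that concrete: in such an AM, a version lookup would conceptually
need both inputs. A purely hypothetical sketch (no such function exists in
core PostgreSQL):

/* One TID, several versions: the snapshot selects which one to return. */
tuple = inplace_am_fetch_version(rel, tid, snapshot);

heapam needs no such call, because each row version has its own TID.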

There are a few callers of table_tuple_lock:

1. trigger.c: GetTupleForTrigger
2. nodeModifyTable.c
3. nodeLockRows.c
4. execReplication.c

The first three callers pass the EState's snapshot, the same that was
used in a table or index scan that returned the TID. That makes sense.
But the calls in execReplication.c look fishy:

PushActiveSnapshot(GetLatestSnapshot());

res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
                       outslot,
                       GetCurrentCommandId(false),
                       lockmode,
                       LockWaitBlock,
                       0 /* don't follow updates */ ,
                       &tmfd);

PopActiveSnapshot();

if (should_refetch_tuple(res, &tmfd))
goto retry;

Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong.
I think the idea was to push the latest snapshot and use the same
snapshot in the call to table_tuple_lock(). But because each call to
GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as
the active snapshot and passes a *different* snapshot to
table_tuple_lock(). This went wrong in commit 5db6df0c01, which
introduced the update/delete/insert/lock table AM interface. The
argument to table_tuple_lock() was supposed to be GetActiveSnapshot().
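
For illustration, the apparent intent was to take the snapshot once and let
the lock call reuse it. A minimal sketch of that pattern (not the committed
patch):

PushActiveSnapshot(GetLatestSnapshot());

res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
                       outslot,
                       GetCurrentCommandId(false),
                       lockmode,
                       LockWaitBlock,
                       0 /* don't follow updates */ ,
                       &tmfd);

PopActiveSnapshot();

With GetActiveSnapshot(), table_tuple_lock() sees the very snapshot that was
just pushed, instead of a second, slightly newer one.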

However, I think GetLatestSnapshot() is wrong here anyway. None of this
matters for heapam which just ignores the 'snapshot' argument, but let's
assume a different AM that would use the snapshot to identify the tuple
version. The TID was fetched from an index using an index scan with
SnapshotDirty. There's no guarantee that the LatestSnapshot would match
the same tuple version that the index scan found. If an MVCC snapshot is
needed, surely it should be acquired before the index scan, and used for
the index scan as well.

I see three options:

1. Remove the 'snapshot' argument, since it's not used by heapam. If we
get a table AM where a single TID represents multiple row versions, this
will need to be revisited.

2. Rewrite the recheck code in execReplication.c so that it uses the
snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
and use the same snapshot in the index scan and table_tuple_lock().
Acquiring a snapshot isn't free though, so it would be nice to avoid
doing that when the heapam is just going to ignore it anyway. If we go
with this option, I think we could reuse the snapshot that is already
active in most cases, and only take a new snapshot if the tuple was
concurrently updated.

3. Modify the tableam interface so that the index scan can return a more
unique identifier of the tuple version. In heapam, it could be the TID
like today, but a different AM could return some other token. Continue
to use SnapshotDirty in the index scan, but in the call to
table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
the token you got from index_getnext_slot().
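
For illustration, a purely hypothetical sketch of what option 3 could look
like at the interface level (none of these names exist in PostgreSQL today):

/*
 * The index scan hands back an opaque version token instead of a bare
 * TID, and the caller passes that token to the lock call.  For heapam
 * the token would simply wrap the TID; an in-place-update AM could
 * embed whatever it needs to pin one specific row version.
 */
typedef struct TupleVersionToken
{
    ItemPointerData tid;        /* heapam: the TID identifies the version */
    uint64          version_id; /* other AMs: e.g. an undo pointer */
} TupleVersionToken;

res = table_tuple_lock_version(rel, &token, outslot,
                               GetCurrentCommandId(false),
                               lockmode, LockWaitBlock,
                               0 /* don't follow updates */ ,
                               &tmfd);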

Thoughts?

--
Heikki Linnakangas
Neon (https://neon.tech)

#2 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: table_tuple_lock's snapshot argument

Hi!

On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> I see three options:
>
> 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
> get a table AM where a single TID represents multiple row versions, this
> will need to be revisited.
>
> 2. Rewrite the recheck code in execReplication.c so that it uses the
> snapshot in a more consistent fashion. [...]
>
> 3. Modify the tableam interface so that the index scan can return a more
> unique identifier of the tuple version. [...]

Thank you for your finding and analysis. I would vote for #3.
Currently the snapshot argument is not properly used, and the heapam
code doesn't check it. I think allowing a flexible tuple identifier is
essential for the future of the table AM API: any snapshot/visibility
information could then be put inside that tuple identifier if needed.

------
Regards,
Alexander Korotkov
Supabase

#3 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#2)
Re: table_tuple_lock's snapshot argument

On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:

> On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I see three options:
>>
>> 1. Remove the 'snapshot' argument, since it's not used by heapam. [...]
>>
>> 2. Rewrite the recheck code in execReplication.c so that it uses the
>> snapshot in a more consistent fashion. [...]
>>
>> 3. Modify the tableam interface so that the index scan can return a more
>> unique identifier of the tuple version. [...]
>
> Thank you for your finding and analysis. I would vote for #3. [...]

To be more specific, I vote for #1 + #3. Remove the snapshot argument
for now, then work on a flexible tuple identifier.

------
Regards,
Alexander Korotkov
Supabase

#4 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Korotkov (#3)
Re: table_tuple_lock's snapshot argument

On 21/01/2025 12:05, Alexander Korotkov wrote:

> On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> Thank you for your finding and analysis. I would vote for #3. [...]
>
> To be more specific, I vote for #1 + #3. Remove the snapshot argument
> for now, then work on a flexible tuple identifier.

I committed and backpatched the trivial fix for now, just replacing
GetLatestSnapshot() with GetActiveSnapshot(). I got cold feet about just
removing the argument without providing a replacement, because it's not
100% clear to me whether the current situation is broken or not.

To summarize the issue, RelationFindReplTupleByIndex() performs an index
scan with DirtySnapshot, and then calls table_tuple_lock() with a fresh
MVCC snapshot to lock the row. There's no guarantee that the index scan
found the same row version that table_tuple_lock() will lock, if the TID
alone doesn't uniquely identify it. But that's still OK as long as the
key column hasn't changed, like with heapam's HOT updates. I couldn't
convince myself that it's broken nor that it's guaranteed to be correct
with other AMs.
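
Schematically, the window in question (simplified; the ambiguity only
arises for an AM where one TID can carry several row versions):

/*
 * 1. RelationFindReplTupleByIndex() scans the index with SnapshotDirty
 *    and gets a TID into outslot->tts_tid.
 * 2. A concurrent update commits.  heapam assigns the new version a new
 *    TID; an in-place-update AM would reuse the same TID.
 * 3. table_tuple_lock() is called with a fresh MVCC snapshot, so in the
 *    in-place case it could resolve the TID to the new version rather
 *    than the one the index scan matched.
 *
 * With heapam this is harmless as long as the key columns are unchanged
 * (e.g. HOT updates); whether it is safe for other AMs is the open
 * question.
 */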

Thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)

#5 Ranier Vilela
ranier.vf@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: table_tuple_lock's snapshot argument

On Mon, Mar 10, 2025 at 12:22, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> I committed and backpatched the trivial fix for now, just replacing
> GetLatestSnapshot() with GetActiveSnapshot(). [...]
>
> To summarize the issue, RelationFindReplTupleByIndex() performs an index
> scan with DirtySnapshot, and then calls table_tuple_lock() with a fresh
> MVCC snapshot to lock the row. [...]

Just out of curiosity: doesn't *RelationFindReplTupleSeq* in the same
source file suffer from the same issue?

PushActiveSnapshot(GetLatestSnapshot());

res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),

best regards,
Ranier Vilela

#6 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Ranier Vilela (#5)
Re: table_tuple_lock's snapshot argument

On 10/03/2025 18:20, Ranier Vilela wrote:

> Just out of curiosity: doesn't *RelationFindReplTupleSeq* in the same
> source file suffer from the same issue?
>
> PushActiveSnapshot(GetLatestSnapshot());
>
> res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),

:facepalm: yep, it sure does, and FindConflictTuple() too. Thanks, I'll
go fix those too.

--
Heikki Linnakangas
Neon (https://neon.tech)

#7 Ranier Vilela
ranier.vf@gmail.com
In reply to: Heikki Linnakangas (#6)
Re: table_tuple_lock's snapshot argument

On Mon, Mar 10, 2025 at 13:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> On 10/03/2025 18:20, Ranier Vilela wrote:
>> Just out of curiosity: doesn't *RelationFindReplTupleSeq* in the same
>> source file suffer from the same issue?
>
> :facepalm: yep, it sure does, and FindConflictTuple() too. Thanks, I'll
> go fix those too.

Thanks.

best regards,
Ranier Vilela