Fix missing EvalPlanQual recheck for TID scans
Hi all,
I learned that the TID Scan node does not properly implement EvalPlanQual recheck, which has already been noted in comment added in 2002 (commit 6799a6ca21). This does seem to have the potential to manifest in observable and surprising ways when queries are filtering on `ctid`, like for optimistic concurrency control as in https://stackoverflow.com/q/78487441.
Attached is an attempt from me at fixing this issue for TID Scan as well as TID Range Scan which appears to have the same issue.
This is my first time looking at the Postgres source (nor do I work with C on a regular basis), so I will not be surprised to hear that I've done things wrong, but I'm hopeful that this is usable as written. Running `meson test` passes for me, including the added isolation tests, and I confirmed that both of the tests that now result in a negative recheck were previously failing.
In my implementation I am calling TidRangeEval every time TidRangeRecheck is called; let me know if this is a performance concern. It didn't seem that any of the existing TidRangeScanState flags are quite right to know if the bounds have been initialized, but I am happy to add one.
The original issue appears to have been present since the introduction of TID scans in 1999 (commit 6f9ff92cc0) so I imagine it may make sense to backport the fix, although evidently not many people are running into this.
Thanks,
Sophie
Attachments:
0001-Fix-missing-EvalPlanQual-recheck-for-TID-scans.v1.patchapplication/octet-stream; name="=?UTF-8?Q?0001-Fix-missing-EvalPlanQual-recheck-for-TID-scans.v1.patch?="Download+99-8
On Mon, 8 Sept 2025 at 10:31, Sophie Alpert <pg@sophiebits.com> wrote:
I learned that the TID Scan node does not properly implement EvalPlanQual recheck, which has already been noted in comment added in 2002 (commit 6799a6ca21). This does seem to have the potential to manifest in observable and surprising ways when queries are filtering on `ctid`, like for optimistic concurrency control as in https://stackoverflow.com/q/78487441.
Thanks for the report and the patch.
I'll summarise this issue, as you provided an external link which
might one day disappear:
drop table if exists test_optimistic_lock;
create table public.test_optimistic_lock (
id bigserial primary key,
name varchar,
rest_count integer,
update_user varchar,
update_time timestamp without time zone,
create_time timestamp without time zone
);
insert into public.test_optimistic_lock (name, rest_count,
update_time, create_time) values
('Product1', 1, current_timestamp, current_timestamp);
S1: begin;
S1: select ctid, rest_count from public.test_optimistic_lock where
name = 'Product1';
S2: begin;
S2: select ctid, rest_count from public.test_optimistic_lock where
name = 'Product1';
S1: update public.test_optimistic_lock set rest_count = rest_count -
1, update_user = 'A', update_time = current_timestamp where name =
'Product1' and ctid = '(0,1)';
S2: update public.test_optimistic_lock set rest_count = rest_count -
1, update_user = 'A', update_time = current_timestamp where name =
'Product1' and ctid = '(0,1)'; -- waits for S1
S1: commit:
S2: Gets UPDATE 1 when it should get UPDATE 0!!
I agree that this is a bug and we should fix it. The behaviour should
match what you'd get if you ran it with "SET enable_tidscan = 0;".
When that's done, the Seq Scan will have the ctid related expressions
within the qual and it will correctly filter out the non-matching
tuple.
Attached is an attempt from me at fixing this issue for TID Scan as well as TID Range Scan which appears to have the same issue.
I think this is generally pretty good. Here's a quick review:
1. This part is quite annoying:
+ if (node->tss_TidList == NULL)
+ TidListEval(node);
Of course, it's required since ExecReScanTidScan() wiped out that list.
Maybe we can add a boolean variable named tts_inScan to TidScanState
after tss_isCurrentOf (so as it consumes the existing padding bytes
and does not enlarge that struct), and have ExecReScanTidScan() set
that to false rather than wipe out the tss_TidList. Then just reuse
the existing list in TidRecheck(). That should save on the extra
overhead of having to rebuild the list each time there's an EPQ
recheck.
2. For TidRangeRecheck, I don't see why this part is needed:
+ if (!TidRangeEval(node))
+ return false;
The TID range is preserved already and shouldn't need to be recalculated.
3. In TidRecheck(), can you make "match" an "ItemPointer" and do:
match = (ItemPointer) bsearch(...);
4. Can you put this comment above the "if"?
+ if (node->tss_isCurrentOf)
+ /* WHERE CURRENT OF always intends to resolve to the latest tuple */
+ return true;
What you've got there doesn't meet the style we use. Technically there
should be braces because there are multiple lines (per our style (not
C)), but it's a bit needless to do that as the comment makes the same
amount of sense if it goes before the "if".
5. Can you make TidRangeRecheck() look like this?
static bool
TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)
{
/* Recheck the ctid is still within range */
if (ItemPointerCompare(&slot->tts_tid, &node->trss_mintid) < 0 ||
ItemPointerCompare(&slot->tts_tid, &node->trss_maxtid) > 0)
return false;
return true;
}
The reason being that it's closer to how heap_getnextslot_tidrange()
does it, and I'd rather the conditions were kept as similar as
possible to that function.
6. For the tests. It should be ok to make the Tid range scan test do
ctid BETWEEN '(0,1)' AND '(0,1)'. I feel this is more aligned to the
TID Range Scan version of what you're doing in the TID Scan test.
I'll need to give the tests a closer inspection later.
David
Thanks for taking a look.
On Sun, Sep 7, 2025 at 9:51 PM, David Rowley <dgrowleyml@gmail.com> wrote:
I agree that this is a bug and we should fix it. The behaviour should
match what you'd get if you ran it with "SET enable_tidscan = 0;".
I've confirmed that my new tests already pass on current master if `SET enable_tidscan = off;` is added to both session setups.
1. This part is quite annoying:
+ if (node->tss_TidList == NULL)
+ TidListEval(node);Of course, it's required since ExecReScanTidScan() wiped out that list.
Maybe we can add a boolean variable named tts_inScan to TidScanState
after tss_isCurrentOf (so as it consumes the existing padding bytes
and does not enlarge that struct), and have ExecReScanTidScan() set
that to false rather than wipe out the tss_TidList. Then just reuse
the existing list in TidRecheck(). That should save on the extra
overhead of having to rebuild the list each time there's an EPQ
recheck.2. For TidRangeRecheck, I don't see why this part is needed:
+ if (!TidRangeEval(node))
+ return false;The TID range is preserved already and shouldn't need to be recalculated.
I am a bit unclear on the exact logic around what order Next, ReScan, and Recheck are called in, so I may have written the code too defensively. Concretely: For ExecReScanTidScan wiping out tss_TidList, I was thinking that when ExecReScanTidScan is called, there may be changed params that require the list to be recalculated. I also wasn't sure if Recheck can be called without a preceding Next and/or ReScan having been called with the same params, or if it can always rely on those having been called immediately prior. (I had reviewed IndexRecheck as I figured it seems the most likely to be a correct example here, but I wasn't able to infer from how it's written.)
If you could help clarify the guarantees here I'd appreciate it. For what it's worth, if I test removing the tss_TidList clearing in ExecReScanTidScan and recomputation in TidRecheck (of course this is not correct in general), then my `tid1 tidsucceed2 c1 c2` test now fails due to TidRecheck incorrectly returning false.
I'm actually digging in more now and if I'm reading the code correctly, EvalPlanQualStart calls ExecInitNode to create an entirely separate PlanState tree for EPQ so I'm unsure if any of the state is carried over to the Recheck calls? If I'm understandign correctly, then it seems the best we can do in this patch is to leave TidRecheck as I had it but for TidRangeRecheck I can add a node->trss_boundsInitialized flag, so that we recompute it once per EPQ instead of per tuple checked.
(I'll be happy to make the stylistic changes you mentioned.)
Appreciate your time,
Sophie
On Mon, 8 Sept 2025 at 18:57, Sophie Alpert <pg@sophiebits.com> wrote:
I'm actually digging in more now and if I'm reading the code correctly, EvalPlanQualStart calls ExecInitNode to create an entirely separate PlanState tree for EPQ so I'm unsure if any of the state is carried over to the Recheck calls?
Oh, you're right. The EPQ executor start isn't the same as the normal
executor state, so the recalc seems to be needed.
David
Updated patch attached.
On Sun, Sep 7, 2025 at 9:51 PM, David Rowley <dgrowleyml@gmail.com> wrote:
1. This part is quite annoying:
+ if (node->tss_TidList == NULL)
+ TidListEval(node);Of course, it's required since ExecReScanTidScan() wiped out that list.
Given that EPQ uses separate PlanState, I've left this as is.
2. For TidRangeRecheck, I don't see why this part is needed:
+ if (!TidRangeEval(node))
+ return false;The TID range is preserved already and shouldn't need to be recalculated.
I've added a new trss_boundsInitialized flag such that we calculate the range once per EPQ rescan. In order to preserve the semantics when the min or max is NULL, I'm setting trss_mintid/trss_maxtid to have invalid ItemPointers as a sentinel in the cases where TidRangeEval returns false. I added a ItemPointerIsValid assertion given that it's now more relevant to correctness but I can remove it if it feels superfluous. Let me know if there is a more idiomatic way to treat this.
3. In TidRecheck(), can you make "match" an "ItemPointer" and do:
match = (ItemPointer) bsearch(...);
4. Can you put this comment above the "if"?
5. Can you make TidRangeRecheck() look like this?
6. For the tests. It should be ok to make the Tid range scan test do
ctid BETWEEN '(0,1)' AND '(0,1)'. I feel this is more aligned to the
TID Range Scan version of what you're doing in the TID Scan test.
All done.
Do let me know if other changes would be helpful.
Sophie
Attachments:
0001-Fix-missing-EvalPlanQual-recheck-for-TID-scans.v2.patchapplication/octet-stream; name="=?UTF-8?Q?0001-Fix-missing-EvalPlanQual-recheck-for-TID-scans.v2.patch?="Download+130-8
Hi Sophie,
Thanks for the fix. We do have a lot of applications that use ctid to update/delete rows for performance considerations.
On Sep 8, 2025, at 17:15, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 8 Sept 2025 at 18:57, Sophie Alpert <pg@sophiebits.com> wrote:
I'm actually digging in more now and if I'm reading the code correctly, EvalPlanQualStart calls ExecInitNode to create an entirely separate PlanState tree for EPQ so I'm unsure if any of the state is carried over to the Recheck calls?
Oh, you're right. The EPQ executor start isn't the same as the normal
executor state, so the recalc seems to be needed.David
Following the reproduce procedure that David provided, I tested and traced the fix, basically it works for me.
However, I noticed that, when “TidRecheck()” is called, it is actually passed with “epqstate->recheckplanstate” that is always newly created, which means we repeatedly call “TidListEval(node)” and run “bsearch()” when there are multiple row involved in the update. If update qual is just “ctid = <a single ctid>”, that should fine. But if we do “update .. where ctid in (a list of ctid)”, then duplicately call “TidListEval()” might not be a good thing.
As “TidRecheck(TidScanState *node, TupleTableSlot *slot)” gets the second parameter “slot” with the updated slot, so I am thinking the other solution could be that, we just check if the updated slot is visible to current transaction. So I made a local change like this:
```
static bool
TidRecheck(TidScanState *node, TupleTableSlot *slot)
{
HeapTuple tuple;
bool visible;
if (node->tss_isCurrentOf)
/* WHERE CURRENT OF always intends to resolve to the latest tuple */
return true;
tuple = ExecCopySlotHeapTuple(slot);
visible = HeapTupleSatisfiesVisibility(tuple, GetActiveSnapshot(), slot->tts_tableOid);
return visible;
}
```
This seems also work for me. WDYT?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Hi Chao,
Thanks for taking a look.
On Tue, Sep 9, 2025 at 12:52 AM, Chao Li <li.evan.chao@gmail.com> wrote:
However, I noticed that, when “TidRecheck()” is called, it is actually passed with “epqstate->recheckplanstate” that is always newly created, which means we repeatedly call “TidListEval(node)” and run “bsearch()” when there are multiple row involved in the update. If update qual is just “ctid = <a single ctid>”, that should fine. But if we do “update .. where ctid in (a list of ctid)”, then duplicately call “TidListEval()” might not be a good thing.
Based on my current understanding, the EPQ PlanState is shared across all EPQ
checks for a given plan and that ReScan is called at similar times as it is for
the initial access path with, ExecScanFetch delegating to the appropriate *Next
or *Recheck method as each row is processed. Therefore, my patch does not
recompute the TidList for every row.
As “TidRecheck(TidScanState *node, TupleTableSlot *slot)” gets the second parameter “slot” with the updated slot, so I am thinking the other solution could be that, we just check if the updated slot is visible to current transaction. So I made a local change like this:
...
tuple = ExecCopySlotHeapTuple(slot);
visible = HeapTupleSatisfiesVisibility(tuple, GetActiveSnapshot(), slot->tts_tableOid);
This is semantically different from the patch I've proposed. My test
`permutation tid1 tidsucceed2 c1 c2 read` fails against your code, despite
passing with the `SET enable_tidscan = OFF;` flag that I understand we want to
match the behavior of. (In addition, I understand HeapTupleSatisfiesVisibility
is specific to the heap access method and can't be used here, though perhaps
tuple_fetch_row_version could be used instead.)
Sophie
On Sep 10, 2025, at 02:20, Sophie Alpert <pg@sophiebits.com> wrote:
Hi Chao,
Thanks for taking a look.
On Tue, Sep 9, 2025 at 12:52 AM, Chao Li <li.evan.chao@gmail.com> wrote:
However, I noticed that, when “TidRecheck()” is called, it is actually passed with “epqstate->recheckplanstate” that is always newly created, which means we repeatedly call “TidListEval(node)” and run “bsearch()” when there are multiple row involved in the update. If update qual is just “ctid = <a single ctid>”, that should fine. But if we do “update .. where ctid in (a list of ctid)”, then duplicately call “TidListEval()” might not be a good thing.
Based on my current understanding, the EPQ PlanState is shared across all EPQ
checks for a given plan and that ReScan is called at similar times as it is for
the initial access path with, ExecScanFetch delegating to the appropriate *Next
or *Recheck method as each row is processed. Therefore, my patch does not
recompute the TidList for every row.
No, that’s not true. If you extend David’s procedure and use 3 sessions to reproduce the problem, s1 update (0,1), s2 update (0,2) and s3 update (0,1) or (0,2), and trace the backend process of s3, you will see every time when TidRecheck() is called, “node” is brand new, so it has to call TidListEval(node) every time.
As “TidRecheck(TidScanState *node, TupleTableSlot *slot)” gets the second parameter “slot” with the updated slot, so I am thinking the other solution could be that, we just check if the updated slot is visible to current transaction. So I made a local change like this:
...
tuple = ExecCopySlotHeapTuple(slot);
visible = HeapTupleSatisfiesVisibility(tuple, GetActiveSnapshot(), slot->tts_tableOid);This is semantically different from the patch I've proposed. My test
`permutation tid1 tidsucceed2 c1 c2 read` fails against your code, despite
passing with the `SET enable_tidscan = OFF;` flag that I understand we want to
match the behavior of. (In addition, I understand HeapTupleSatisfiesVisibility
is specific to the heap access method and can't be used here, though perhaps
tuple_fetch_row_version could be used instead.)
I think TidScan only applies to HeapTable, so we can use HeapTupleSatisfiesVisibility(0 in TidRecheck().
With my proposal, my theory is that, TidNext() fetches a tuple by ctid, if other concurrent transaction update/delete the tuple, the tuple's visibility changes.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sep 10, 2025, at 13:18, Chao Li <li.evan.chao@gmail.com> wrote:
With my proposal, my theory is that, TidNext() fetches a tuple by ctid, if other concurrent transaction update/delete the tuple, the tuple's visibility changes.
I did some more debugging on this issue today, and I withdraw my previous proposal of checking visibility.
I can confirm that, every time when TidRecheck() is called, “node” is brand new, so the current patch do duplicately calculate TidListEval().
But, why can't we make TidRecheck() to simplify return FALSE?
Looks like the only case where TidRecheck() is called is a concurrent transaction upadated the row, and in that case, ctid have must be changed.
The following case WON’T call TidRecheck():
Case 1. Concurrent delete
=====
S1:
Begin;
delete t where ctid = ‘(0,1)’
S2:
Update t set something where ctid = ‘(0,1)’; // block
S1:
Commit;
S2 will just fail to update the row because the row has been deleted by s1.
Case 2: select for update/share
======
S1:
Begin;
Select * from t where ctid = ‘(0,1)’ for share;
S2:
Update t set something where ctid = ‘(0,1)’; // block
S1:
Commit;
S2 will just successfully update the row, because s1 release the lock and the row is still there.
Case 3: join update
======
S1:
Begin;
Update t2 set something where id = 1;
S2:
Update t2 set something where id = (select id from t where ctid = ‘(0, 1)’); // block
S1: commit
S2 will successfully update t2’s row. In this process, index scan’t recheck against t2 will be called, TidRecheck() against t will not be called.
======
Unless I missed some cases, we can simply return FALSE from TidRecheck().
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, Sep 9, 2025 at 10:18 PM, Chao Li <li.evan.chao@gmail.com> wrote:
No, that’s not true. If you extend David’s procedure and use 3 sessions to reproduce the problem, s1 update (0,1), s2 update (0,2) and s3 update (0,1) or (0,2), and trace the backend process of s3, you will see every time when TidRecheck() is called, “node” is brand new, so it has to call TidListEval(node) every time.
After more investigation I agree you are correct here; in a single query, ReScan is called once for each subsequent tuple being rechecked. (ExecUpdate calls EvalPlanQualBegin which always sets rcplanstate->chgParam.)
On Wed, Sep 10, 2025 at 10:30 PM, Chao Li <li.evan.chao@gmail.com> wrote:
But, why can't we make TidRecheck() to simplify return FALSE?
Looks like the only case where TidRecheck() is called is a concurrent transaction upadated the row, and in that case, ctid have must be changed.
Although returning FALSE is clever, it is only correct if several other unstated preconditions for the function hold, which I am loath to rely on.
And indeed, like I mentioned in my previous message, my isolation test `permutation tid1 tidsucceed2 c1 c2 read` from eval-plan-qual.spec in my patch will fail if Recheck were to return false in this case. Though somewhat contrived, you can imagine this happening with multiple sessions driven by the same application:
setup: two rows exist with ctid=(0,1) and (0,2)
S1: BEGIN;
S2: BEGIN;
S1: UPDATE WHERE ctid=(0,1) RETURNING ctid;
-- returns (0,3), which the application uses in the next query from another session:
S2: UPDATE WHERE ctid=(0,1) OR ctid=(0,3); -- statement snapshot sees (0,1); recheck will see (0,3) and should pass
S1: COMMIT;
S2: COMMIT;
I would not defend this as good code from an application developer but the behavior is observable. So I understand it would be best to match the enable_tidscan = off behavior, which my existing strategy more verifiably does. Of course if the team disagrees with me then I will defer to everyone's better judgement.
I hope it is rare that many tuples need to be rechecked in a single query but if this is common, then I would suggest it would be better to rework the generic EPQ machinery so that EPQ nodes do not need to be not reset for every row, rather than depending on subtle implicit guarantees in the TidRecheck code.
Sophie
On Sat, Sep 13, 2025 at 3:12 PM, Sophie Alpert <pg@sophiebits.com> wrote:
And indeed, like I mentioned in my previous message, my isolation test
`permutation tid1 tidsucceed2 c1 c2 read` from eval-plan-qual.spec in
my patch will fail if Recheck were to return false in this case. Though
somewhat contrived, you can imagine this happening with multiple
sessions driven by the same application:
Another case where returning FALSE does not give the correct behavior is when two relations are involved, only one of which is modified:
S1: BEGIN;
S2: BEGIN;
S1: UPDATE accounts SET balance = balance + 100 WHERE ctid = '(0,1)' RETURNING accountid, balance;
S2: SELECT * FROM accounts JOIN accounts_ext USING (accountid) WHERE accounts_ext.ctid = '(0,1)' FOR UPDATE OF accounts;
S1: COMMIT;
S2: COMMIT;
In my patch the S2 query correctly returns one row, whereas with your proposed change it incorrectly returns none.
accountid|balance|balance2|balance|other|newcol|newcol2
---------+-------+--------+-------+-----+------+-------
checking | 700| 1400| 600|other| 42|
Sophie
On Sep 14, 2025, at 06:12, Sophie Alpert <pg@sophiebits.com> wrote:
And indeed, like I mentioned in my previous message, my isolation test `permutation tid1 tidsucceed2 c1 c2 read` from eval-plan-qual.spec in my patch will fail if Recheck were to return false in this case. Though somewhat contrived, you can imagine this happening with multiple sessions driven by the same application:
setup: two rows exist with ctid=(0,1) and (0,2)
S1: BEGIN;
S2: BEGIN;
S1: UPDATE WHERE ctid=(0,1) RETURNING ctid;
-- returns (0,3), which the application uses in the next query from another session:
S2: UPDATE WHERE ctid=(0,1) OR ctid=(0,3); -- statement snapshot sees (0,1); recheck will see (0,3) and should pass
S1: COMMIT;
S2: COMMIT;I would not defend this as good code from an application developer but the behavior is observable. So I understand it would be best to match the enable_tidscan = off behavior, which my existing strategy more verifiably does. Of course if the team disagrees with me then I will defer to everyone's better judgement.
This is a wrong example and (0,3) should NOT be updated. According to the definition of “read committed”:
https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED
13.2. Transaction Isolation
postgresql.org
“A query sees only data committed before the query began”.
In this example, the update is s1 (0,3) is committed after s2’s update, so s2’s update should not see (0,3).
This behavior can be easily proved with a simple example with master branch:
S1:
```
evantest=# select * from t;
id | a | b
----+---+----
1 | 5 | 20
(1 row)
evantest=# begin;
BEGIN
evantest=*# update t set b=30 where id = 1;
UPDATE 1
evantest=*# insert into t values (2, 3, 4); # this simulate (0,3) in your example
INSERT 0 1
```
S2:
```
evantest=# select * from t;
id | a | b
----+---+----
1 | 5 | 20
(1 row)
evantest=# begin;
BEGIN
evantest=*# update t set b = 30; # block here
```
S1:
```
evantest=*# commit;
COMMIT
```
S2:
```
UPDATE 1
evantest=*# select * from t;
id | a | b
----+---+----
2 | 3 | 4 <=== the newly inserted tuple by s1 is NOT updated
1 | 5 | 30 <=== only the old tuple is updated
(2 rows)
```
This example also proves that your solution of TidListEval() is wrong, it may lead to unexpected update: (0,3) in your example.
It also proves the my proposal of checking visibility should work, because (0,3) is invisible to s2. And my proposal also works for your second example of doing “select for update” in s2, in that case, (0,1). After s1 committed, (0,1) is dead, so select of s2 should return nothing. The behavior can also be easily proved with a non-ctid example:
S1:
```
evantest=# begin;
BEGIN
evantest=*# update t set id=2 where id = 1;
UPDATE 1
```
S2:
```
evantest=# select * from t;
id | a | b
----+---+----
1 | 5 | 30
(1 row)
evantest=# select * from t where id = 1 for update; <=== block here
```
S1:
```
evantest=*# commit;
COMMIT
```
S2:
```
id | a | b
----+---+---
(0 rows) <=== s2 returned nothing, because id=1 is no long valid
```
And this example also proves my solution of checking visibility works.
And from this two examples, always returning FALSE seems to also work. But I am still not 100% sure if there are other use case that returning FALSE may not work. So, feels like checking visibility is a safe solution.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments:
elephant.pngimage/png; name=elephant.png; x-unix-mode=0666Download
On Sun, Sep 14, 2025 at 6:49 PM, Chao Li <li.evan.chao@gmail.com> wrote:
This is a wrong example and (0,3) should NOT be updated. According to the definition of “read committed”:
https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED
“A query sees only data committed before the query began”.
You paraphrased the docs here but did so incorrectly: the actual quote is "a SELECT query (without a FOR UPDATE/SHARE clause) sees only data committed before the query began". We are not discussing the behavior of a plain SELECT query so this description is not relevant. For Update and LockRows, the expected EvalPlanQual behavior is that rows are checked against the predicate twice — once as of the statement snapshot and once as of locking time — and the rows that match both times are used.
In my example with ctid (0,3), the row matches the 'ctid = (0,1) OR ctid = (0,3)' predicate both times. The row is not newly created, so the newly-created row in your example is not analogous.
I continue to believe that my implementation of TidRecheck plainly satisfies the contract for what the scan recheck is meant to do; the fact that it matches the enable_tidscan=OFF behavior is further corroboration of that fact.
Sophie
On Mon, 15 Sept 2025 at 13:49, Chao Li <li.evan.chao@gmail.com> wrote:
This is a wrong example and (0,3) should NOT be updated. According to the definition of “read committed”:
13.2. Transaction Isolation
postgresql.org“A query sees only data committed before the query began”.
I'm afraid your understanding of the documentation isn't correct.
Maybe you've observed some translated version which isn't correct, or
you've made a mistake in your translation back to English.
For the quoted line, I'm not quite sure where that comes from, but I
don't see it as it is in the documents. The particular part of the
documentation that's relevant is in [1]https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED. The following is the
relevant text:
"If the first updater commits, the second updater will ignore the row
if the first updater deleted it, otherwise it will attempt to apply
its operation to the updated version of the row. The search condition
of the command (the WHERE clause) is re-evaluated to see if the
updated version of the row still matches the search condition. If so,
the second updater proceeds with its operation using the updated
version of the row."
And this example also proves my solution of checking visibility works.
The specific part that Sophie aims to fix is the incorrect behaviour
regarding the "re-evaluation" of the updated row. The correct way to
fix that is to ensure the updated tuple matches the WHERE clause of
the statement. Adding some visibility checks in place of that is
complete nonsense. What does need to happen is validation that the
ctid of the updated tuple matches the WHERE clause of the UPDATE
statement. The reason anything special must happen for TID Scan and
TID Range Scan and not Seq Scan is that in Seq Scan the ctid qual will
be part of the scan's qual, whereas in the TID Scan variants, that's
been extracted as it's assumed the scan itself will handle only
getting the required rows. It's just that's currently not enforced
during the recheck.
If you review the ExecScan() comments, you'll see the recheck
function's role is the following:
* A 'recheck method' must also be provided that can check an
* arbitrary tuple of the relation against any qual conditions
* that are implemented internal to the access method.
If you need guidance about how this should be behaving, try with "SET
enable_tidscan = 0;" and see what that does.
David
[1]: https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED
Hi David and Sophie,
If you need guidance about how this should be behaving, try with "SET
enable_tidscan = 0;" and see what that does.David
[1] https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED
Thanks for pointing out “SET enable_tidsan=0”, with that, yes, (0,3) can be updated.
I was reading the official doc at https://www.postgresql.org/docs/current/transaction-iso.html:
```
Read Committed is the default isolation level in PostgreSQL. When a transaction uses this isolation level, a SELECT query (without a FOR UPDATE/SHARE clause) sees only data committed before the query began; it never sees either uncommitted data or changes committed by concurrent transactions during the query's execution. In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run. However, SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. Also note that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes after the first SELECT starts and before the second SELECT starts.
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time.
```
It says that UPDATE will only find target rows that were committed as of the command start time. I think the statement implies that an “update” statement will never update a “future” tuple.
With Sophie’s example, when s2 start “update where ctid=(0,1) or (0,3)”, s1 has not committed yet, so based on the doc, (0,3) should not be updated by s2. But with enable_tidscan off, the implementation actually updated (0,3). Is it a bug of the doc or the implementation of SeqScan as SeqScan also doesn’t have recheck implemented? Or any part of my understanding is wrong?
Also, for the example I put in my previous email, in s1, I inserted a tuple, and s2 didn’t update the inserted row, which complied with the behavior the doc described. So, does PG treat CTID query condition specially? Maybe I missed that part?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sunday, September 14, 2025, Chao Li <li.evan.chao@gmail.com> wrote:
It says that UPDATE will only find target rows that were committed as of
the command start time. I think the statement implies that an “update”
statement will never update a “future” tuple.
It will indeed see a future physical tuple so long as the logical row that
said tuple refers to already was committed, and it found the corresponding
past physical tuple.
Admittedly, I’m unclear on how exactly the system communicates/understands
that the past and future physical tuples refer to same logical row
reliably. In the docs, one is left to assume that feature just works.
David J.
On Sunday, September 14, 2025, Chao Li <li.evan.chao@gmail.com> wrote:
evantest=*# update t set b=30 where id = 1;
evantest=*# update t set b = 30; # block here
```1 | 5 | 30 <=== only the old tuple is updated
Can’t test right now myself but I believe you’ll find this much more
illustrative if you don’t have both S1 and S2 set the same column to the
same value when doing their updates. Also, include ctid in the select
outputs. If the second update would have been a=10 your final output
should be 1,10,30 - I.e. you’d find that both updates applied to id=1
after the second commit finished, and three tuples exist where id=1.
David J.
On Mon, 15 Sept 2025 at 17:32, Chao Li <li.evan.chao@gmail.com> wrote:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time.
It says that UPDATE will only find target rows that were committed as of the command start time. I think the statement implies that an “update” statement will never update a “future” tuple.
I think you've only read the first sentence in that paragraph. Please
read the entire paragraph. You'll see it goes on to explain "it will
attempt to apply its operation to the updated version of the row".
David
On Sep 15, 2025, at 14:27, David Rowley <dgrowleyml@gmail.com> wrote:
I think you've only read the first sentence in that paragraph. Please
read the entire paragraph. You'll see it goes on to explain "it will
attempt to apply its operation to the updated version of the row".
Ah… Sorry for missing the part of the paragraph. Now it’s clear to me.
Then the solution of TidListEval() is correct, and the only issue is repeatedly calling TidListEval() on every recheck.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sep 15, 2025, at 16:12, Chao Li <li.evan.chao@gmail.com> wrote:
Then the solution of TidListEval() is correct, and the only issue is repeatedly calling TidListEval() on every recheck.
I want to add that, the issue of repeatedly calling TidListEval() is not a problem of this patch. I am okay if you decide to defer the problem to a separate patch.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/