Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Started by Kasahara Tatsuhitoalmost 6 years ago25 messages
#1Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com

Hi, I noticed that since PostgreSQL 12, Tid scan increments value of
pg_stat_all_tables.seq_scan. (but not seq_tup_read)

The following is an example.

CREATE TABLE t (c int);
INSERT INTO t SELECT 1;
SET enable_seqscan to off;

(v12 -)
=# EXPLAIN ANALYZE SELECT * FROM t WHERE ctid = '(0,1)';
QUERY PLAN
-------------------------------------------------------------------------------------------
Tid Scan on t (cost=0.00..4.01 rows=1 width=4) (actual
time=0.034..0.035 rows=1 loops=1)
TID Cond: (ctid = '(0,1)'::tid)
Planning Time: 0.341 ms
Execution Time: 0.059 ms
(4 rows)

=# SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 't';
seq_scan | seq_tup_read
----------+--------------
1 | 0
(1 row)

(- v11)
=# EXPLAIN ANALYZE SELECT * FROM t WHERE ctid = '(0,1)';
QUERY PLAN
-------------------------------------------------------------------------------------------
Tid Scan on t (cost=0.00..4.01 rows=1 width=4) (actual
time=0.026..0.027 rows=1 loops=1)
TID Cond: (ctid = '(0,1)'::tid)
Planning Time: 1.003 ms
Execution Time: 0.068 ms
(4 rows)

postgres=# SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables
WHERE relname = 't';
seq_scan | seq_tup_read
----------+--------------
0 | 0
(1 row)

Exactly, this change occurred from following commit.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=147e3722f7e531f15ba389a4d518efe8cd0bd736)
I think, from this commit, TidListEval() came to call
table_beginscan() , so this increments would be happen.

I'm not sure this change whether intention or not, it can confuse some users.

Best regards,
--
NTT Open Source Software Center
Tatsuhito Kasahara

#2Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Kasahara Tatsuhito (#1)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hi.

Attached patch solve this problem.

This patch adds table_beginscan_tid() and call it in TidListEval()
instead of table_beginscan().
table_beginscan_tid() is the same as table_beginscan() but do not set
SO_TYPE_SEQSCAN to flags.

Although I'm not sure this behavior is really problem or not,
it seems to me that previous behavior is more prefer.

Is it worth to apply to HEAD and v12 branch ?

Best regards,

2020年1月27日(月) 14:35 Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com>:

Hi, I noticed that since PostgreSQL 12, Tid scan increments value of
pg_stat_all_tables.seq_scan. (but not seq_tup_read)

The following is an example.

CREATE TABLE t (c int);
INSERT INTO t SELECT 1;
SET enable_seqscan to off;

(v12 -)
=# EXPLAIN ANALYZE SELECT * FROM t WHERE ctid = '(0,1)';
QUERY PLAN
-------------------------------------------------------------------------------------------
Tid Scan on t (cost=0.00..4.01 rows=1 width=4) (actual
time=0.034..0.035 rows=1 loops=1)
TID Cond: (ctid = '(0,1)'::tid)
Planning Time: 0.341 ms
Execution Time: 0.059 ms
(4 rows)

=# SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 't';
seq_scan | seq_tup_read
----------+--------------
1 | 0
(1 row)

(- v11)
=# EXPLAIN ANALYZE SELECT * FROM t WHERE ctid = '(0,1)';
QUERY PLAN
-------------------------------------------------------------------------------------------
Tid Scan on t (cost=0.00..4.01 rows=1 width=4) (actual
time=0.026..0.027 rows=1 loops=1)
TID Cond: (ctid = '(0,1)'::tid)
Planning Time: 1.003 ms
Execution Time: 0.068 ms
(4 rows)

postgres=# SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables
WHERE relname = 't';
seq_scan | seq_tup_read
----------+--------------
0 | 0
(1 row)

Exactly, this change occurred from following commit.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=147e3722f7e531f15ba389a4d518efe8cd0bd736)
I think, from this commit, TidListEval() came to call
table_beginscan() , so this increments would be happen.

I'm not sure this change whether intention or not, it can confuse some users.

Best regards,
--
NTT Open Source Software Center
Tatsuhito Kasahara

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_increments_seqscan_num.patchapplication/octet-stream; name=fix_tidscan_increments_seqscan_num.patchDownload
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f91f717 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,7 +143,7 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
 							tidstate->ss.ps.state->es_snapshot,
 							0, NULL);
 	scan = tidstate->ss.ss_currentScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..de2de9a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -830,6 +830,21 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot,
+				int nkeys, struct ScanKeyData *key)
+{
+	uint32		flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
#3Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#2)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On 2020/01/29 20:06, Kasahara Tatsuhito wrote:

Hi.

Attached patch solve this problem.

This patch adds table_beginscan_tid() and call it in TidListEval()
instead of table_beginscan().
table_beginscan_tid() is the same as table_beginscan() but do not set
SO_TYPE_SEQSCAN to flags.

Although I'm not sure this behavior is really problem or not,
it seems to me that previous behavior is more prefer.

Is it worth to apply to HEAD and v12 branch ?

I've not read the patch yet, but I agree that updating only seq_scan
but not seq_tup_read in Tid Scan sounds strange. IMO at least
both should be update together or neither should be updated.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#3)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hello.

At Wed, 29 Jan 2020 23:24:09 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/01/29 20:06, Kasahara Tatsuhito wrote:

Hi.
Attached patch solve this problem.
This patch adds table_beginscan_tid() and call it in TidListEval()
instead of table_beginscan().
table_beginscan_tid() is the same as table_beginscan() but do not set
SO_TYPE_SEQSCAN to flags.
Although I'm not sure this behavior is really problem or not,
it seems to me that previous behavior is more prefer.
Is it worth to apply to HEAD and v12 branch ?

I've not read the patch yet, but I agree that updating only seq_scan
but not seq_tup_read in Tid Scan sounds strange. IMO at least
both should be update together or neither should be updated.

Basically agreed, but sample scan doesn't increment seq_scan but
increments seq_tup_read.

Aside from that fact, before 147e3722f7 TidScan didn't need a scan
descriptor so didn't call table_beginscan. table_beginscan didn't
increment the counter for bitmapscan and samplescan. The commit
changes TidScan to call beginscan but didn't change table_beginscan
not to increment the counter for tidscans.

From the view of the view pg_stat_*_tables, the counters moves as follows.

increments
scan type table_beginscan?, per scan, per tuple , SO_TYPE flags
=============================================================================
seq scan : yes , seq_scan, seq_tup_read , SO_TYPE_SEQSCAN
index scan : no , idx_scan, idx_tup_fetch , <none>
bitmap scan: yes , idx_scan, idx_tup_fetch , SO_TYPE_BITMAPSCAN
sample scan: yes , <none> , seq_tup_read , SO_TYPE_SAMPLESCAN
TID scan : yes , seq_scan, <none> , <none>

bitmap scan and sample scan are historically excluded by corresponding
flags is_bitmapscan and is_samplescan and the commit c3b23ae457 moved
the work to SO_TYPE_* flags. After 147e3722f7, TID scan has the same
characteristics, that is, it calls table_beginscan but doesn't
increment seq_scan. But it doesn't have corresponding flag value.

I'd rather think that whatever calls table_beginscan should have
corresponding SO_TYPE_* flags. (Note: index scan doesn't call it.)

It would be another issue what we should do for the sample scan case.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hi,

On Thu, Jan 30, 2020 at 10:55 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 29 Jan 2020 23:24:09 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/01/29 20:06, Kasahara Tatsuhito wrote:

Although I'm not sure this behavior is really problem or not,
it seems to me that previous behavior is more prefer.
Is it worth to apply to HEAD and v12 branch ?

I've not read the patch yet, but I agree that updating only seq_scan
but not seq_tup_read in Tid Scan sounds strange. IMO at least
both should be update together or neither should be updated.

Basically agreed, but sample scan doesn't increment seq_scan but
increments seq_tup_read.

Yeah, sample scan's behavior is also bit annoying.

From the view of the view pg_stat_*_tables, the counters moves as follows.

Thanks for your clarification.

TID scan : yes , seq_scan, <none> , <none>

Here is wrong, because TID scan came to have SO_TYPE_SEQSCAN flags
from commit 147e3722f7.

So, currently( v12 and HEAD) TID scan status as following

increments
scan type table_beginscan?, per scan, per tuple , SO_TYPE flags
=============================================================================
TID scan : yes , seq_scan, <none> , SO_TYPE_SEQSCAN

And my patch change the status to following (same as -v11)

increments
scan type table_beginscan?, per scan, per tuple , SO_TYPE flags
=============================================================================
TID scan : yes , <none>, <none> , <none>

I'd rather think that whatever calls table_beginscan should have
corresponding SO_TYPE_* flags. (Note: index scan doesn't call it.)

Agreed.
It may be better to add new flag such like SO_TYPE_TIDSCAN,
and handles some statistics updating and other things.
But it may be a bit overkill, since I want to revert to the previous
behavior this time.

Best regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kasahara Tatsuhito (#5)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

At Thu, 30 Jan 2020 13:30:56 +0900, Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> wrote in

TID scan : yes , seq_scan, <none> , <none>

Here is wrong, because TID scan came to have SO_TYPE_SEQSCAN flags
from commit 147e3722f7.

It is reflectings the discussion below, which means TID scan doesn't
have corresponding SO_TYPE_ value. Currently it is setting
SO_TYPE_SEQSCAN by accedent.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Kyotaro Horiguchi (#6)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On Thu, Jan 30, 2020 at 1:49 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 30 Jan 2020 13:30:56 +0900, Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> wrote in

TID scan : yes , seq_scan, <none> , <none>

Here is wrong, because TID scan came to have SO_TYPE_SEQSCAN flags
from commit 147e3722f7.

It is reflectings the discussion below, which means TID scan doesn't
have corresponding SO_TYPE_ value. Currently it is setting
SO_TYPE_SEQSCAN by accedent.

Ah, sorry I misunderstood..

Upon further investigation, the SO_TYPE_SEQSCAN flag was also used at
heap_beginscan() to determine whether a predicate lock was taken on
the entire relation.

if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
{
/*
* Ensure a missing snapshot is noticed reliably, even if the
* isolation mode means predicate locking isn't performed (and
* therefore the snapshot isn't used here).
*/
Assert(snapshot);
PredicateLockRelation(relation, snapshot);
}

Therefore, it can not simply remove the SO_TYPE_SEQSCAN flag from a TID scan.
To keep the old behavior, I think it would be better to add a new
SO_TYPE_TIDSCAN flag and take a predicate lock on the entire relation.

Attach the v2 patch which change the status to following. (same as
-v11 but have new SO_TYPE_TIDSCAN flag)

increments
scan type table_beginscan?, per scan, per tuple , SO_TYPE flags
=============================================================================
TID scan : yes , <none>, <none> , SO_TYPE_TIDSCAN

Is it acceptable change for HEAD and v12?

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_increments_seqscan_num_v2.patchapplication/octet-stream; name=fix_tidscan_increments_seqscan_num_v2.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 854136e..fde4fb7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1165,7 +1165,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 		scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
 
 	/*
-	 * For seqscan and sample scans in a serializable transaction, acquire a
+	 * For seqscan, sample scans and tid scan in a serializable transaction, acquire a
 	 * predicate lock on the entire relation. This is required not only to
 	 * lock all the matching tuples, but also to conflict with new insertions
 	 * into the table. In an indexscan, we take page locks on the index pages
@@ -1177,7 +1177,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	 * to be at least page-level granularity, but we'd need to add per-tuple
 	 * locking for that.
 	 */
-	if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
+	if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN | SO_TYPE_TIDSCAN))
 	{
 		/*
 		 * Ensure a missing snapshot is noticed reliably, even if the
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f91f717 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,7 +143,7 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
 							tidstate->ss.ps.state->es_snapshot,
 							0, NULL);
 	scan = tidstate->ss.ss_currentScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..f2c069e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -47,18 +47,19 @@ typedef enum ScanOptions
 	SO_TYPE_SEQSCAN = 1 << 0,
 	SO_TYPE_BITMAPSCAN = 1 << 1,
 	SO_TYPE_SAMPLESCAN = 1 << 2,
-	SO_TYPE_ANALYZE = 1 << 3,
+	SO_TYPE_TIDSCAN = 1 << 3,
+	SO_TYPE_ANALYZE = 1 << 4,
 
 	/* several of SO_ALLOW_* may be specified */
 	/* allow or disallow use of access strategy */
-	SO_ALLOW_STRAT = 1 << 4,
+	SO_ALLOW_STRAT = 1 << 5,
 	/* report location to syncscan logic? */
-	SO_ALLOW_SYNC = 1 << 5,
+	SO_ALLOW_SYNC = 1 << 6,
 	/* verify visibility page-at-a-time? */
-	SO_ALLOW_PAGEMODE = 1 << 6,
+	SO_ALLOW_PAGEMODE = 1 << 7,
 
 	/* unregister snapshot at scan end? */
-	SO_TEMP_SNAPSHOT = 1 << 7
+	SO_TEMP_SNAPSHOT = 1 << 8
 } ScanOptions;
 
 /*
@@ -830,6 +831,22 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot,
+				int nkeys, struct ScanKeyData *key)
+{
+	uint32		flags = SO_TYPE_TIDSCAN |
+	SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
#8Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#7)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On 2020/02/01 16:05, Kasahara Tatsuhito wrote:

On Thu, Jan 30, 2020 at 1:49 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 30 Jan 2020 13:30:56 +0900, Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> wrote in

TID scan : yes , seq_scan, <none> , <none>

Here is wrong, because TID scan came to have SO_TYPE_SEQSCAN flags
from commit 147e3722f7.

It is reflectings the discussion below, which means TID scan doesn't
have corresponding SO_TYPE_ value. Currently it is setting
SO_TYPE_SEQSCAN by accedent.

Ah, sorry I misunderstood..

Upon further investigation, the SO_TYPE_SEQSCAN flag was also used at
heap_beginscan() to determine whether a predicate lock was taken on
the entire relation.

if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
{
/*
* Ensure a missing snapshot is noticed reliably, even if the
* isolation mode means predicate locking isn't performed (and
* therefore the snapshot isn't used here).
*/
Assert(snapshot);
PredicateLockRelation(relation, snapshot);
}

Therefore, it can not simply remove the SO_TYPE_SEQSCAN flag from a TID scan.
To keep the old behavior, I think it would be better to add a new
SO_TYPE_TIDSCAN flag and take a predicate lock on the entire relation.

But in the old behavior, PredicateLockRelation() was not called in TidScan case
because its flag was not SO_TYPE_SEQSCAN. No?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#9Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Fujii Masao (#8)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On Mon, Feb 3, 2020 at 4:22 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/02/01 16:05, Kasahara Tatsuhito wrote:

if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
{
/*
* Ensure a missing snapshot is noticed reliably, even if the
* isolation mode means predicate locking isn't performed (and
* therefore the snapshot isn't used here).
*/
Assert(snapshot);
PredicateLockRelation(relation, snapshot);
}

Therefore, it can not simply remove the SO_TYPE_SEQSCAN flag from a TID scan.
To keep the old behavior, I think it would be better to add a new
SO_TYPE_TIDSCAN flag and take a predicate lock on the entire relation.

But in the old behavior, PredicateLockRelation() was not called in TidScan case
because its flag was not SO_TYPE_SEQSCAN. No?

No. Tid scan called PredicateLockRelation() both previous and current.

In the current (v12 and HEAD), Tid scan has SO_TYPE_SEQSCAN flag so
that PredicateLockRelation()is called in Tid scan.
In the Previous (- v11), in heap_beginscan_internal(), checks
is_bitmapscan flags.
If is_bitmapscan is set to false, calls PredicateLockRelation().

(- v11)
if (!is_bitmapscan)
PredicateLockRelation(relation, snapshot);

And in the Tid scan, is_bitmapscan is set to false, so that
PredicateLockRelation()is called in Tid scan.

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#10Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#9)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On 2020/02/03 16:39, Kasahara Tatsuhito wrote:

On Mon, Feb 3, 2020 at 4:22 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/02/01 16:05, Kasahara Tatsuhito wrote:

if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
{
/*
* Ensure a missing snapshot is noticed reliably, even if the
* isolation mode means predicate locking isn't performed (and
* therefore the snapshot isn't used here).
*/
Assert(snapshot);
PredicateLockRelation(relation, snapshot);
}

Therefore, it can not simply remove the SO_TYPE_SEQSCAN flag from a TID scan.
To keep the old behavior, I think it would be better to add a new
SO_TYPE_TIDSCAN flag and take a predicate lock on the entire relation.

But in the old behavior, PredicateLockRelation() was not called in TidScan case
because its flag was not SO_TYPE_SEQSCAN. No?

No. Tid scan called PredicateLockRelation() both previous and current.

In the current (v12 and HEAD), Tid scan has SO_TYPE_SEQSCAN flag so
that PredicateLockRelation()is called in Tid scan.
In the Previous (- v11), in heap_beginscan_internal(), checks
is_bitmapscan flags.
If is_bitmapscan is set to false, calls PredicateLockRelation().

(- v11)
if (!is_bitmapscan)
PredicateLockRelation(relation, snapshot);

And in the Tid scan, is_bitmapscan is set to false, so that
PredicateLockRelation()is called in Tid scan.

Thanks for explaining that! But heap_beginscan_internal() was really
called in TidScan case?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#11Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Fujii Masao (#10)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On Mon, Feb 3, 2020 at 5:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Thanks for explaining that! But heap_beginscan_internal() was really
called in TidScan case?

Oh, you are right.
Tid Scan started calling table_beginscan from v12 (commit 147e3722f7).
So previously(-v11), Tid Scan might never calls heap_beginscan_internal().

Therefore, from v12, Tid scan not only increases the value of
seq_scan, but also acquires a predicate lock.

Best regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

#12Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Kasahara Tatsuhito (#11)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On Mon, Feb 3, 2020 at 6:20 PM Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:

Therefore, from v12, Tid scan not only increases the value of
seq_scan, but also acquires a predicate lock.

Based on further investigation and Fujii's advice, I've summarized
this issue as follows.

From commit 147e3722f7, Tid Scan came to
(A) increments num of seq_scan on pg_stat_*_tables
and
(B) take a predicate lock on the entire relation.

(A) may be confusing to users, so I think it is better to fix it.
For (B), an unexpected serialization error has occurred as follows, so
I think it should be fix.

=========================================================================
[Preparation]
CREATE TABLE tid_test (c1 int, c2 int);
INSERT INTO tid_test SELECT generate_series(1,1000), 0;

[Session-1:]
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
[Session-2:]
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
[Session-1:]
SELECT * FROM tid_test WHERE ctid = '(0,1)';
[Session-2:]
SELECT * FROM tid_test WHERE ctid = '(1,1)';
[Session-1:]
INSERT INTO tid_test SELECT 1001, 10;
[Session-2:]
INSERT INTO tid_test SELECT 1001, 10;
[Session-1:]
COMMIT;
[Session-2:]
COMMIT;

Result:
(-v11): Both session could commit.
(v12-): Session-2 raised error as following because of taking a
predicate lock on the entire table...
--------
ERROR: could not serialize access due to read/write dependencies
among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during
commit attempt.
HINT: The transaction might succeed if retried.
--------
=========================================================================

Attached patch fix both (A) and (B), so that the behavior of Tid Scan
back to the same as before v11.
(As a result, this patch is the same as the one that first attached.)

Best regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_issues.patchapplication/octet-stream; name=fix_tidscan_issues.patchDownload
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f91f717 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,7 +143,7 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
 							tidstate->ss.ps.state->es_snapshot,
 							0, NULL);
 	scan = tidstate->ss.ss_currentScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..de2de9a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -830,6 +830,21 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot,
+				int nkeys, struct ScanKeyData *key)
+{
+	uint32		flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
#13Andres Freund
andres@anarazel.de
In reply to: Kasahara Tatsuhito (#12)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hi,

On 2020-02-05 16:25:25 +0900, Kasahara Tatsuhito wrote:

On Mon, Feb 3, 2020 at 6:20 PM Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:

Therefore, from v12, Tid scan not only increases the value of
seq_scan, but also acquires a predicate lock.

Based on further investigation and Fujii's advice, I've summarized
this issue as follows.

From commit 147e3722f7, Tid Scan came to
(A) increments num of seq_scan on pg_stat_*_tables
and
(B) take a predicate lock on the entire relation.

(A) may be confusing to users, so I think it is better to fix it.
For (B), an unexpected serialization error has occurred as follows, so
I think it should be fix.

I think it'd be good if we could guard against b) via an isolation
test. It's more painful to do that for a), due to the unreliability of
stats at the moment (we have some tests, but they take a long time).

Greetings,

Andres Freund

#14Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Andres Freund (#13)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hi,

On Thu, Feb 6, 2020 at 11:48 AM Andres Freund <andres@anarazel.de> wrote:

I think it'd be good if we could guard against b) via an isolation
test. It's more painful to do that for a), due to the unreliability of
stats at the moment (we have some tests, but they take a long time).

Thanks for your advise, and agreed.

I added a new (but minimal) isolation test for the case of tid scan.
(v12 and HEAD will be failed this test. v11 and HEAD with my patch
will be passed)

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_issues_v2.patchapplication/octet-stream; name=fix_tidscan_issues_v2.patchDownload
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f91f717 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,7 +143,7 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
 							tidstate->ss.ps.state->es_snapshot,
 							0, NULL);
 	scan = tidstate->ss.ss_currentScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..de2de9a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -830,6 +830,21 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot,
+				int nkeys, struct ScanKeyData *key)
+{
+	uint32		flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
diff --git a/src/test/isolation/expected/tid-scan.out b/src/test/isolation/expected/tid-scan.out
new file mode 100644
index 0000000..6801da5
--- /dev/null
+++ b/src/test/isolation/expected/tid-scan.out
@@ -0,0 +1,30 @@
+Parsed test spec with 2 sessions
+
+starting permutation: select1 select2 insert1 insert2 c1 c2
+step select1: SELECT * FROM t WHERE ctid = '(0,1)';
+a              
+
+1              
+step select2: SELECT * FROM t WHERE ctid = '(1,1)';
+a              
+
+227            
+step insert1: INSERT INTO t SELECT 501;
+step insert2: INSERT INTO t SELECT 501;
+step c1: COMMIT;
+step c2: COMMIT;
+
+starting permutation: select1 select2 update1 update2 c1 c2
+step select1: SELECT * FROM t WHERE ctid = '(0,1)';
+a              
+
+1              
+step select2: SELECT * FROM t WHERE ctid = '(1,1)';
+a              
+
+227            
+step update1: UPDATE t SET a = 501 WHERE ctid = '(1,1)';
+step update2: UPDATE t SET a = 501 WHERE ctid = '(0,1)';
+step c1: COMMIT;
+step c2: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
diff --git a/src/test/isolation/specs/tid-scan.spec b/src/test/isolation/specs/tid-scan.spec
new file mode 100644
index 0000000..f50fb8b
--- /dev/null
+++ b/src/test/isolation/specs/tid-scan.spec
@@ -0,0 +1,42 @@
+# TID Scan test
+#
+# This test checks for unnecessary predicate locks on the entire 
+# table during a Tid scan with serializable isolation.
+
+setup
+{
+  CREATE TABLE t (a int);
+  INSERT INTO t SELECT generate_series(1,500);
+}
+
+teardown
+{
+  DROP TABLE t;
+}
+
+session "s1"
+setup
+{
+  BEGIN ISOLATION LEVEL SERIALIZABLE;
+}
+step "select1" { SELECT * FROM t WHERE ctid = '(0,1)'; }
+step "insert1" { INSERT INTO t SELECT 501; }
+step "update1" { UPDATE t SET a = 501 WHERE ctid = '(1,1)'; }
+step "c1" { COMMIT; }
+
+session "s2"
+setup
+{
+  BEGIN ISOLATION LEVEL SERIALIZABLE;
+}
+step "select2" { SELECT * FROM t WHERE ctid = '(1,1)'; }
+step "insert2" { INSERT INTO t SELECT 501; }
+step "update2" { UPDATE t SET a = 501 WHERE ctid = '(0,1)'; }
+step "c2" { COMMIT; }
+
+# This case, both session will be commit.
+permutation "select1" "select2" "insert1" "insert2" "c1" "c2"
+
+# This case, session s2 will be failed because of 
+# serializable isolation violation.
+permutation "select1" "select2" "update1" "update2" "c1" "c2"
#15Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#14)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On 2020/02/06 15:04, Kasahara Tatsuhito wrote:

Hi,

On Thu, Feb 6, 2020 at 11:48 AM Andres Freund <andres@anarazel.de> wrote:

I think it'd be good if we could guard against b) via an isolation
test. It's more painful to do that for a), due to the unreliability of
stats at the moment (we have some tests, but they take a long time).

Thanks for your advise, and agreed.

I added a new (but minimal) isolation test for the case of tid scan.
(v12 and HEAD will be failed this test. v11 and HEAD with my patch
will be passed)

Isn't this test scenario a bit overkill? We can simply test that
as follows, instead.

CREATE TABLE test_tidscan AS SELECT 1 AS id;
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM test_tidscan WHERE ctid = '(0,1)';
SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
COMMIT;

In the expected file, the result of query looking at pg_locks
should be matched with the following.

locktype | mode
----------+------------
tuple | SIReadLock

BTW, in master branch, locktype in that query result is "relation"
because of the issue.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#16Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Fujii Masao (#15)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

HI,

On Thu, Feb 6, 2020 at 3:24 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

I added a new (but minimal) isolation test for the case of tid scan.
(v12 and HEAD will be failed this test. v11 and HEAD with my patch
will be passed)

Isn't this test scenario a bit overkill? We can simply test that
as follows, instead.
CREATE TABLE test_tidscan AS SELECT 1 AS id;
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM test_tidscan WHERE ctid = '(0,1)';
SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
COMMIT;

In the expected file, the result of query looking at pg_locks
should be matched with the following.

locktype | mode
----------+------------
tuple | SIReadLock

Thanks for your reply.
Hmm, it's an simple and might be the better way than adding isolation test.

I added above test case to regress/sql/tidscan.sql.
Attach the patch.

Best regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_issues_v3.patchapplication/octet-stream; name=fix_tidscan_issues_v3.patchDownload
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f91f717 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,7 +143,7 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
 							tidstate->ss.ps.state->es_snapshot,
 							0, NULL);
 	scan = tidstate->ss.ss_currentScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..de2de9a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -830,6 +830,21 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot,
+				int nkeys, struct ScanKeyData *key)
+{
+	uint32		flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
diff --git a/src/test/regress/expected/tidscan.out b/src/test/regress/expected/tidscan.out
index 9b5eb04..13c3c36 100644
--- a/src/test/regress/expected/tidscan.out
+++ b/src/test/regress/expected/tidscan.out
@@ -277,4 +277,20 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 (1 row)
 
 RESET enable_hashjoin;
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+ id 
+----
+  1
+(1 row)
+
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ locktype |    mode    
+----------+------------
+ tuple    | SIReadLock
+(1 row)
+
+ROLLBACK;
 DROP TABLE tidscan;
diff --git a/src/test/regress/sql/tidscan.sql b/src/test/regress/sql/tidscan.sql
index ef05c09..313e0fb 100644
--- a/src/test/regress/sql/tidscan.sql
+++ b/src/test/regress/sql/tidscan.sql
@@ -94,4 +94,11 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 RESET enable_hashjoin;
 
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ROLLBACK;
+
 DROP TABLE tidscan;
#17Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Kasahara Tatsuhito (#16)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On Thu, 6 Feb 2020 at 19:12, Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:

HI,

On Thu, Feb 6, 2020 at 3:24 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

I added a new (but minimal) isolation test for the case of tid scan.
(v12 and HEAD will be failed this test. v11 and HEAD with my patch
will be passed)

Isn't this test scenario a bit overkill? We can simply test that
as follows, instead.
CREATE TABLE test_tidscan AS SELECT 1 AS id;
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM test_tidscan WHERE ctid = '(0,1)';
SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
COMMIT;

In the expected file, the result of query looking at pg_locks
should be matched with the following.

locktype | mode
----------+------------
tuple | SIReadLock

Thanks for your reply.
Hmm, it's an simple and might be the better way than adding isolation test.

I added above test case to regress/sql/tidscan.sql.
Attach the patch.

I've tested predicate locking including promotion cases with v3 patch
and it works fine.

+table_beginscan_tid(Relation rel, Snapshot snapshot,
+               int nkeys, struct ScanKeyData *key)
+{
+   uint32      flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

IIUC setting SO_ALLOW_STRAT, SO_ALLOW_SYNC and SO_ALLOW_PAGEMODE has
no meaning during tid scan. I think v11 also should be the same.

Why did you remove SO_TYPE_TIDSCAN from the previous version patch? I
think it's better to add it and then we can set only SO_TYPE_TIDSCAN
to the scan option of tid scan.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Masahiko Sawada (#17)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hi,

On Thu, Feb 6, 2020 at 11:01 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Thu, 6 Feb 2020 at 19:12, Kasahara Tatsuhito
<kasahara.tatsuhito@gmail.com> wrote:
I've tested predicate locking including promotion cases with v3 patch
and it works fine.

+table_beginscan_tid(Relation rel, Snapshot snapshot,
+               int nkeys, struct ScanKeyData *key)
+{
+   uint32      flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

IIUC setting SO_ALLOW_STRAT, SO_ALLOW_SYNC and SO_ALLOW_PAGEMODE has
no meaning during tid scan. I think v11 also should be the same.

Thanks for your check, and useful advise.
I was wondering if I should keep these flags, but I confirmed that I
can remove these from TidScan's flags.

Why did you remove SO_TYPE_TIDSCAN from the previous version patch? I
think it's better to add it and then we can set only SO_TYPE_TIDSCAN
to the scan option of tid scan.

Because, currently SO_TYPE_TIDSCAN is not used anywhere.
So I thought it might be better to avoid adding a new flag.
However, as you said, this flag may be useful for the future tid scan
feature (like [1]/messages/by-id/CAKJS1f-+JJpm6B_NThUWzFv4007zAjObBXX1CBHE_bH9nOAvSw@mail.gmail.com)

Attach v4 patch. I re-added SO_TYPE_TIDSCAN to ScanOptions and
table_beginscan_tid.
And I remove unnecessary SO_ALLOW_* flags.

Best regards,

[1]: /messages/by-id/CAKJS1f-+JJpm6B_NThUWzFv4007zAjObBXX1CBHE_bH9nOAvSw@mail.gmail.com
/messages/by-id/CAKJS1f-+JJpm6B_NThUWzFv4007zAjObBXX1CBHE_bH9nOAvSw@mail.gmail.com

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_issues_v4.patchapplication/octet-stream; name=fix_tidscan_issues_v4.patchDownload
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f91f717 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,7 +143,7 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
 							tidstate->ss.ps.state->es_snapshot,
 							0, NULL);
 	scan = tidstate->ss.ss_currentScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..162e1b6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -47,18 +47,19 @@ typedef enum ScanOptions
 	SO_TYPE_SEQSCAN = 1 << 0,
 	SO_TYPE_BITMAPSCAN = 1 << 1,
 	SO_TYPE_SAMPLESCAN = 1 << 2,
-	SO_TYPE_ANALYZE = 1 << 3,
+	SO_TYPE_TIDSCAN = 1 << 3,
+	SO_TYPE_ANALYZE = 1 << 4,
 
 	/* several of SO_ALLOW_* may be specified */
 	/* allow or disallow use of access strategy */
-	SO_ALLOW_STRAT = 1 << 4,
+	SO_ALLOW_STRAT = 1 << 5,
 	/* report location to syncscan logic? */
-	SO_ALLOW_SYNC = 1 << 5,
+	SO_ALLOW_SYNC = 1 << 6,
 	/* verify visibility page-at-a-time? */
-	SO_ALLOW_PAGEMODE = 1 << 6,
+	SO_ALLOW_PAGEMODE = 1 << 7,
 
 	/* unregister snapshot at scan end? */
-	SO_TEMP_SNAPSHOT = 1 << 7
+	SO_TEMP_SNAPSHOT = 1 << 8
 } ScanOptions;
 
 /*
@@ -830,6 +831,23 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ * For now, There are no cases where SO_TYPE_TIDSCAN is used, 
+ * but it may be useful to improve future Tid scans.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot,
+				int nkeys, struct ScanKeyData *key)
+{
+	uint32		flags = SO_TYPE_TIDSCAN;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
diff --git a/src/test/regress/expected/tidscan.out b/src/test/regress/expected/tidscan.out
index 9b5eb04..13c3c36 100644
--- a/src/test/regress/expected/tidscan.out
+++ b/src/test/regress/expected/tidscan.out
@@ -277,4 +277,20 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 (1 row)
 
 RESET enable_hashjoin;
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+ id 
+----
+  1
+(1 row)
+
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ locktype |    mode    
+----------+------------
+ tuple    | SIReadLock
+(1 row)
+
+ROLLBACK;
 DROP TABLE tidscan;
diff --git a/src/test/regress/sql/tidscan.sql b/src/test/regress/sql/tidscan.sql
index ef05c09..313e0fb 100644
--- a/src/test/regress/sql/tidscan.sql
+++ b/src/test/regress/sql/tidscan.sql
@@ -94,4 +94,11 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 RESET enable_hashjoin;
 
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ROLLBACK;
+
 DROP TABLE tidscan;
#19Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kasahara Tatsuhito (#18)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

At Fri, 7 Feb 2020 12:27:26 +0900, Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> wrote in

IIUC setting SO_ALLOW_STRAT, SO_ALLOW_SYNC and SO_ALLOW_PAGEMODE has
no meaning during tid scan. I think v11 also should be the same.

Thanks for your check, and useful advise.
I was wondering if I should keep these flags, but I confirmed that I
can remove these from TidScan's flags.

Why did you remove SO_TYPE_TIDSCAN from the previous version patch? I
think it's better to add it and then we can set only SO_TYPE_TIDSCAN
to the scan option of tid scan.

Because, currently SO_TYPE_TIDSCAN is not used anywhere.
So I thought it might be better to avoid adding a new flag.
However, as you said, this flag may be useful for the future tid scan
feature (like [1])

Attach v4 patch. I re-added SO_TYPE_TIDSCAN to ScanOptions and
table_beginscan_tid.
And I remove unnecessary SO_ALLOW_* flags.

+table_beginscan_tid(Relation rel, Snapshot snapshot,
+				int nkeys, struct ScanKeyData *key)
+{
+	uint32		flags = SO_TYPE_TIDSCAN;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);

It seems that nkeys and key are useless. Since every table_beginscan_*
functions have distinct parameter sets, don't we remove them from
table_beginscan_tid?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Kyotaro Horiguchi (#19)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hi,

On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

It seems that nkeys and key are useless. Since every table_beginscan_*
functions have distinct parameter sets, don't we remove them from
table_beginscan_tid?

Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0
and * key is set to NULL,
so these are not used at the moment.

I removed unnecessary arguments from table_beginscan_tid().

Attache the v5 patch.

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_issues_v5.patchapplication/octet-stream; name=fix_tidscan_issues_v5.patchDownload
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f0d4883 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,9 +143,8 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
-							tidstate->ss.ps.state->es_snapshot,
-							0, NULL);
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
+							tidstate->ss.ps.state->es_snapshot);
 	scan = tidstate->ss.ss_currentScanDesc;
 
 	/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..aab927d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -47,18 +47,19 @@ typedef enum ScanOptions
 	SO_TYPE_SEQSCAN = 1 << 0,
 	SO_TYPE_BITMAPSCAN = 1 << 1,
 	SO_TYPE_SAMPLESCAN = 1 << 2,
-	SO_TYPE_ANALYZE = 1 << 3,
+	SO_TYPE_TIDSCAN = 1 << 3,
+	SO_TYPE_ANALYZE = 1 << 4,
 
 	/* several of SO_ALLOW_* may be specified */
 	/* allow or disallow use of access strategy */
-	SO_ALLOW_STRAT = 1 << 4,
+	SO_ALLOW_STRAT = 1 << 5,
 	/* report location to syncscan logic? */
-	SO_ALLOW_SYNC = 1 << 5,
+	SO_ALLOW_SYNC = 1 << 6,
 	/* verify visibility page-at-a-time? */
-	SO_ALLOW_PAGEMODE = 1 << 6,
+	SO_ALLOW_PAGEMODE = 1 << 7,
 
 	/* unregister snapshot at scan end? */
-	SO_TEMP_SNAPSHOT = 1 << 7
+	SO_TEMP_SNAPSHOT = 1 << 8
 } ScanOptions;
 
 /*
@@ -830,6 +831,22 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ * For now, There are no cases where SO_TYPE_TIDSCAN is used, 
+ * but it may be useful to improve future Tid scans.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot)
+{
+	uint32		flags = SO_TYPE_TIDSCAN;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
diff --git a/src/test/regress/expected/tidscan.out b/src/test/regress/expected/tidscan.out
index 9b5eb04..13c3c36 100644
--- a/src/test/regress/expected/tidscan.out
+++ b/src/test/regress/expected/tidscan.out
@@ -277,4 +277,20 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 (1 row)
 
 RESET enable_hashjoin;
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+ id 
+----
+  1
+(1 row)
+
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ locktype |    mode    
+----------+------------
+ tuple    | SIReadLock
+(1 row)
+
+ROLLBACK;
 DROP TABLE tidscan;
diff --git a/src/test/regress/sql/tidscan.sql b/src/test/regress/sql/tidscan.sql
index ef05c09..313e0fb 100644
--- a/src/test/regress/sql/tidscan.sql
+++ b/src/test/regress/sql/tidscan.sql
@@ -94,4 +94,11 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 RESET enable_hashjoin;
 
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ROLLBACK;
+
 DROP TABLE tidscan;
#21Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#20)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On 2020/02/07 15:07, Kasahara Tatsuhito wrote:

Hi,

On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

It seems that nkeys and key are useless. Since every table_beginscan_*
functions have distinct parameter sets, don't we remove them from
table_beginscan_tid?

Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0
and * key is set to NULL,
so these are not used at the moment.

I removed unnecessary arguments from table_beginscan_tid().

Attache the v5 patch.

Thanks for updating the patch! The patch looks good to me.
So barring any objection, I will push it and back-patch to v12 *soon*
so that the upcoming minor version can contain it.

BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
and currtid_byrelname() so that they also call table_beginscan().
I'm not sure what those functions are, but probably we should fix them
so that table_beginscan_tid() is called instead. Thought?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#22Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Fujii Masao (#21)
1 attachment(s)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

Hi,

On Fri, Feb 7, 2020 at 5:02 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
and currtid_byrelname() so that they also call table_beginscan().
I'm not sure what those functions are, but probably we should fix them
so that table_beginscan_tid() is called instead. Thought?

+1, sorry, I overlooked it.

Both functions are used to check whether a valid tid or not with a
relation name (or oid),
and both perform a tid scan internally.
So, these functions should call table_beginscan_tid().

Perhaps unnecessary, I will attach a patch.

Best regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

Attachments:

fix_tidscan_issues_v6.patchapplication/octet-stream; name=fix_tidscan_issues_v6.patchDownload
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 16802b8..f0d4883 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -143,9 +143,8 @@ TidListEval(TidScanState *tidstate)
 	 */
 	if (tidstate->ss.ss_currentScanDesc == NULL)
 		tidstate->ss.ss_currentScanDesc =
-			table_beginscan(tidstate->ss.ss_currentRelation,
-							tidstate->ss.ps.state->es_snapshot,
-							0, NULL);
+			table_beginscan_tid(tidstate->ss.ss_currentRelation,
+							tidstate->ss.ps.state->es_snapshot);
 	scan = tidstate->ss.ss_currentScanDesc;
 
 	/*
diff --git a/src/backend/utils/adt/tid.c b/src/backend/utils/adt/tid.c
index 2a94eff..fad2057 100644
--- a/src/backend/utils/adt/tid.c
+++ b/src/backend/utils/adt/tid.c
@@ -381,7 +381,7 @@ currtid_byreloid(PG_FUNCTION_ARGS)
 	ItemPointerCopy(tid, result);
 
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
-	scan = table_beginscan(rel, snapshot, 0, NULL);
+	scan = table_beginscan_tid(rel, snapshot);
 	table_tuple_get_latest_tid(scan, result);
 	table_endscan(scan);
 	UnregisterSnapshot(snapshot);
@@ -419,7 +419,7 @@ currtid_byrelname(PG_FUNCTION_ARGS)
 	ItemPointerCopy(tid, result);
 
 	snapshot = RegisterSnapshot(GetLatestSnapshot());
-	scan = table_beginscan(rel, snapshot, 0, NULL);
+	scan = table_beginscan_tid(rel, snapshot);
 	table_tuple_get_latest_tid(scan, result);
 	table_endscan(scan);
 	UnregisterSnapshot(snapshot);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f..aab927d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -47,18 +47,19 @@ typedef enum ScanOptions
 	SO_TYPE_SEQSCAN = 1 << 0,
 	SO_TYPE_BITMAPSCAN = 1 << 1,
 	SO_TYPE_SAMPLESCAN = 1 << 2,
-	SO_TYPE_ANALYZE = 1 << 3,
+	SO_TYPE_TIDSCAN = 1 << 3,
+	SO_TYPE_ANALYZE = 1 << 4,
 
 	/* several of SO_ALLOW_* may be specified */
 	/* allow or disallow use of access strategy */
-	SO_ALLOW_STRAT = 1 << 4,
+	SO_ALLOW_STRAT = 1 << 5,
 	/* report location to syncscan logic? */
-	SO_ALLOW_SYNC = 1 << 5,
+	SO_ALLOW_SYNC = 1 << 6,
 	/* verify visibility page-at-a-time? */
-	SO_ALLOW_PAGEMODE = 1 << 6,
+	SO_ALLOW_PAGEMODE = 1 << 7,
 
 	/* unregister snapshot at scan end? */
-	SO_TEMP_SNAPSHOT = 1 << 7
+	SO_TEMP_SNAPSHOT = 1 << 8
 } ScanOptions;
 
 /*
@@ -830,6 +831,22 @@ table_beginscan_sampling(Relation rel, Snapshot snapshot,
 }
 
 /*
+ * table_beginscan_tid is an alternative entry point for setting up a
+ * TableScanDesc for a Tid scan.  Although that scan technology is
+ * really quite unlike a standard seqscan, there is just enough commonality to
+ * make it worth using the same data structure.
+ * For now, There are no cases where SO_TYPE_TIDSCAN is used, 
+ * but it may be useful to improve future Tid scans.
+ */
+static inline TableScanDesc
+table_beginscan_tid(Relation rel, Snapshot snapshot)
+{
+	uint32		flags = SO_TYPE_TIDSCAN;
+
+	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
+}
+
+/*
  * table_beginscan_analyze is an alternative entry point for setting up a
  * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
  * the same data structure although the behavior is rather different.
diff --git a/src/test/regress/expected/tidscan.out b/src/test/regress/expected/tidscan.out
index 9b5eb04..13c3c36 100644
--- a/src/test/regress/expected/tidscan.out
+++ b/src/test/regress/expected/tidscan.out
@@ -277,4 +277,20 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 (1 row)
 
 RESET enable_hashjoin;
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+ id 
+----
+  1
+(1 row)
+
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ locktype |    mode    
+----------+------------
+ tuple    | SIReadLock
+(1 row)
+
+ROLLBACK;
 DROP TABLE tidscan;
diff --git a/src/test/regress/sql/tidscan.sql b/src/test/regress/sql/tidscan.sql
index ef05c09..313e0fb 100644
--- a/src/test/regress/sql/tidscan.sql
+++ b/src/test/regress/sql/tidscan.sql
@@ -94,4 +94,11 @@ SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 ON t1.ctid = t2.ctid;
 RESET enable_hashjoin;
 
+-- check predicate lock on CTID
+BEGIN ISOLATION LEVEL SERIALIZABLE;
+SELECT * FROM tidscan WHERE ctid = '(0,1)';
+-- locktype should be 'tuple'
+SELECT locktype, mode FROM pg_locks WHERE pid = pg_backend_pid() AND mode = 'SIReadLock';
+ROLLBACK;
+
 DROP TABLE tidscan;
#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#21)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

At Fri, 7 Feb 2020 17:01:59 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/02/07 15:07, Kasahara Tatsuhito wrote:

Hi,
On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

It seems that nkeys and key are useless. Since every table_beginscan_*
functions have distinct parameter sets, don't we remove them from
table_beginscan_tid?

Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0
and * key is set to NULL,
so these are not used at the moment.
I removed unnecessary arguments from table_beginscan_tid().
Attache the v5 patch.

Thanks for updating the patch! The patch looks good to me.
So barring any objection, I will push it and back-patch to v12 *soon*
so that the upcoming minor version can contain it.

BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
and currtid_byrelname() so that they also call table_beginscan().
I'm not sure what those functions are, but probably we should fix them
so that table_beginscan_tid() is called instead. Thought?

At least they don't seem to need table_beginscan(), to me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#24Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kasahara Tatsuhito (#22)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On 2020/02/07 17:28, Kasahara Tatsuhito wrote:

Hi,

On Fri, Feb 7, 2020 at 5:02 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
and currtid_byrelname() so that they also call table_beginscan().
I'm not sure what those functions are, but probably we should fix them
so that table_beginscan_tid() is called instead. Thought?

+1, sorry, I overlooked it.

Both functions are used to check whether a valid tid or not with a
relation name (or oid),
and both perform a tid scan internally.
So, these functions should call table_beginscan_tid().

Perhaps unnecessary, I will attach a patch.

Pushed! Thanks!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#25Kasahara Tatsuhito
kasahara.tatsuhito@gmail.com
In reply to: Fujii Masao (#24)
Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

On Fri, Feb 7, 2020 at 10:09 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Pushed! Thanks!

Thanks Fujii.

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com