Is it worth to optimize VACUUM/ANALYZE by combining duplicate rel instances into single rel instance?

Started by Bharath Rupireddyalmost 5 years ago6 messages
#1Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com

Hi,

I'm reading the code for vacuum/analyze and it looks like currently we
call vacuum_rel/analyze_rel for each relation specified. Which means
that if a relation is specified more than once, then we simply
vacuum/analyze it that many times. Do we gain any advantage by
vacuuming/analyzing a relation back-to-back within a single command? I
strongly feel no. I'm thinking we could do a simple optimization here,
by transforming following VACUUM/ANALYZE commands to:
1) VACUUM t1, t2, t1, t2, t1; transform to -->
VACUUM t1, t2;
2) VACUUM ANALYZE t1(a1), t2(a2), t1(b1), t2(b2), t1(c1);
transform to --> VACUUM ANALYZE t1(a1, b1, c1), t2(a2, b2)
3) ANALYZE t1, t2, t1, t2, t1; transform to -->
ANALYZE t1, t2;
4) ANALYZE t1(a1), t2(a2), t1(b1), t2(b2), t1(c1);
transform to --> ANALYZE t1(a1, b1, c1), t2(a2, b2)

Above use cases may look pretty much unsound and we could think of
disallowing with an error for the use cases (1) and 3(), but the use
cases (2) and (4) are quite possible in customer scenarios(??). Please
feel free to add any other use cases you may think of.

The main advantage of the above said optimization is that the commands
can become a bit faster because we will avoid extra processing. I
would like to hear opinions on this. I'm not sure if this optimization
was already given a thought and not done because of some specific
reasons. If so, it will be great if someone can point me to those
discussions. Or it could be that I'm badly missing in my understanding
of current vacuum/analyze code, feel free to correct me.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bharath Rupireddy (#1)
Re: Is it worth to optimize VACUUM/ANALYZE by combining duplicate rel instances into single rel instance?

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

I'm reading the code for vacuum/analyze and it looks like currently we
call vacuum_rel/analyze_rel for each relation specified. Which means
that if a relation is specified more than once, then we simply
vacuum/analyze it that many times. Do we gain any advantage by
vacuuming/analyzing a relation back-to-back within a single command? I
strongly feel no. I'm thinking we could do a simple optimization here,

This really is not something to expend cycles and code complexity on.
If the user wrote the same table more than once, that's their choice.

regards, tom lane

#3Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Tom Lane (#2)
1 attachment(s)
Re: Is it worth to optimize VACUUM/ANALYZE by combining duplicate rel instances into single rel instance?

On Sat, Apr 10, 2021 at 8:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

I'm reading the code for vacuum/analyze and it looks like currently we
call vacuum_rel/analyze_rel for each relation specified. Which means
that if a relation is specified more than once, then we simply
vacuum/analyze it that many times. Do we gain any advantage by
vacuuming/analyzing a relation back-to-back within a single command? I
strongly feel no. I'm thinking we could do a simple optimization here,

This really is not something to expend cycles and code complexity on.
If the user wrote the same table more than once, that's their choice.

Thanks! I think we could avoid extra processing costs for cases like
VACUUM/ANALYZE foo, foo; when no explicit columns are specified. The
avoided costs can be lock acquire, relation open, vacuum/analyze,
relation close, starting new xact command, command counter increment
in case of analyze etc. This can be done with a simple patch like the
attached. When explicit columns are specified along with relations
i.e. VACUUM/ANALYZE foo(c1), foo(c2); we don't want to do the extra
complex processing to optimize the cases when c1 = c2.

Note that the TRUNCATE command currently skips processing repeated
relations (see ExecuteTruncate). For example, TRUNCATE foo, foo; and
TRUNCATE foo, ONLY foo, foo; first instance of relation foo is taken
into consideration for processing and other relation instances
(options specified if any) are ignored.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Skip-VACUUM-ANALYZE-of-repeated-relations.patchapplication/octet-stream; name=v1-0001-Skip-VACUUM-ANALYZE-of-repeated-relations.patchDownload
From 2c03dbd19ea485cccf8e0ec348a5710e196a3d7c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddy@enterprisedb.com>
Date: Wed, 21 Apr 2021 07:31:25 +0530
Subject: [PATCH v1] Skip VACUUM/ANALYZE of repeated relations

Skip vacuuming/analyzing of repeated relations to avoid
unnecessary processing. We do this only when no explicit
columns are specified, for instance, "VACUUM/ANALYZE foo,
foo;". When columns are specified along with relations
i.e. "VACUUM/ANALYZE foo(col1), foo(col2);", we don't want
to further optimize the cases when col1 = col2.
---
 src/backend/commands/vacuum.c        | 20 +++++++++++++++++++-
 src/test/regress/expected/vacuum.out |  2 +-
 src/test/regress/sql/vacuum.sql      |  2 +-
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 39df05c735..09d7971b15 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -438,6 +438,7 @@ vacuum(List *relations, VacuumParams *params,
 	PG_TRY();
 	{
 		ListCell   *cur;
+		List	*relids = NULL;
 
 		in_vacuum = true;
 		VacuumCostActive = (VacuumCostDelay > 0);
@@ -456,6 +457,16 @@ vacuum(List *relations, VacuumParams *params,
 		{
 			VacuumRelation *vrel = lfirst_node(VacuumRelation, cur);
 
+			/*
+			 * Skip VACUUM/ANALYZE of repeated relations. We do this only when
+			 * no explicit columns are specified, for instance, "VACUUM/ANALYZE
+			 * foo, foo;". When columns are specified along with relations
+			 * i.e. "VACUUM/ANALYZE foo(col1), foo(col2);", we don't want to
+			 * further optimize the cases when col1 = col2.
+			 */
+			if (vrel->va_cols == NULL && list_member_oid(relids, vrel->oid))
+				continue;
+
 			if (params->options & VACOPT_VACUUM)
 			{
 				if (!vacuum_rel(vrel->oid, vrel->relation, params))
@@ -488,11 +499,18 @@ vacuum(List *relations, VacuumParams *params,
 					/*
 					 * If we're not using separate xacts, better separate the
 					 * ANALYZE actions with CCIs.  This avoids trouble if user
-					 * says "ANALYZE t, t".
+					 * says "ANALYZE t(col1), t(col2)".
 					 */
 					CommandCounterIncrement();
 				}
 			}
+
+			/*
+			 * Remember the VACUUMed/ANALYZEed relation only when no explicit
+			 * columns are specified.
+			 */
+			if (vrel->va_cols == NULL)
+				relids = lappend_oid(relids, vrel->oid);
 		}
 	}
 	PG_FINALLY();
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 90cea6caa8..a67c56371d 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -218,7 +218,7 @@ ANALYZE vactst (i), vacparted (does_not_exist);
 ERROR:  column "does_not_exist" of relation "vacparted" does not exist
 ANALYZE vactst, vactst;
 BEGIN;  -- ANALYZE behaves differently inside a transaction block
-ANALYZE vactst, vactst;
+ANALYZE vactst (i), vactst (i);
 COMMIT;
 -- parenthesized syntax for ANALYZE
 ANALYZE (VERBOSE) does_not_exist;
diff --git a/src/test/regress/sql/vacuum.sql b/src/test/regress/sql/vacuum.sql
index 93fd258fc0..7daa3a7723 100644
--- a/src/test/regress/sql/vacuum.sql
+++ b/src/test/regress/sql/vacuum.sql
@@ -183,7 +183,7 @@ ANALYZE vactst, does_not_exist, vacparted;
 ANALYZE vactst (i), vacparted (does_not_exist);
 ANALYZE vactst, vactst;
 BEGIN;  -- ANALYZE behaves differently inside a transaction block
-ANALYZE vactst, vactst;
+ANALYZE vactst (i), vactst (i);
 COMMIT;
 
 -- parenthesized syntax for ANALYZE
-- 
2.25.1

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#3)
Re: Is it worth to optimize VACUUM/ANALYZE by combining duplicate rel instances into single rel instance?

At Wed, 21 Apr 2021 07:34:40 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Sat, Apr 10, 2021 at 8:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

I'm reading the code for vacuum/analyze and it looks like currently we
call vacuum_rel/analyze_rel for each relation specified. Which means
that if a relation is specified more than once, then we simply
vacuum/analyze it that many times. Do we gain any advantage by
vacuuming/analyzing a relation back-to-back within a single command? I
strongly feel no. I'm thinking we could do a simple optimization here,

This really is not something to expend cycles and code complexity on.
If the user wrote the same table more than once, that's their choice.

Thanks! I think we could avoid extra processing costs for cases like
VACUUM/ANALYZE foo, foo; when no explicit columns are specified. The
avoided costs can be lock acquire, relation open, vacuum/analyze,
relation close, starting new xact command, command counter increment
in case of analyze etc. This can be done with a simple patch like the
attached. When explicit columns are specified along with relations
i.e. VACUUM/ANALYZE foo(c1), foo(c2); we don't want to do the extra
complex processing to optimize the cases when c1 = c2.

Note that the TRUNCATE command currently skips processing repeated
relations (see ExecuteTruncate). For example, TRUNCATE foo, foo; and
TRUNCATE foo, ONLY foo, foo; first instance of relation foo is taken
into consideration for processing and other relation instances
(options specified if any) are ignored.

Thoughts?

Although I don't strongly oppose to check that, the check of truncate
is natural and required. The relation list is anyway used afterwards,
and we cannot truncate the same relation twice or more since a
relation under "use" cannot be truncated. (Truncation is one form of
use). In short, TRUNCATE runs no checking just for the check's own
sake.

On the other hand the patch creates a relation list just for this
purpose, which is not needed to run VACUUM/ANALYZE, and VACUUM/ANALYE
works well with duplicates in target relations.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#4)
Re: Is it worth to optimize VACUUM/ANALYZE by combining duplicate rel instances into single rel instance?

On Wed, Apr 21, 2021 at 11:32:49AM +0900, Kyotaro Horiguchi wrote:

On the other hand the patch creates a relation list just for this
purpose, which is not needed to run VACUUM/ANALYZE, and VACUUM/ANALYE
works well with duplicates in target relations.

Yeah, I don't think either that this is worth spending cycles on, not
to mention that the current behavior could be handy as VACUUM uses
separate transactions for each relation vacuumed if more than one
relation is listed in the set.
--
Michael

#6Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: Is it worth to optimize VACUUM/ANALYZE by combining duplicate rel instances into single rel instance?

On Wed, Apr 21, 2021 at 8:02 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Thanks! I think we could avoid extra processing costs for cases like
VACUUM/ANALYZE foo, foo; when no explicit columns are specified. The
avoided costs can be lock acquire, relation open, vacuum/analyze,
relation close, starting new xact command, command counter increment
in case of analyze etc. This can be done with a simple patch like the
attached. When explicit columns are specified along with relations
i.e. VACUUM/ANALYZE foo(c1), foo(c2); we don't want to do the extra
complex processing to optimize the cases when c1 = c2.

Note that the TRUNCATE command currently skips processing repeated
relations (see ExecuteTruncate). For example, TRUNCATE foo, foo; and
TRUNCATE foo, ONLY foo, foo; first instance of relation foo is taken
into consideration for processing and other relation instances
(options specified if any) are ignored.

Thoughts?

Although I don't strongly oppose to check that, the check of truncate
is natural and required. The relation list is anyway used afterwards,
and we cannot truncate the same relation twice or more since a
relation under "use" cannot be truncated. (Truncation is one form of
use). In short, TRUNCATE runs no checking just for the check's own
sake.

Thanks for the point. Yes, if we don't skip repeated instances we do
get below error:
postgres=# truncate t1, t1;
ERROR: cannot TRUNCATE "t1" because it is being used by active
queries in this session

On the other hand the patch creates a relation list just for this
purpose, which is not needed to run VACUUM/ANALYZE, and VACUUM/ANALYE
works well with duplicates in target relations.

Yeah, the relids list is only used to skip the duplicates. I feel
that's okay given the negligible extra processing (searching for the
relids in the list) we add with it versus the extra processing we
avoid with skipping duplicates, see [1]tested on my dev system, with default postgresql.conf, t1 is having 10mn rows: HEAD: postgres=# analyze t1; Time: 363.580 ms postgres=# analyze t1; Time: 384.760 ms.

Although VACUUM/ANALYZE works well with duplicate relations without
any error (unlike TRUNCATE), is there any benefit if we run
back-to-back VACUUM/ANALYZE within a single command? I assume that
there's no benefit. My only point was that even if somebody specifies
duplicate relations, we could avoid some processing effort see [1]tested on my dev system, with default postgresql.conf, t1 is having 10mn rows: HEAD: postgres=# analyze t1; Time: 363.580 ms postgres=# analyze t1; Time: 384.760 ms for
the gain. For ANALYZE, we can avoid doing extra
StartTransactionCommand, CommitTransactionCommand and
CommandCounterIncrement as well.

I know the use cases that I'm trying to optimize with the patch are
worthless and unrealistic (may be written by someone like me). Since
we generally don't optimize for rare and unrecommended scenarios, I'm
okay if we drop this patch. But I would like to mention [1]tested on my dev system, with default postgresql.conf, t1 is having 10mn rows: HEAD: postgres=# analyze t1; Time: 363.580 ms postgres=# analyze t1; Time: 384.760 ms the gain
we get with the patch.

[1]: tested on my dev system, with default postgresql.conf, t1 is having 10mn rows: HEAD: postgres=# analyze t1; Time: 363.580 ms postgres=# analyze t1; Time: 384.760 ms
having 10mn rows:
HEAD:
postgres=# analyze t1;
Time: 363.580 ms
postgres=# analyze t1;
Time: 384.760 ms

postgres=# analyze t1, t1;
Time: 687.976 ms
postgres=# analyze t1, t1;
Time: 664.420 ms

postgres=# analyze t1, t1, t1;
Time: 1010.855 ms (00:01.011)
postgres=# analyze t1, t1, t1;
Time: 1119.970 ms (00:01.120)

postgres=# analyze t1, t1, t1, t1;
Time: 1350.345 ms (00:01.350)
postgres=# analyze t1, t1, t1, t1;
Time: 1316.738 ms (00:01.317)

postgres=# analyze t1, t1, t1, t1, t1;
Time: 1651.780 ms (00:01.652)
postgres=# analyze t1, t1, t1, t1, t1, t1;
Time: 1983.163 ms (00:01.983)

PATCHed:
postgres=# analyze t1;
Time: 356.709 ms
postgres=# analyze t1;
Time: 360.780 ms

postgres=# analyze t1, t1;
Time: 377.193 ms
postgres=# analyze t1, t1;
Time: 370.636 ms

postgres=# analyze t1, t1, t1;
Time: 364.271 ms
postgres=# analyze t1, t1, t1;
Time: 349.988 ms

postgres=# analyze t1, t1, t1, t1;
Time: 362.567 ms
postgres=# analyze t1, t1, t1, t1;
Time: 383.292 ms

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com