Autovacuum on partitioned table

Started by yuzuko about 6 years ago, 108 messages
#1 yuzuko
yuzukohosoya@gmail.com
1 attachment(s)

Hello,

As Greg reported in [1], autovacuum ignores partitioned tables.
That is, even if the statistics of individual partitions are updated, the
parent's statistics are not. This is a TODO item for declarative partitioning.
As Amit mentioned in [2], a way to build the parent's statistics from the
partitions' statistics without scanning the partitions would be nice,
but it would need a lot of modifications. So I tried to fix this using the
current analyze method.

The summary of the attached patch is as follows:
* If the relation is a partitioned table, check whether its children need
vacuum or analyze. Children that need it are added to
a table list for autovacuum. If at least one child is added to the list,
the partitioned table is also added to the list. Then, autovacuum
runs on all the tables in the list.
* If the partitioned table has foreign partitions, ignore them.
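
The selection rule summarized above can be sketched as a small standalone C function. This is only an illustration and not the patch's code: ChildRel, its fields and collect_partition_tree() are hypothetical names, and needs_work stands in for the (dovacuum || doanalyze) result that relation_needs_vacanalyze() produces in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one partition; not a PostgreSQL structure. */
typedef struct ChildRel
{
	unsigned int oid;
	bool		is_foreign;		/* foreign partitions are ignored */
	bool		needs_work;		/* stands in for (dovacuum || doanalyze) */
} ChildRel;

/*
 * Fill "out" with the OIDs that would be queued for one partitioned table:
 * the parent first, then every non-foreign child that needs work.  Nothing
 * is queued when no child qualifies.  "out" must have room for
 * nchildren + 1 entries.  Returns the number of entries written.
 */
static size_t
collect_partition_tree(unsigned int parent_oid,
					   const ChildRel *children, size_t nchildren,
					   unsigned int *out)
{
	size_t		n = 0;
	size_t		i;

	assert(children != NULL && out != NULL);

	/* Slot 0 is reserved for the parent, so children start at slot 1. */
	for (i = 0; i < nchildren; i++)
	{
		if (children[i].is_foreign)
			continue;
		if (children[i].needs_work)
			out[++n] = children[i].oid;
	}

	if (n == 0)
		return 0;				/* no child qualified, so skip the parent too */

	out[0] = parent_oid;
	return n + 1;
}
```

For a parent with one idle child, one foreign child and one child that needs work, the function queues the parent and that single child, in that order.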

When the parent has children that don't need vacuum/analyze, or foreign
partitions, the parent's stats are still updated by scanning the current data
of all children, so old and new stats are mixed within the partition tree.
Is that acceptable? Any thoughts?

[1]: /messages/by-id/CAM-w4HMQKC8hw7nB9TW3OV+hkB5OUcPtvr_U_EiSOjByoa-e4Q@mail.gmail.com
[2]: /messages/by-id/CA+HiwqEeZQ-H2OVbHZ=n2RNNPF84Hygi1HC-MDwC-VnBjpA1=Q@mail.gmail.com

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v1_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c1dd8168ca..d4ce99a9f5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2036,11 +2037,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2063,7 +2064,12 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			continue;
+
+		/* Collect partitioned tables, not partitions.  So skip them. */
+		if (classForm->relispartition)
 			continue;
 
 		relid = classForm->oid;
@@ -2092,19 +2098,76 @@ do_autovacuum(void)
 			continue;
 		}
 
-		/* Fetch reloptions and the pgstat entry for this table */
-		relopts = extract_autovac_opts(tuple, pg_class_desc);
-		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-											 shared, dbentry);
+		if (classForm->relkind == RELKIND_RELATION ||
+			classForm->relkind == RELKIND_MATVIEW)
+		{
+			/* Fetch reloptions and the pgstat entry for this table */
+			relopts = extract_autovac_opts(tuple, pg_class_desc);
+			tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+												 shared, dbentry);
+
+			/* Check if it needs vacuum or analyze */
+			relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  &dovacuum, &doanalyze, &wraparound);
+
+			/* Relations that need work are added to table_oids */
+			if (dovacuum || doanalyze)
+				table_oids = lappend_oid(table_oids, relid);
+		}
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we check if its children
+			 * need vacuum or analyze.  All children excluding foreign partitions
+			 * need to do that are added to the table_oids list.  At least one 
+			 * child is added the list, the partitioned table become an object
+			 * for autovacuum. 
+			 */
+			List     *tableOIDs;
+			ListCell *lc;
+			List     *child_oids = NIL;
 
-		/* Check if it needs vacuum or analyze */
-		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
-								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+			/* Find all members of inheritance set taking AccessShareLock */
+			tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
 
-		/* Relations that need work are added to table_oids */
-		if (dovacuum || doanalyze)
-			table_oids = lappend_oid(table_oids, relid);
+			foreach(lc, tableOIDs)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclassForm;
+
+				/* Ignore the parent table */
+				if (childOID == relid)
+					continue;
+
+				childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+				childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip foreign partitions */
+				if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
+					continue;
+
+				/* Fetch reloptions and the pgstat entry for this table */
+				relopts = extract_autovac_opts(childtuple, pg_class_desc);
+				tabentry = get_pgstat_tabentry_relid(childOID,
+													 childclassForm->relisshared,
+													 shared, dbentry);
+
+				relation_needs_vacanalyze(childOID, relopts, childclassForm, tabentry,
+										  effective_multixact_freeze_max_age,
+										  &dovacuum, &doanalyze, &wraparound);
+
+				if (dovacuum || doanalyze)
+					child_oids = lappend_oid(child_oids, childOID);
+			}
+
+			if (child_oids)
+			{
+				table_oids = lappend_oid(table_oids, relid);
+				table_oids = list_concat(table_oids, child_oids);
+			}
+		}
 
 		/*
 		 * Remember TOAST associations for the second pass.  Note: we must do
@@ -2725,6 +2788,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2798,33 +2862,79 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		return NULL;
 	classForm = (Form_pg_class) GETSTRUCT(classTup);
 
-	/*
-	 * Get the applicable reloptions.  If it is a TOAST table, try to get the
-	 * main table reloptions if the toast table itself doesn't have.
-	 */
-	avopts = extract_autovac_opts(classTup, pg_class_desc);
-	if (classForm->relkind == RELKIND_TOASTVALUE &&
-		avopts == NULL && table_toast_map != NULL)
+	if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
 	{
-		av_relation *hentry;
-		bool		found;
+		/*
+		 * Get the applicable reloptions.  If it is a TOAST table, try to get the
+		 * main table reloptions if the toast table itself doesn't have.
+		 */
+		avopts = extract_autovac_opts(classTup, pg_class_desc);
+		if (classForm->relkind == RELKIND_TOASTVALUE &&
+			avopts == NULL && table_toast_map != NULL)
+		{
+			av_relation *hentry;
+			bool		found;
+
+			hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
+			if (found && hentry->ar_hasrelopts)
+				avopts = &hentry->ar_reloptions;
+		}
+
+		/* fetch the pgstat table entry */
+		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+											 shared, dbentry);
+
+		relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
+								  effective_multixact_freeze_max_age,
+								  &dovacuum, &doanalyze, &wraparound);
 
-		hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
-		if (found && hentry->ar_hasrelopts)
-			avopts = &hentry->ar_reloptions;
+		/* ignore ANALYZE for toast tables */
+		if (classForm->relkind == RELKIND_TOASTVALUE)
+			doanalyze = false;
 	}
+	else
+	{
+		/* 
+		 * If the relation is partitioned and doesn't have any foreign tables
+		 * we check its children again.
+		 */
+		List     *tableOIDs;
+		ListCell *lc;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, tableOIDs)
+		{
+			Oid       childOID = lfirst_oid(lc);
+			HeapTuple childtuple;
+			Form_pg_class childclassForm;
+
+			/* Ignore the parent table */
+			if (childOID == relid)
+				continue;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
+
+			/* Skip foreign partitions */
+			if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
 
-	/* fetch the pgstat table entry */
-	tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-										 shared, dbentry);
+			/* Fetch reloptions and the pgstat entry */
+			avopts = extract_autovac_opts(childtuple, pg_class_desc);
+			tabentry = get_pgstat_tabentry_relid(childOID, childclassForm->relisshared,
+												 shared, dbentry);
 
-	relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
-							  effective_multixact_freeze_max_age,
-							  &dovacuum, &doanalyze, &wraparound);
+			relation_needs_vacanalyze(childOID, avopts, childclassForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  &dovacuum, &doanalyze, &wraparound);
 
-	/* ignore ANALYZE for toast tables */
-	if (classForm->relkind == RELKIND_TOASTVALUE)
-		doanalyze = false;
+			/* Its parents need vacuum or analyze */
+			if (dovacuum || doanalyze)
+				break;
+		}
+	}
 
 	/* OK, it needs something done */
 	if (doanalyze || dovacuum)
#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: yuzuko (#1)
Re: Autovacuum on partitioned table

On Mon, 2019-12-02 at 18:02 +0900, yuzuko wrote:

As Greg reported in [1], autovacuum ignores partitioned tables.
That is, even if the statistics of individual partitions are updated, the
parent's statistics are not. This is a TODO item for declarative partitioning.
As Amit mentioned in [2], a way to build the parent's statistics from the
partitions' statistics without scanning the partitions would be nice,
but it would need a lot of modifications. So I tried to fix this using the
current analyze method.

The summary of the attached patch is as follows:
* If the relation is a partitioned table, check whether its children need
vacuum or analyze. Children that need it are added to
a table list for autovacuum. If at least one child is added to the list,
the partitioned table is also added to the list. Then, autovacuum
runs on all the tables in the list.

That means that all partitions are vacuumed if only one of them needs it,
right? This will result in way more vacuuming than necessary.

Wouldn't it be an option to update the partitioned table's statistics
whenever one of the partitions is vacuumed?

Yours,
Laurenz Albe

#3 yuzuko
yuzukohosoya@gmail.com
In reply to: Laurenz Albe (#2)
Re: Autovacuum on partitioned table

Hi Laurenz,

Thanks for the comments.

On Mon, Dec 2, 2019 at 6:19 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

On Mon, 2019-12-02 at 18:02 +0900, yuzuko wrote:

As Greg reported in [1], autovacuum ignores partitioned tables.
That is, even if the statistics of individual partitions are updated, the
parent's statistics are not. This is a TODO item for declarative partitioning.
As Amit mentioned in [2], a way to build the parent's statistics from the
partitions' statistics without scanning the partitions would be nice,
but it would need a lot of modifications. So I tried to fix this using the
current analyze method.

The summary of the attached patch is as follows:
* If the relation is a partitioned table, check whether its children need
vacuum or analyze. Children that need it are added to
a table list for autovacuum. If at least one child is added to the list,
the partitioned table is also added to the list. Then, autovacuum
runs on all the tables in the list.

That means that all partitions are vacuumed if only one of them needs it,
right? This will result in way more vacuuming than necessary.

Autovacuum runs only on the partitions that need vacuum/analyze, so the
stats of the other partitions are not updated. However, to build the parent's
stats, all children are scanned, which might be a waste of time.

Wouldn't it be an option to update the partitioned table's statistics
whenever one of the partitions is vacuumed?

Yours,
Laurenz Albe

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#4 yuzuko
yuzukohosoya@gmail.com
In reply to: yuzuko (#3)
1 attachment(s)
Re: Autovacuum on partitioned table

Hi,

Following Laurenz's comment in this thread, I tried adding an option
to update the parent's statistics during autovacuum. To do that,
I propose supporting the existing 'autovacuum_enabled' option
on partitioned tables.

In the attached patch, you can use the 'autovacuum_enabled' option
on a partitioned table as usual; that is, the default value of this option
is true. So if you don't need autovacuum on a partitioned table,
you have to specify the option:
CREATE TABLE p(i int) partition by range(i) with (autovacuum_enabled=0);

I'm not sure, but I wonder whether false might be a more suitable default
for 'autovacuum_enabled' on partitioned tables,
because autovacuum on *partitioned tables* requires scanning
all children to build the partitioned tables' statistics.
But if the default value varies according to the kind of relation,
would that be confusing? Any thoughts?
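
As a rough illustration of the default discussed above, the following standalone C sketch mirrors how the v2 patch treats a missing reloptions entry as "enabled". AvOpts, parent_autovacuum_enabled() and queue_parent() are hypothetical names introduced for this example only; they are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for parsed autovacuum reloptions. */
typedef struct AvOpts
{
	bool		enabled;		/* value of autovacuum_enabled */
} AvOpts;

/*
 * A NULL pointer means no reloptions were stored for the relation, in which
 * case the option falls back to its default of true.
 */
static bool
parent_autovacuum_enabled(const AvOpts *opts)
{
	return opts ? opts->enabled : true;
}

/*
 * The partitioned table is queued only when at least one child needs work
 * and autovacuum has not been disabled on the parent itself.
 */
static bool
queue_parent(const AvOpts *opts, int children_needing_work)
{
	assert(children_needing_work >= 0);
	return children_needing_work > 0 && parent_autovacuum_enabled(opts);
}
```

Under this reading, setting autovacuum_enabled=0 on the parent only keeps the parent out of the list; children that need work are still queued on their own.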

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v2_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e9a31aa257..8d80efd445 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1308,8 +1308,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 48377ace24..d640aa4bda 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -1586,13 +1586,19 @@ build_reloptions(Datum reloptions, bool validate,
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+	static const relopt_parse_elt tab[] = {
+		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
+		 offsetof(PartitionedTableOptions, autovacuum_enabled)}
+	};
+
 	/*
 	 * There are no options for partitioned tables yet, but this is able to do
 	 * some validation.
 	 */
 	return (bytea *) build_reloptions(reloptions, validate,
 									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+									  sizeof(PartitionedTableOptions),
+									  tab, lengthof(tab));
 }
 
 /*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c1dd8168ca..7bd4950d90 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2036,11 +2037,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2063,7 +2064,12 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			continue;
+
+		/* Collect partitioned tables, not partitions.  So skip them. */
+		if (classForm->relispartition)
 			continue;
 
 		relid = classForm->oid;
@@ -2092,19 +2098,85 @@ do_autovacuum(void)
 			continue;
 		}
 
-		/* Fetch reloptions and the pgstat entry for this table */
-		relopts = extract_autovac_opts(tuple, pg_class_desc);
-		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-											 shared, dbentry);
+		if (classForm->relkind == RELKIND_RELATION ||
+			classForm->relkind == RELKIND_MATVIEW)
+		{
+			/* Fetch reloptions and the pgstat entry for this table */
+			relopts = extract_autovac_opts(tuple, pg_class_desc);
+			tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+												 shared, dbentry);
+
+			/* Check if it needs vacuum or analyze */
+			relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  &dovacuum, &doanalyze, &wraparound);
+
+			/* Relations that need work are added to table_oids */
+			if (dovacuum || doanalyze)
+				table_oids = lappend_oid(table_oids, relid);
+		}
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we check if its children
+			 * need vacuum or analyze.  All children excluding foreign partitions
+			 * need to do that are added to the table_oids list.  At least one 
+			 * child is added the list, the partitioned table become an object
+			 * for autovacuum. 
+			 */
+			List     *tableOIDs;
+			ListCell *lc;
+			List     *child_oids = NIL;
+			bool      av_enabled;
 
-		/* Check if it needs vacuum or analyze */
-		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
-								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+			/*
+			 * Fetch reloptions and check whether partitioned table needs
+			 * autovacuum or not.
+			 */
+			relopts = extract_autovac_opts(tuple, pg_class_desc);
+			av_enabled = (relopts ? relopts->enabled : true);
 
-		/* Relations that need work are added to table_oids */
-		if (dovacuum || doanalyze)
-			table_oids = lappend_oid(table_oids, relid);
+			/* Find all members of inheritance set taking AccessShareLock */
+			tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, tableOIDs)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclassForm;
+
+				/* Ignore the parent table */
+				if (childOID == relid)
+					continue;
+
+				childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+				childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip foreign partitions */
+				if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
+					continue;
+
+				/* Fetch reloptions and the pgstat entry for this table */
+				relopts = extract_autovac_opts(childtuple, pg_class_desc);
+				tabentry = get_pgstat_tabentry_relid(childOID,
+													 childclassForm->relisshared,
+													 shared, dbentry);
+
+				relation_needs_vacanalyze(childOID, relopts, childclassForm, tabentry,
+										  effective_multixact_freeze_max_age,
+										  &dovacuum, &doanalyze, &wraparound);
+
+				if (dovacuum || doanalyze)
+					child_oids = lappend_oid(child_oids, childOID);
+			}
+
+			if (child_oids)
+			{
+				if (av_enabled)
+					table_oids = lappend_oid(table_oids, relid);
+				table_oids = list_concat(table_oids, child_oids);
+			}
+		}
 
 		/*
 		 * Remember TOAST associations for the second pass.  Note: we must do
@@ -2725,6 +2797,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2798,33 +2871,79 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		return NULL;
 	classForm = (Form_pg_class) GETSTRUCT(classTup);
 
-	/*
-	 * Get the applicable reloptions.  If it is a TOAST table, try to get the
-	 * main table reloptions if the toast table itself doesn't have.
-	 */
-	avopts = extract_autovac_opts(classTup, pg_class_desc);
-	if (classForm->relkind == RELKIND_TOASTVALUE &&
-		avopts == NULL && table_toast_map != NULL)
+	if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
 	{
-		av_relation *hentry;
-		bool		found;
+		/*
+		 * Get the applicable reloptions.  If it is a TOAST table, try to get the
+		 * main table reloptions if the toast table itself doesn't have.
+		 */
+		avopts = extract_autovac_opts(classTup, pg_class_desc);
+		if (classForm->relkind == RELKIND_TOASTVALUE &&
+			avopts == NULL && table_toast_map != NULL)
+		{
+			av_relation *hentry;
+			bool		found;
+
+			hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
+			if (found && hentry->ar_hasrelopts)
+				avopts = &hentry->ar_reloptions;
+		}
+
+		/* fetch the pgstat table entry */
+		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+											 shared, dbentry);
 
-		hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
-		if (found && hentry->ar_hasrelopts)
-			avopts = &hentry->ar_reloptions;
+		relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
+								  effective_multixact_freeze_max_age,
+								  &dovacuum, &doanalyze, &wraparound);
+
+		/* ignore ANALYZE for toast tables */
+		if (classForm->relkind == RELKIND_TOASTVALUE)
+			doanalyze = false;
 	}
+	else
+	{
+		/* 
+		 * If the relation is partitioned and doesn't have any foreign tables
+		 * we check its children again.
+		 */
+		List     *tableOIDs;
+		ListCell *lc;
 
-	/* fetch the pgstat table entry */
-	tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-										 shared, dbentry);
+		/* Find all members of inheritance set taking AccessShareLock */
+		tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, tableOIDs)
+		{
+			Oid       childOID = lfirst_oid(lc);
+			HeapTuple childtuple;
+			Form_pg_class childclassForm;
+
+			/* Ignore the parent table */
+			if (childOID == relid)
+				continue;
 
-	relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
-							  effective_multixact_freeze_max_age,
-							  &dovacuum, &doanalyze, &wraparound);
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
 
-	/* ignore ANALYZE for toast tables */
-	if (classForm->relkind == RELKIND_TOASTVALUE)
-		doanalyze = false;
+			/* Skip foreign partitions */
+			if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Fetch reloptions and the pgstat entry */
+			avopts = extract_autovac_opts(childtuple, pg_class_desc);
+			tabentry = get_pgstat_tabentry_relid(childOID, childclassForm->relisshared,
+												 shared, dbentry);
+
+			relation_needs_vacanalyze(childOID, avopts, childclassForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  &dovacuum, &doanalyze, &wraparound);
+
+			/* Its parents need vacuum or analyze */
+			if (dovacuum || doanalyze)
+				break;
+		}
+	}
 
 	/* OK, it needs something done */
 	if (doanalyze || dovacuum)
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a10e..d007d7feab 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -326,6 +326,16 @@ typedef struct StdRdOptions
 	((relation)->rd_options ? \
 	 ((StdRdOptions *) (relation)->rd_options)->parallel_workers : (defaultpw))
 
+/*
+ * PartitionedTableOptions
+ *		Contents of rd_options for partitioned tables
+ */
+typedef struct PartitionedTableOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	bool        autovacuum_enabled;
+} PartitionedTableOptions;
+
 /* ViewOptions->check_option values */
 typedef enum ViewOptCheckOption
 {
#5 Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: yuzuko (#4)
Re: Autovacuum on partitioned table

On Fri, 27 Dec 2019 at 12:37, yuzuko <yuzukohosoya@gmail.com> wrote:

Hi,

Following Laurenz's comment in this thread, I tried adding an option
to update the parent's statistics during autovacuum. To do that,
I propose supporting the existing 'autovacuum_enabled' option
on partitioned tables.

In the attached patch, you can use the 'autovacuum_enabled' option
on a partitioned table as usual; that is, the default value of this option
is true. So if you don't need autovacuum on a partitioned table,
you have to specify the option:
CREATE TABLE p(i int) partition by range(i) with (autovacuum_enabled=0);

I'm not sure, but I wonder whether false might be a more suitable default
for 'autovacuum_enabled' on partitioned tables,
because autovacuum on *partitioned tables* requires scanning
all children to build the partitioned tables' statistics.
But if the default value varies according to the kind of relation,
would that be confusing? Any thoughts?

I haven't looked at the patch deeply yet, but it seems to attempt
to vacuum partitioned tables. IIUC partitioned tables don't need to
be vacuumed; all of their child tables are vacuumed instead if we pass
the partitioned table to the vacuum() function. And autovacuum on child
tables is already triggered normally, since their statistics are updated.

I think it's a good idea to have that option, but running
autovacuum on the parent table every time autovacuum is triggered
on one of its child tables is very costly, especially when there are
a lot of child tables. Instead, I think it's more straightforward to
compare the sum of the child tables' statistics (e.g.
n_live_tuples, n_dead_tuples, etc.) to the vacuum thresholds when we
consider whether the parent table needs autovacuum. What do you
think?
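
One way to picture the summation idea is the standalone C sketch below. It is an assumption-laden illustration, not code from any patch: ChildStats and parent_exceeds_threshold() are hypothetical names, and the summed live-tuple count is used as a stand-in for the tuple count that the usual "base threshold + scale factor * number of tuples" test is applied to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-child counters; not the pgstat structures themselves. */
typedef struct ChildStats
{
	double		live_tuples;
	double		dead_tuples;
} ChildStats;

/*
 * Sum the children's counters and apply the usual threshold form
 * (base threshold + scale factor * tuple count) to the totals.
 */
static bool
parent_exceeds_threshold(const ChildStats *children, size_t nchildren,
						 double base_thresh, double scale_factor)
{
	double		live = 0.0;
	double		dead = 0.0;
	size_t		i;

	assert(children != NULL || nchildren == 0);

	for (i = 0; i < nchildren; i++)
	{
		live += children[i].live_tuples;
		dead += children[i].dead_tuples;
	}

	return dead > base_thresh + scale_factor * live;
}
```

With the stock defaults of autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor = 0.2, two children holding 1000 live tuples each give a parent-level threshold of 450 dead tuples, so no scan of the children's data is needed to make the decision.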

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Amit Langote
amitlangote09@gmail.com
In reply to: Masahiko Sawada (#5)
Re: Autovacuum on partitioned table

Hello,

On Fri, Dec 27, 2019 at 2:02 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Fri, 27 Dec 2019 at 12:37, yuzuko <yuzukohosoya@gmail.com> wrote:

Following Laurenz's comment in this thread, I tried adding an option
to update the parent's statistics during autovacuum. To do that,
I propose supporting the existing 'autovacuum_enabled' option
on partitioned tables.

In the attached patch, you can use the 'autovacuum_enabled' option
on a partitioned table as usual; that is, the default value of this option
is true. So if you don't need autovacuum on a partitioned table,
you have to specify the option:
CREATE TABLE p(i int) partition by range(i) with (autovacuum_enabled=0);

I'm not sure, but I wonder whether false might be a more suitable default
for 'autovacuum_enabled' on partitioned tables,
because autovacuum on *partitioned tables* requires scanning
all children to build the partitioned tables' statistics.
But if the default value varies according to the kind of relation,
would that be confusing? Any thoughts?

I haven't looked at the patch deeply yet, but it seems to attempt
to vacuum partitioned tables. IIUC partitioned tables don't need to
be vacuumed; all of their child tables are vacuumed instead if we pass
the partitioned table to the vacuum() function. And autovacuum on child
tables is already triggered normally, since their statistics are updated.

I think it's a good idea to have that option, but running
autovacuum on the parent table every time autovacuum is triggered
on one of its child tables is very costly, especially when there are
a lot of child tables. Instead, I think it's more straightforward to
compare the sum of the child tables' statistics (e.g.
n_live_tuples, n_dead_tuples, etc.) to the vacuum thresholds when we
consider whether the parent table needs autovacuum. What do you
think?

There's this old email where Tom outlines a few ideas about triggering
auto-analyze on inheritance trees:

/messages/by-id/4823.1262132964@sss.pgh.pa.us

If I'm reading that correctly, the idea is to track only
changes_since_analyze and none of the finer-grained stats like
live/dead tuples for inheritance parents (partitioned tables) using
some new pgstat infrastructure, an idea that Hosoya-san also seems to
be considering per an off-list discussion. Besides the complexity of
getting that infrastructure in place, an important question is whether
the current system of applying threshold and scale factor to
changes_since_analyze should be used as-is for inheritance parents
(partitioned tables), because if users set those parameters the same way
as for regular tables, autovacuum might analyze partitioned tables
more than necessary. We'll either need a different formula, or some
commentary in the documentation about how partitioned tables might
need different settings, or maybe both.
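
The following standalone C sketch shows one possible reading of the changes_since_analyze idea, purely as an illustration. ParentCounter and the three functions are hypothetical names; no such pgstat infrastructure exists in the code discussed here, and the threshold test simply reuses the existing "analyze threshold + analyze scale factor * tuple count" form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-parent counter; only changes since the last analyze. */
typedef struct ParentCounter
{
	double		changes_since_analyze;
	double		reltuples;		/* tuple estimate from the last analyze */
} ParentCounter;

/* Every change reported for a partition is also counted for the parent. */
static void
report_child_changes(ParentCounter *parent, double nchanged)
{
	assert(parent != NULL && nchanged >= 0.0);
	parent->changes_since_analyze += nchanged;
}

/* Same form as the existing test: base threshold + scale factor * tuples. */
static bool
parent_needs_analyze(const ParentCounter *parent,
					 double base_thresh, double scale_factor)
{
	assert(parent != NULL);
	return parent->changes_since_analyze >
		base_thresh + scale_factor * parent->reltuples;
}

/* After the parent is analyzed, the counter starts over. */
static void
parent_analyzed(ParentCounter *parent, double new_reltuples)
{
	assert(parent != NULL);
	parent->changes_since_analyze = 0.0;
	parent->reltuples = new_reltuples;
}
```

With the stock defaults of autovacuum_analyze_threshold = 50 and autovacuum_analyze_scale_factor = 0.1, a parent with 10000 tuples would be analyzed once more than 1050 changes have accumulated across all of its partitions. Each such analyze scans the whole partition tree, so the cost concern raised above could be explored by passing different parameter values for parents.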

By the way, maybe I'm misunderstanding what Sawada-san wrote above,
but the only missing piece seems to be a way to trigger an *analyze*
on the parent tables -- to collect optimizer statistics for the
inheritance trees -- not vacuum, for which the existing system seems
enough.

Thanks,
Amit

#7 Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Langote (#6)
Re: Autovacuum on partitioned table

On Tue, 28 Jan 2020 at 17:52, Amit Langote <amitlangote09@gmail.com> wrote:

Hello,

On Fri, Dec 27, 2019 at 2:02 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Fri, 27 Dec 2019 at 12:37, yuzuko <yuzukohosoya@gmail.com> wrote:

As Laurenz commented in this thread, I tried adding an option
to update the parent's statistics during autovacuum. To do that,
I propose supporting the existing 'autovacuum_enabled' option
on partitioned tables.

In the attached patch, you can use the 'autovacuum_enabled' option
on a partitioned table as usual; that is, the default value of this option
is true. So if you don't need autovacuum on a partitioned table,
you have to specify the option:
CREATE TABLE p(i int) partition by range(i) with (autovacuum_enabled=0);

I'm not sure, but I wonder whether a suitable default value of
'autovacuum_enabled' for partitioned tables might be false,
because autovacuum on *partitioned tables* requires scanning
all children to build the partitioned table's statistics.
But if the default value varies according to the relation kind,
would that be confusing? Any thoughts?

I haven't looked at the patch deeply yet, but your patch seems to attempt
to vacuum partitioned tables. IIUC partitioned tables don't need to
be vacuumed; all their child tables are vacuumed instead if we pass
the partitioned table to the vacuum() function. But autovacuum on child
tables is triggered normally, since their statistics are updated.

I think it's a good idea to have that option, but I think that running
autovacuum on the parent table every time autovacuum is triggered
on one of its child tables is very costly, especially when there are
a lot of child tables. Instead, I thought it would be more straightforward
to compare the sum of the child tables' statistics (e.g.
n_live_tuples, n_dead_tuples etc) to the vacuum thresholds when we
consider whether the parent table needs autovacuum. What do you
think?

There's this old email where Tom outlines a few ideas about triggering
auto-analyze on inheritance trees:

/messages/by-id/4823.1262132964@sss.pgh.pa.us

If I'm reading that correctly, the idea is to track only
changes_since_analyze and none of the finer-grained stats like
live/dead tuples for inheritance parents (partitioned tables) using
some new pgstat infrastructure, an idea that Hosoya-san also seems to
be considering per an off-list discussion. Besides the complexity of
getting that infrastructure in place, an important question is whether
the current system of applying threshold and scale factor to
changes_since_analyze should be used as-is for inheritance parents
(partitioned tables), because if users set those parameters the same
way as for regular tables, autovacuum might analyze partitioned tables
more than necessary.

How are you going to track changes_since_analyze of a partitioned table?
It's just an idea, but we can accumulate changes_since_analyze of the
partitioned table by adding each child table's value after analyzing that
child table, and then compare the partitioned table's value to the threshold
that is computed by (autovacuum_analyze_threshold + total rows
including all child tables * autovacuum_analyze_scale_factor).
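
To make the formula above concrete, here is a minimal standalone sketch of the comparison. It is not taken from any patch: ChildStats, parent_analyze_threshold() and parent_needs_analyze() are hypothetical names, and the children's row counts are assumed to be available as plain doubles.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-child summary; not a PostgreSQL struct. */
typedef struct ChildStats
{
    double reltuples;   /* estimated row count of one child table */
} ChildStats;

/*
 * threshold = base_thresh + scale_factor * (sum of the children's rows),
 * i.e. the formula proposed above with the parent's row count replaced
 * by the total over all child tables.
 */
static double
parent_analyze_threshold(const ChildStats *children, int nchildren,
                         int base_thresh, double scale_factor)
{
    double total_rows = 0.0;
    int    i;

    assert(nchildren >= 0);
    for (i = 0; i < nchildren; i++)
        total_rows += children[i].reltuples;

    return (double) base_thresh + scale_factor * total_rows;
}

/* The parent is due for analyze once its accumulated changes exceed it. */
static bool
parent_needs_analyze(double changes_since_analyze, double threshold)
{
    return changes_since_analyze > threshold;
}
```

With three children of 1000, 2000 and 3000 rows, a base threshold of 50 and a scale factor of 0.1, the threshold comes out at about 650 accumulated changes.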

By the way, maybe I'm misunderstanding what Sawada-san wrote above,
but the only missing piece seems to be a way to trigger an *analyze*
on the parent tables -- to collect optimizer statistics for the
inheritance trees -- not vacuum, for which the existing system seems
enough.

Right. We need only autoanalyze on partitioned tables.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8yuzuko
yuzukohosoya@gmail.com
In reply to: Masahiko Sawada (#7)
Re: Autovacuum on partitioned table

Hello,

Besides the complexity of
getting that infrastructure in place, an important question is whether
the current system of applying threshold and scale factor to
changes_since_analyze should be used as-is for inheritance parents
(partitioned tables), because if users set those parameters the same
way as for regular tables, autovacuum might analyze partitioned tables
more than necessary. We'll either need a different formula, or some
commentary in the documentation about how partitioned tables might
need different settings, or maybe both.

I'm not sure but I think we need new autovacuum parameters for
partitioned tables (autovacuum, autovacuum_analyze_threshold,
autovacuum_analyze_scale_factor) because whether it's necessary
to run autovacuum on partitioned tables will depend on users.
What do you think?

How are you going to track changes_since_analyze of a partitioned table?
It's just an idea, but we can accumulate changes_since_analyze of the
partitioned table by adding each child table's value after analyzing that
child table, and then compare the partitioned table's value to the threshold
that is computed by (autovacuum_analyze_threshold + total rows
including all child tables * autovacuum_analyze_scale_factor).

The idea Sawada-san mentioned is similar to mine. Also, for tracking
changes_since_analyze, we have to maintain statistics for the partitioned table.
To do that, we can invent a new PgStat_StatPartitionedTabEntry based
on PgStat_StatTabEntry. After talking with Amit, I think the new structure
needs the following members:

tableid
changes_since_analyze
analyze_timestamp
analyze_count
autovac_analyze_timestamp
autovac_analyze_count

Vacuum doesn't run on partitioned tables, so I think members related to
(auto) vacuum need not be contained in the structure.
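
As a rough illustration only, the member list above could look like the following in C. The three typedefs are stand-ins so that the sketch compiles outside the PostgreSQL tree, the field order is an assumption, and part_entry_record_analyze() is a hypothetical helper showing how the counters might be updated when an analyze is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's own typedefs, so the sketch compiles alone. */
typedef uint32_t Oid;
typedef int64_t  PgStat_Counter;
typedef int64_t  TimestampTz;

/* Hypothetical layout; the member names follow the list above. */
typedef struct PgStat_StatPartitionedTabEntry
{
    Oid            tableid;
    PgStat_Counter changes_since_analyze;
    TimestampTz    analyze_timestamp;
    PgStat_Counter analyze_count;
    TimestampTz    autovac_analyze_timestamp;
    PgStat_Counter autovac_analyze_count;
} PgStat_StatPartitionedTabEntry;

/*
 * Hypothetical helper: record that the partitioned table was analyzed.
 * The change counter is reset, and either the manual or the autovacuum
 * timestamp/count pair is updated.
 */
static void
part_entry_record_analyze(PgStat_StatPartitionedTabEntry *entry,
                          TimestampTz now, bool is_autovacuum)
{
    assert(entry != NULL);

    entry->changes_since_analyze = 0;
    if (is_autovacuum)
    {
        entry->autovac_analyze_timestamp = now;
        entry->autovac_analyze_count++;
    }
    else
    {
        entry->analyze_timestamp = now;
        entry->analyze_count++;
    }
}
```

There are no vacuum-related members here, matching the point that vacuum never runs on the partitioned table itself.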

I'm still writing a patch. I'll send it this week.
--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#9Amit Langote
amitlangote09@gmail.com
In reply to: yuzuko (#8)
Re: Autovacuum on partitioned table

On Wed, Jan 29, 2020 at 11:29 AM yuzuko <yuzukohosoya@gmail.com> wrote:

Besides the complexity of
getting that infrastructure in place, an important question is whether
the current system of applying threshold and scale factor to
changes_since_analyze should be used as-is for inheritance parents
(partitioned tables), because if users set those parameters the same
way as for regular tables, autovacuum might analyze partitioned tables
more than necessary. We'll either need a different formula, or some
commentary in the documentation about how partitioned tables might
need different settings, or maybe both.

I'm not sure but I think we need new autovacuum parameters for
partitioned tables (autovacuum, autovacuum_analyze_threshold,
autovacuum_analyze_scale_factor) because whether it's necessary
to run autovacuum on partitioned tables will depend on users.
What do you think?

Yes, we will need to first support those parameters on partitioned
tables. Currently, you get:

create table p (a int) partition by list (a) with
(autovacuum_analyze_scale_factor=0);
ERROR: unrecognized parameter "autovacuum_analyze_scale_factor"

How are you going to track changes_since_analyze of a partitioned table?
It's just an idea, but we can accumulate changes_since_analyze of the
partitioned table by adding each child table's value after analyzing that
child table, and then compare the partitioned table's value to the threshold
that is computed by (autovacuum_analyze_threshold + total rows
including all child tables * autovacuum_analyze_scale_factor).

The idea Sawada-san mentioned is similar to mine.

So if I understand this idea correctly, a partitioned table's analyze
will only be triggered when partitions are analyzed. That is,
inserts, updates, deletes of tuples in partitions will be tracked by
pgstat, which in turn is used by autovacuum to trigger analyze on
partitions. Then, each partition's changes_since_analyze is added into the
parent's changes_since_analyze, which in turn *may* trigger an analyze
on the parent. I said "may", because it would take multiple partition
analyzes to accumulate enough changes to trigger one on the parent.
Am I getting that right?
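
A toy model of that flow, purely to illustrate the "may" part: ParentCounter and both functions are hypothetical names, and the parent's threshold is assumed to have been computed already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one partitioned table. */
typedef struct ParentCounter
{
    long   changes_since_analyze;  /* accumulated from analyzed partitions */
    double analyze_threshold;      /* precomputed threshold for the parent */
} ParentCounter;

/*
 * Called after one partition has been analyzed: fold that partition's
 * changes into the parent and report whether the parent is now due.
 */
static bool
parent_absorb_partition_analyze(ParentCounter *parent, long partition_changes)
{
    assert(parent != NULL && partition_changes >= 0);

    parent->changes_since_analyze += partition_changes;
    return (double) parent->changes_since_analyze > parent->analyze_threshold;
}

/* Called once the parent itself has been analyzed. */
static void
parent_analyzed(ParentCounter *parent)
{
    assert(parent != NULL);
    parent->changes_since_analyze = 0;
}
```

With a parent threshold of 650, partition analyzes that contribute 300, 300 and 100 changes leave the parent untouched after the first two and make it due only after the third.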

Also, for tracking
changes_since_analyze, we have to maintain statistics for the partitioned table.
To do that, we can invent a new PgStat_StatPartitionedTabEntry based
on PgStat_StatTabEntry. After talking with Amit, I think the new structure
needs the following members:

tableid
changes_since_analyze
analyze_timestamp
analyze_count
autovac_analyze_timestamp
autovac_analyze_count

Vacuum doesn't run on partitioned tables, so I think members related to
(auto) vacuum need not be contained in the structure.

On second thought, maybe we don't need a new PgStat_ struct. We can
just use what's used for regular tables and leave the fields that
don't make sense for partitioned tables set to 0, such as those that
track the counts of scans, tuples, etc. That means we don't have to
mess with interfaces of existing functions, like this one:

static void relation_needs_vacanalyze(Oid relid,
AutoVacOpts *relopts,
Form_pg_class classForm,
PgStat_StatTabEntry *tabentry, ...

Thanks,
Amit

#10Michael Paquier
michael@paquier.xyz
In reply to: Amit Langote (#9)
Re: Autovacuum on partitioned table

On Wed, Jan 29, 2020 at 05:56:40PM +0900, Amit Langote wrote:

Yes, we will need to first support those parameters on partitioned
tables. Currently, you get:

create table p (a int) partition by list (a) with
(autovacuum_analyze_scale_factor=0);
ERROR: unrecognized parameter "autovacuum_analyze_scale_factor"

Worth the note: partitioned tables support zero reloptions as of now,
but there is the facility in place to allow that (see
RELOPT_KIND_PARTITIONED and partitioned_table_reloptions).
--
Michael

#11Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Amit Langote (#9)
Re: Autovacuum on partitioned table

On Wed, 29 Jan 2020 at 17:56, Amit Langote <amitlangote09@gmail.com> wrote:

On Wed, Jan 29, 2020 at 11:29 AM yuzuko <yuzukohosoya@gmail.com> wrote:

Besides the complexity of
getting that infrastructure in place, an important question is whether
the current system of applying threshold and scale factor to
changes_since_analyze should be used as-is for inheritance parents
(partitioned tables), because if users set those parameters the same
way as for regular tables, autovacuum might analyze partitioned tables
more than necessary. We'll either need a different formula, or some
commentary in the documentation about how partitioned tables might
need different settings, or maybe both.

I'm not sure but I think we need new autovacuum parameters for
partitioned tables (autovacuum, autovacuum_analyze_threshold,
autovacuum_analyze_scale_factor) because whether it's necessary
to run autovacuum on partitioned tables will depend on users.
What do you think?

Yes, we will need to first support those parameters on partitioned
tables. Currently, you get:

create table p (a int) partition by list (a) with
(autovacuum_analyze_scale_factor=0);
ERROR: unrecognized parameter "autovacuum_analyze_scale_factor"

How are you going to track changes_since_analyze of a partitioned table?
It's just an idea, but we can accumulate changes_since_analyze of the
partitioned table by adding each child table's value after analyzing that
child table, and then compare the partitioned table's value to the threshold
that is computed by (autovacuum_analyze_threshold + total rows
including all child tables * autovacuum_analyze_scale_factor).

The idea Sawada-san mentioned is similar to mine.

So if I understand this idea correctly, a partitioned table's analyze
will only be triggered when partitions are analyzed. That is,
inserts, updates, deletes of tuples in partitions will be tracked by
pgstat, which in turn is used by autovacuum to trigger analyze on
partitions. Then, each partition's changes_since_analyze is added into the
parent's changes_since_analyze, which in turn *may* trigger an analyze
on the parent. I said "may", because it would take multiple partition
analyzes to accumulate enough changes to trigger one on the parent.
Am I getting that right?

Yeah, that is what I meant. In addition, adding a partition's
changes_since_analyze to its parent needs to be done recursively as
the parent table could also be a partitioned table.

Also, for tracking
changes_since_analyze, we have to maintain statistics for the partitioned table.
To do that, we can invent a new PgStat_StatPartitionedTabEntry based
on PgStat_StatTabEntry. After talking with Amit, I think the new structure
needs the following members:

tableid
changes_since_analyze
analyze_timestamp
analyze_count
autovac_analyze_timestamp
autovac_analyze_count

Vacuum doesn't run on partitioned tables, so I think members related to
(auto) vacuum need not be contained in the structure.

On second thought, maybe we don't need a new PgStat_ struct. We can
just use what's used for regular tables and leave the fields that
don't make sense for partitioned tables set to 0, such as those that
track the counts of scans, tuples, etc. That means we don't have to
mess with interfaces of existing functions, like this one:

static void relation_needs_vacanalyze(Oid relid,
AutoVacOpts *relopts,
Form_pg_class classForm,
PgStat_StatTabEntry *tabentry, ...

+1

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Amit Langote
amitlangote09@gmail.com
In reply to: Masahiko Sawada (#11)
Re: Autovacuum on partitioned table

On Sun, Feb 2, 2020 at 12:53 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

On Wed, 29 Jan 2020 at 17:56, Amit Langote <amitlangote09@gmail.com> wrote:

On Wed, Jan 29, 2020 at 11:29 AM yuzuko <yuzukohosoya@gmail.com> wrote:

How are you going to track changes_since_analyze of a partitioned table?
It's just an idea, but we can accumulate changes_since_analyze of the
partitioned table by adding each child table's value after analyzing that
child table, and then compare the partitioned table's value to the threshold
that is computed by (autovacuum_analyze_threshold + total rows
including all child tables * autovacuum_analyze_scale_factor).

The idea Sawada-san mentioned is similar to mine.

So if I understand this idea correctly, a partitioned table's analyze
will only be triggered when partitions are analyzed. That is,
inserts, updates, deletes of tuples in partitions will be tracked by
pgstat, which in turn is used by autovacuum to trigger analyze on
partitions. Then, each partition's changes_since_analyze is added into the
parent's changes_since_analyze, which in turn *may* trigger an analyze
on the parent. I said "may", because it would take multiple partition
analyzes to accumulate enough changes to trigger one on the parent.
Am I getting that right?

Yeah, that is what I meant. In addition, adding a partition's
changes_since_analyze to its parent needs to be done recursively as
the parent table could also be a partitioned table.

That's a good point. So, changes_since_analyze increments are
essentially propagated from leaf partitions all the way up to the
root table, including any intermediate partitioned tables. We'll need
to consider whether we should propagate only one level at a time (from
the bottom of the tree) or update all parents up to the root, every time a
leaf partition is analyzed. If we do the latter, that might end up
triggering analyze on all the parents at the same time causing
repeated scanning of the same child tables in close intervals,
although setting analyze threshold and scale factor of the parent
tables of respective levels wisely can help avoid any negative impact
of that.
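
The two propagation options can be sketched over a toy partition tree stored as an array. Everything here is hypothetical (RelNode, NO_PARENT and both function names), and the tree is assumed to be acyclic with parents identified by array index.

```c
#include <assert.h>

#define NO_PARENT (-1)

/* Hypothetical node of a partition tree, linked to its parent by index. */
typedef struct RelNode
{
    int  parent;                 /* index of the parent, or NO_PARENT */
    long changes_since_analyze;  /* accumulated change counter */
} RelNode;

/* Option 1: add the leaf's changes to every ancestor, up to the root. */
static void
propagate_to_root(RelNode *rels, int leaf, long changes)
{
    int cur = rels[leaf].parent;

    assert(changes >= 0);
    while (cur != NO_PARENT)
    {
        rels[cur].changes_since_analyze += changes;
        cur = rels[cur].parent;
    }
}

/* Option 2: add the changes to the immediate parent only. */
static void
propagate_one_level(RelNode *rels, int child, long changes)
{
    int parent = rels[child].parent;

    assert(changes >= 0);
    if (parent != NO_PARENT)
        rels[parent].changes_since_analyze += changes;
}
```

Under option 2, an intermediate partitioned table would have to pass its own accumulated counter further up when it is analyzed, so the root lags behind. Under option 1 all ancestors cross their thresholds at roughly the same moment unless their settings differ.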

Thanks,
Amit

#13yuzuko
yuzukohosoya@gmail.com
In reply to: Amit Langote (#12)
1 attachment(s)
Re: Autovacuum on partitioned table

Hello,

I'm sorry for the delay.
Attached is the latest patch, based on the discussion in this thread.

Yeah, that is what I meant. In addition, adding a partition's
changes_since_analyze to its parent needs to be done recursively as
the parent table could also be a partitioned table.

That's a good point. So, changes_since_analyze increments are
essentially propagated from leaf partitions all the way up to the
root table, including any intermediate partitioned tables. We'll need
to consider whether we should propagate only one level at a time (from
the bottom of the tree) or update all parents up to the root, every time a
leaf partition is analyzed.

In this patch, for multi-level partitioning, the changes_since_analyze of
every ancestor is updated whenever a leaf partition is analyzed.
Could you please check the patch again?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v3_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 4a2b6f0dae..88c635f82f 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1310,8 +1310,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..20183a96a4 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -227,7 +227,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -379,7 +379,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1586,13 +1586,12 @@ build_reloptions(Datum reloptions, bool validate,
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aafcf9..161abb6450 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -584,7 +584,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -604,7 +604,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c4420ddd7f..df3d93ea5d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,15 +655,14 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  If the table is a
+	 * partition, report changes_since_analyze of its parent because
+	 * autovacuum process for partitioned tables needs it.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
-		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
+	pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+						  (va_cols == NIL));
 
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6d1f28c327..7d0a5ce30d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2031,11 +2032,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2058,7 +2059,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2087,19 +2089,103 @@ do_autovacuum(void)
 			continue;
 		}
 
-		/* Fetch reloptions and the pgstat entry for this table */
-		relopts = extract_autovac_opts(tuple, pg_class_desc);
-		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-											 shared, dbentry);
+		if (classForm->relkind == RELKIND_RELATION ||
+			classForm->relkind == RELKIND_MATVIEW)
+		{
+			/* Fetch reloptions and the pgstat entry for this table */
+			relopts = extract_autovac_opts(tuple, pg_class_desc);
+			tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+												 shared, dbentry);
+
+			/* Check if it needs vacuum or analyze */
+			relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  &dovacuum, &doanalyze, &wraparound);
+
+			/* Relations that need work are added to table_oids */
+			if (dovacuum || doanalyze)
+				table_oids = lappend_oid(table_oids, relid);
+		}
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, check it using the sum of its
+			 * children's reltuples and the changes_since_analyze tracked by the stats
+			 * collector.  Only auto-analyze is checked; partitioned tables need no vacuum.
+			 */
+			List     *tableOIDs;
+			ListCell *lc;
+			bool      av_enabled;
+			int       anl_base_thresh;
+			float4    all_reltuples = 0,
+			          anl_scale_factor,
+				      anlthresh,
+				      reltuples,
+				      anltuples;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, tableOIDs)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclassForm;
 
-		/* Check if it needs vacuum or analyze */
-		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
-								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+				/* Ignore the parent table */
+				if (childOID == relid)
+					continue;
 
-		/* Relations that need work are added to table_oids */
-		if (dovacuum || doanalyze)
-			table_oids = lappend_oid(table_oids, relid);
+				childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+				childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip foreign partitions */
+				if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
+					continue;
+
+				/* Sum up the child's reltuples for its parent table */
+				all_reltuples += childclassForm->reltuples;
+				elog(NOTICE, "[parent:%s] child%s has %.0f tuples", NameStr(classForm->relname),NameStr(childclassForm->relname), childclassForm->reltuples);
+			}
+
+
+			/* Fetch reloptions and the pgstat entry for the partitioned table */
+			relopts = extract_autovac_opts(tuple, pg_class_desc);
+			tabentry = get_pgstat_tabentry_relid(relid,
+												 classForm->relisshared,
+												 shared, dbentry);
+
+			/* Check if it needs auto analyze */
+			av_enabled = (relopts ? relopts->enabled : true);
+
+			if (av_enabled)
+			{
+				anl_scale_factor = (relopts && relopts->analyze_scale_factor >= 0)
+					? relopts->analyze_scale_factor
+					: autovacuum_anl_scale;
+
+				anl_base_thresh = (relopts && relopts->analyze_threshold >= 0)
+					? relopts->analyze_threshold
+					: autovacuum_anl_thresh;
+
+				elog(NOTICE, "[parent:%s] has %.0f tuples", NameStr(classForm->relname), all_reltuples);
+				if (PointerIsValid(tabentry))
+				{
+					reltuples = all_reltuples;
+					anltuples = tabentry->changes_since_analyze;
+					anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
+
+					elog(DEBUG3, "%s: anl: %.0f (threshold %.0f)",
+						 NameStr(classForm->relname),
+						 anltuples, anlthresh);
+
+					/* Determine if this table needs analyze. */
+					doanalyze = (anltuples > anlthresh);
+				}
+				if (doanalyze)
+					table_oids = lappend_oid(table_oids, relid);
+			}
+		}
 
 		/*
 		 * Remember TOAST associations for the second pass.  Note: we must do
@@ -2720,6 +2806,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2793,33 +2880,105 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		return NULL;
 	classForm = (Form_pg_class) GETSTRUCT(classTup);
 
-	/*
-	 * Get the applicable reloptions.  If it is a TOAST table, try to get the
-	 * main table reloptions if the toast table itself doesn't have.
-	 */
-	avopts = extract_autovac_opts(classTup, pg_class_desc);
-	if (classForm->relkind == RELKIND_TOASTVALUE &&
-		avopts == NULL && table_toast_map != NULL)
+	if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
 	{
-		av_relation *hentry;
-		bool		found;
+		/*
+		 * Get the applicable reloptions.  If it is a TOAST table, try to get the
+		 * main table reloptions if the toast table itself doesn't have.
+		 */
+		avopts = extract_autovac_opts(classTup, pg_class_desc);
+		if (classForm->relkind == RELKIND_TOASTVALUE &&
+			avopts == NULL && table_toast_map != NULL)
+		{
+			av_relation *hentry;
+			bool		found;
 
-		hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
-		if (found && hentry->ar_hasrelopts)
-			avopts = &hentry->ar_reloptions;
+			hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
+			if (found && hentry->ar_hasrelopts)
+				avopts = &hentry->ar_reloptions;
+		}
+
+		/* fetch the pgstat table entry */
+		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+											 shared, dbentry);
+
+		relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
+								  effective_multixact_freeze_max_age,
+								  &dovacuum, &doanalyze, &wraparound);
+
+		/* ignore ANALYZE for toast tables */
+		if (classForm->relkind == RELKIND_TOASTVALUE)
+			doanalyze = false;
 	}
+	else
+	{
+		List     *tableOIDs;
+		ListCell *lc;
+		bool      av_enabled;
+		int       anl_base_thresh;
+		float4    all_reltuples = 0,
+			      anl_scale_factor,
+			      anlthresh,
+			      reltuples,
+			      anltuples;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, tableOIDs)
+		{
+			Oid       childOID = lfirst_oid(lc);
+			HeapTuple childtuple;
+			Form_pg_class childclassForm;
 
-	/* fetch the pgstat table entry */
-	tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-										 shared, dbentry);
+			/* Ignore the parent table */
+			if (childOID == relid)
+				continue;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
 
-	relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
-							  effective_multixact_freeze_max_age,
-							  &dovacuum, &doanalyze, &wraparound);
+			/* Skip foreign partitions */
+			if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Sum up the child's reltuples for the partitioned table */
+			all_reltuples += childclassForm->reltuples;
+		}
 
-	/* ignore ANALYZE for toast tables */
-	if (classForm->relkind == RELKIND_TOASTVALUE)
-		doanalyze = false;
+		/* Fetch reloptions and the pgstat entry for the partitioned table */
+		avopts = extract_autovac_opts(classTup, pg_class_desc);
+		tabentry = get_pgstat_tabentry_relid(relid,
+											 classForm->relisshared,
+											 shared, dbentry);
+
+		/* Check if it needs auto analyze */
+		av_enabled = (avopts ? avopts->enabled : true);
+
+		if (av_enabled)
+		{
+			anl_scale_factor = (avopts && avopts->analyze_scale_factor >= 0)
+				? avopts->analyze_scale_factor
+				: autovacuum_anl_scale;
+
+			anl_base_thresh = (avopts && avopts->analyze_threshold >= 0)
+				? avopts->analyze_threshold
+				: autovacuum_anl_thresh;
+
+			if (PointerIsValid(tabentry))
+			{
+				reltuples = all_reltuples;
+				anltuples = tabentry->changes_since_analyze;
+				anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
+
+				elog(DEBUG3, "%s: anl: %.0f (threshold %.0f)",
+					 NameStr(classForm->relname),
+					 anltuples, anlthresh);
+
+				doanalyze = (anltuples > anlthresh);
+			}
+		}
+	}
 
 	/* OK, it needs something done */
 	if (doanalyze || dovacuum)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 59dc4f31ab..1933da145a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -67,6 +68,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 /* ----------
@@ -322,6 +324,7 @@ static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, in
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partanalyze(PgStat_MsgPartAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1463,6 +1466,32 @@ pgstat_report_analyze(Relation rel,
 		deadtuples = Max(deadtuples, 0);
 	}
 
+	/*
+	 * If the table is a leaf partition, tell the stats collector its parent's
+	 * changes_since_analyze for auto analyze
+	 */
+	if (rel->rd_rel->relispartition &&
+		!(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+	{
+		Oid      parentoid;
+		Relation parentrel;
+		PgStat_StatDBEntry *dbentry;
+		PgStat_StatTabEntry *tabentry;
+
+		/* Get its parent table's Oid and relation */
+		parentoid = get_partition_parent(RelationGetRelid(rel));
+		parentrel = table_open(parentoid, AccessShareLock);
+
+		/* Fetch the pgstat for this table */
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+		tabentry = pgstat_get_tab_entry(dbentry, RelationGetRelid(rel), true);
+
+		/* Report changes_since_analyze to the stats collector */
+		pgstat_report_partanalyze(parentrel, tabentry->changes_since_analyze);
+
+		table_close(parentrel, AccessShareLock);
+	}
+
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
@@ -1475,6 +1504,49 @@ pgstat_report_analyze(Relation rel,
 }
 
 /* --------
+ * pgstat_report_partanalyze() -
+ *
+ *	Tell the collector about the parent table of which partition just analyzed.
+ *
+ * Caller must provide a child's changes_since_analyze as a parents.
+ * --------
+ */
+void
+pgstat_report_partanalyze(Relation rel, PgStat_Counter changes_tuples)
+{
+	PgStat_MsgPartAnalyze msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	/*
+	 * If the partitioned table is also a partition, tell the stats collector
+	 * its parent's changes_since_analyze for auto analyze
+	 */
+	if (rel->rd_rel->relispartition)
+	{
+		Oid      parentoid;
+		Relation parentrel;
+		
+		/* Get its parent table's Oid and relation */
+		parentoid = get_partition_parent(RelationGetRelid(rel));
+		parentrel = table_open(parentoid, AccessShareLock);
+
+		/* Report changes_since_analyze to the stats collector */
+		pgstat_report_partanalyze(parentrel, changes_tuples);
+
+		table_close(parentrel, AccessShareLock);
+	}
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTANALYZE);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changes_tuples = changes_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1749,6 +1821,7 @@ pgstat_initstats(Relation rel)
 	/* We only count stats for things that have storage */
 	if (!(relkind == RELKIND_RELATION ||
 		  relkind == RELKIND_MATVIEW ||
+		  relkind == RELKIND_PARTITIONED_TABLE ||
 		  relkind == RELKIND_INDEX ||
 		  relkind == RELKIND_TOASTVALUE ||
 		  relkind == RELKIND_SEQUENCE))
@@ -4592,6 +4665,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTANALYZE:
+					pgstat_recv_partanalyze(&msg.msg_partanalyze, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6239,6 +6316,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partanalyze(PgStat_MsgPartAnalyze *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changes_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a65a51696..590885d7e8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -57,6 +57,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTANALYZE,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_FUNCSTAT,
@@ -389,6 +390,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartAnalyze		Sent by the backend or autovacuum daemon
+ *                              after ANALYZE for partitioned tables
+ * ----------
+ */
+typedef struct PgStat_MsgPartAnalyze
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changes_tuples;
+} PgStat_MsgPartAnalyze;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -562,6 +575,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartAnalyze msg_partanalyze;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgFuncstat msg_funcstat;
@@ -1267,7 +1281,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partanalyze(Relation rel, PgStat_Counter changes_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f8256f7..7e9f6de9cb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1789,7 +1789,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2117,7 +2117,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#14Amit Langote
amitlangote09@gmail.com
In reply to: yuzuko (#13)
Re: Autovacuum on partitioned table

Hosoya-san,

On Thu, Feb 20, 2020 at 3:34 PM yuzuko <yuzukohosoya@gmail.com> wrote:

Attached is the latest patch, based on the discussion in this thread.

Yeah that is what I meant. In addition, adding partition's
changes_since_analyze to its parent needs to be done recursively as
the parent table could also be a partitioned table.

That's a good point. So, changes_since_analyze increments are
essentially propagated from leaf partitions all the way up to the
root table, including any intermediate partitioned tables. We'll need
to consider whether we should propagate only one level at a time (from
bottom of the tree) or update all parents up to the root, every time a
leaf partition is analyzed.

In this patch, for multi-level partitioning, the changes_since_analyze of
every ancestor is updated whenever a leaf partition is analyzed.
Could you please check the patch again?
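
The "update all parents up to the root" behaviour described above can be sketched roughly as follows. This is only an illustration, not code from the patch: RelStats and propagate_changes() are hypothetical names, and the real patch sends a message to the stats collector for each ancestor instead of walking pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a relation's stats entry; not a PostgreSQL struct. */
typedef struct RelStats
{
	struct RelStats *parent;	/* NULL for the root of the partition tree */
	long		changes_since_analyze;
} RelStats;

/*
 * Add a just-analyzed leaf's change count to every ancestor, from its
 * direct parent up to the root.  The leaf's own counter is not touched here.
 */
void
propagate_changes(const RelStats *leaf, long changed_tuples)
{
	RelStats   *ancestor;

	assert(leaf != NULL);

	for (ancestor = leaf->parent; ancestor != NULL; ancestor = ancestor->parent)
		ancestor->changes_since_analyze += changed_tuples;
}
```

With a three-level tree (root, intermediate partitioned table, leaf), a single call for the leaf increments both the intermediate table's and the root's counters by the same amount.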

Thank you for the new patch.

I built and confirmed that the patch works.

Here are some comments:

* White-space noise in the diff (space used where tab is expected);
please check with git diff --check and fix.

* Names changes_tuples, m_changes_tuples should be changed_tuples and
m_changed_tuples, respectively?

* Did you intend to make it so that we now report *all* inherited
stats to the stats collector, not just those for partitioned tables?
IOW, did you intend the new feature to also cover traditional
inheritance parents? I am talking about the following diff:

     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
-     * auto-analyze to do.
+     * Report ANALYZE to the stats collector, too.  If the table is a
+     * partition, report changes_since_analyze of its parent because
+     * autovacuum process for partitioned tables needs it.  Reset the
+     * changes_since_analyze counter only if we analyzed all columns;
+     * otherwise, there is still work for auto-analyze to do.
      */
-    if (!inh)
-        pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-                              (va_cols == NIL));
+    pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+                          (va_cols == NIL));

* I may be missing something, but why doesn't do_autovacuum() fetch a
partitioned table's entry from pgstat instead of fetching that for
individual children and adding? That is, why do we need to do the
following:

+            /*
+             * If the relation is a partitioned table, we check it using reltuples
+             * added up childrens' and changes_since_analyze tracked by stats collector.

More later...

Thanks,
Amit

#15Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#14)
Re: Autovacuum on partitioned table

On Thu, Feb 20, 2020 at 4:50 PM Amit Langote <amitlangote09@gmail.com> wrote:

* I may be missing something, but why doesn't do_autovacuum() fetch a
partitioned table's entry from pgstat instead of fetching that for
individual children and adding? That is, why do we need to do the
following:

+            /*
+             * If the relation is a partitioned table, we check it using reltuples
+             * added up childrens' and changes_since_analyze tracked by stats collector.

Oh, it's only adding up the children's pg_class.reltuples, not pgstat
stats. We need to do that because a partitioned table's
pg_class.reltuples is always 0, and correctly so. Sorry for not
reading the patch properly.
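
To make the point concrete, here is a minimal sketch of why the children's reltuples have to be summed. ChildRel, sum_child_reltuples() and analyze_threshold() are hypothetical names for illustration; only the formula (base threshold + scale factor * reltuples) and the rule of skipping foreign partitions come from the patch. The numbers in the note below are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a child's pg_class row. */
typedef struct ChildRel
{
	bool		is_foreign;		/* foreign partitions are skipped */
	float		reltuples;		/* the child's pg_class.reltuples */
} ChildRel;

/* Sum reltuples over the non-foreign children of a partitioned table. */
float
sum_child_reltuples(const ChildRel *children, size_t nchildren)
{
	float		total = 0;
	size_t		i;

	assert(children != NULL || nchildren == 0);

	for (i = 0; i < nchildren; i++)
	{
		if (children[i].is_foreign)
			continue;
		total += children[i].reltuples;
	}
	return total;
}

/* Analyze threshold: base threshold + scale factor * reltuples. */
float
analyze_threshold(int base_thresh, float scale_factor, float reltuples)
{
	return (float) base_thresh + scale_factor * reltuples;
}
```

If the parent's own reltuples (0) were used, the threshold would collapse to the base threshold alone, so the parent would be analyzed after only a handful of changes no matter how large its partitions are.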

Thanks,
Amit

#16Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#15)
1 attachment(s)
Re: Autovacuum on partitioned table

On Thu, Feb 20, 2020 at 5:32 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Feb 20, 2020 at 4:50 PM Amit Langote <amitlangote09@gmail.com> wrote:

* I may be missing something, but why doesn't do_autovacuum() fetch a
partitioned table's entry from pgstat instead of fetching that for
individual children and adding? That is, why do we need to do the
following:

+            /*
+             * If the relation is a partitioned table, we check it using reltuples
+             * added up childrens' and changes_since_analyze tracked by stats collector.

Oh, it's only adding up the children's pg_class.reltuples, not pgstat
stats. We need to do that because a partitioned table's
pg_class.reltuples is always 0, and correctly so. Sorry for not
reading the patch properly.

Having read the relevant diffs again, I think this could be done
without duplicating code too much. You seem to have added the same
logic in two places: do_autovacuum() and table_recheck_autovac().
More importantly, part of the logic of relation_needs_vacanalyze() is
duplicated in both of the aforementioned places, which I think is
unnecessary and undesirable if you consider maintainability. I think
we could just add the logic to compute reltuples for partitioned
tables at the beginning of relation_needs_vacanalyze() and be done. I
have attached a delta patch to show what I mean. Please check and
tell me what you think.

Thanks,
Amit

Attachments:

v3_amit_delta.patch (text/plain; charset=US-ASCII)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7d0a5ce30d..ca6996e448 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2089,103 +2089,19 @@ do_autovacuum(void)
 			continue;
 		}
 
-		if (classForm->relkind == RELKIND_RELATION ||
-			classForm->relkind == RELKIND_MATVIEW)
-		{
-			/* Fetch reloptions and the pgstat entry for this table */
-			relopts = extract_autovac_opts(tuple, pg_class_desc);
-			tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-												 shared, dbentry);
-
-			/* Check if it needs vacuum or analyze */
-			relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
-									  effective_multixact_freeze_max_age,
-									  &dovacuum, &doanalyze, &wraparound);
-
-			/* Relations that need work are added to table_oids */
-			if (dovacuum || doanalyze)
-				table_oids = lappend_oid(table_oids, relid);
-		}
-		else
-		{
-			/*
-			 * If the relation is a partitioned table, we check it using reltuples
-			 * added up childrens' and changes_since_analyze tracked by stats collector.
-			 * We check only auto analyze because partitioned tables don't need to vacuum.
-			 */
-			List     *tableOIDs;
-			ListCell *lc;
-			bool      av_enabled;
-			int       anl_base_thresh;
-			float4    all_reltuples = 0,
-			          anl_scale_factor,
-				      anlthresh,
-				      reltuples,
-				      anltuples;
-
-			/* Find all members of inheritance set taking AccessShareLock */
-			tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
-
-			foreach(lc, tableOIDs)
-			{
-				Oid        childOID = lfirst_oid(lc);
-				HeapTuple  childtuple;
-				Form_pg_class childclassForm;
-
-				/* Ignore the parent table */
-				if (childOID == relid)
-					continue;
-
-				childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
-				childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
-
-				/* Skip foreign partitions */
-				if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
-					continue;
-
-				/* Sum up the child's reltuples for its parent table */
-				all_reltuples += childclassForm->reltuples;
-				elog(NOTICE, "[parent:%s] child%s has %.0f tuples", NameStr(classForm->relname),NameStr(childclassForm->relname), childclassForm->reltuples);
-			}
-
-
-			/* Fetch reloptions and the pgstat entry for the partitioned table */
-			relopts = extract_autovac_opts(tuple, pg_class_desc);
-			tabentry = get_pgstat_tabentry_relid(relid,
-												 classForm->relisshared,
-												 shared, dbentry);
-
-			/* Check if it needs auto analyze */
-			av_enabled = (relopts ? relopts->enabled : true);
-
-			if (av_enabled)
-			{
-				anl_scale_factor = (relopts && relopts->analyze_scale_factor >= 0)
-					? relopts->analyze_scale_factor
-					: autovacuum_anl_scale;
-
-				anl_base_thresh = (relopts && relopts->analyze_threshold >= 0)
-					? relopts->analyze_threshold
-					: autovacuum_anl_thresh;
-
-				elog(NOTICE, "[parent:%s] has %.0f tuples", NameStr(classForm->relname), all_reltuples);
-				if (PointerIsValid(tabentry))
-				{
-					reltuples = all_reltuples;
-					anltuples = tabentry->changes_since_analyze;
-					anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
+		/* Fetch reloptions and the pgstat entry for this table */
+		relopts = extract_autovac_opts(tuple, pg_class_desc);
+		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+											 shared, dbentry);
 
-					elog(DEBUG3, "%s: anl: %.0f (threshold %.0f)",
-						 NameStr(classForm->relname),
-						 anltuples, anlthresh);
+		/* Check if it needs vacuum or analyze */
+		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
+								  effective_multixact_freeze_max_age,
+								  &dovacuum, &doanalyze, &wraparound);
 
-					/* Determine if this table needs analyze. */
-					doanalyze = (anltuples > anlthresh);
-				}
-				if (doanalyze)
-					table_oids = lappend_oid(table_oids, relid);
-			}
-		}
+		/* Relations that need work are added to table_oids */
+		if (dovacuum || doanalyze)
+			table_oids = lappend_oid(table_oids, relid);
 
 		/*
 		 * Remember TOAST associations for the second pass.  Note: we must do
@@ -2880,105 +2796,33 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		return NULL;
 	classForm = (Form_pg_class) GETSTRUCT(classTup);
 
-	if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+	/*
+	 * Get the applicable reloptions.  If it is a TOAST table, try to get the
+	 * main table reloptions if the toast table itself doesn't have.
+	 */
+	avopts = extract_autovac_opts(classTup, pg_class_desc);
+	if (classForm->relkind == RELKIND_TOASTVALUE &&
+		avopts == NULL && table_toast_map != NULL)
 	{
-		/*
-		 * Get the applicable reloptions.  If it is a TOAST table, try to get the
-		 * main table reloptions if the toast table itself doesn't have.
-		 */
-		avopts = extract_autovac_opts(classTup, pg_class_desc);
-		if (classForm->relkind == RELKIND_TOASTVALUE &&
-			avopts == NULL && table_toast_map != NULL)
-		{
-			av_relation *hentry;
-			bool		found;
-
-			hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
-			if (found && hentry->ar_hasrelopts)
-				avopts = &hentry->ar_reloptions;
-		}
-
-		/* fetch the pgstat table entry */
-		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-											 shared, dbentry);
-
-		relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
-								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+		av_relation *hentry;
+		bool		found;
 
-		/* ignore ANALYZE for toast tables */
-		if (classForm->relkind == RELKIND_TOASTVALUE)
-			doanalyze = false;
+		hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
+		if (found && hentry->ar_hasrelopts)
+			avopts = &hentry->ar_reloptions;
 	}
-	else
-	{
-		List     *tableOIDs;
-		ListCell *lc;
-		bool      av_enabled;
-		int       anl_base_thresh;
-		float4    all_reltuples,
-			      anl_scale_factor,
-			      anlthresh,
-			      reltuples,
-			      anltuples;
 
-		/* Find all members of inheritance set taking AccessShareLock */
-		tableOIDs = find_all_inheritors(relid, AccessShareLock, NULL);
+	/* fetch the pgstat table entry */
+	tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+										 shared, dbentry);
 
-		foreach(lc, tableOIDs)
-		{
-			Oid       childOID = lfirst_oid(lc);
-			HeapTuple childtuple;
-			Form_pg_class childclassForm;
+	relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
+							  effective_multixact_freeze_max_age,
+							  &dovacuum, &doanalyze, &wraparound);
 
-			/* Ignore the parent table */
-			if (childOID == relid)
-				continue;
-
-			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
-			childclassForm = (Form_pg_class) GETSTRUCT(childtuple);
-
-			/* Skip foreign partitions */
-			if (childclassForm->relkind == RELKIND_FOREIGN_TABLE)
-				continue;
-
-			/* Sum up the child's reltuples for the partitioned table */
-			all_reltuples += childclassForm->reltuples;
-		}
-
-		/* Fetch reloptions and the pgstat entry for the partitioned table */
-		avopts = extract_autovac_opts(classTup, pg_class_desc);
-		tabentry = get_pgstat_tabentry_relid(relid,
-											 classForm->relisshared,
-											 shared, dbentry);
-
-		/* Check if it needs auto analyze */
-		av_enabled = (avopts ? avopts->enabled : true);
-
-		if (av_enabled)
-		{
-			anl_scale_factor = (avopts && avopts->analyze_scale_factor >= 0)
-				? avopts->analyze_scale_factor
-				: autovacuum_anl_scale;
-
-			anl_base_thresh = (avopts && avopts->analyze_threshold >= 0)
-				? avopts->analyze_threshold
-				: autovacuum_anl_thresh;
-
-			if (PointerIsValid(tabentry))
-			{
-				reltuples = all_reltuples;
-				anltuples = tabentry->changes_since_analyze;
-				anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
-
-				elog(DEBUG3, "%s: anl: %.0f (threshold %.0f)",
-					 NameStr(classForm->relname),
-					 anltuples, anlthresh);
-
-				doanalyze = (anltuples > anlthresh);
-			}
-		}
-	}
+	/* ignore ANALYZE for toast tables */
+	if (classForm->relkind == RELKIND_TOASTVALUE)
+		doanalyze = false;
 
 	/* OK, it needs something done */
 	if (doanalyze || dovacuum)
@@ -3148,6 +2992,42 @@ relation_needs_vacanalyze(Oid relid,
 	AssertArg(classForm != NULL);
 	AssertArg(OidIsValid(relid));
 
+	/*
+	 * If the relation is a partitioned table, we must add up children's
+	 * reltuples.
+	 */
+	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *children;
+		ListCell *lc;
+
+		reltuples = 0;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, children)
+		{
+			Oid        childOID = lfirst_oid(lc);
+			HeapTuple  childtuple;
+			Form_pg_class childclass;
+
+			/* Ignore the parent table */
+			if (childOID == relid)
+				continue;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+			/* Skip foreign partitions */
+			if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Sum up the child's reltuples for its parent table */
+			reltuples += childclass->reltuples;
+		}
+	}
+
 	/*
 	 * Determine vacuum/analyze equation parameters.  We have two possible
 	 * sources: the passed reloptions (which could be a main table or a toast
#17yuzuko
yuzukohosoya@gmail.com
In reply to: Amit Langote (#16)
1 attachment(s)
Re: Autovacuum on partitioned table

Hello Amit-san,

Thanks for your comments.

* White-space noise in the diff (space used where tab is expected);
please check with git diff --check and fix.

Fixed it.

* Names changes_tuples, m_changes_tuples should be changed_tuples and
m_changed_tuples, respectively?

Yes, I modified it.

* Did you intend to make it so that we now report *all* inherited
stats to the stats collector, not just those for partitioned tables?
IOW, did you intend the new feature to also cover traditional
inheritance parents? I am talking about the following diff:

I modified it as follows so that this feature applies only to declarative partitioning.

-    if (!inh)
-        pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-                              (va_cols == NIL));
+    if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+        pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+                              (va_cols == NIL));

Having read the relevant diffs again, I think this could be done
without duplicating code too much. You seem to have added the same
logic in two places: do_autovacuum() and table_recheck_autovac().
More importantly, part of the logic of relation_needs_vacanalyze() is
duplicated in both of the aforementioned places, which I think is
unnecessary and undesirable if you consider maintainability. I think
we could just add the logic to compute reltuples for partitioned
tables at the beginning of relation_needs_vacanalyze() and be done.

Yes, indeed. Partitioned tables don't need to be vacuumed, so I had added a
separate check for partitioned tables outside relation_needs_vacanalyze().
However, a partitioned table's tabentry->n_dead_tuples is always 0, so
dovacuum is always false. So I think that checking both autovacuum
and autoanalyze for partitioned tables does no harm. I merged v3_amit_delta.patch
into the new patch and found a minor bug: the partitioned table's reltuples was
overwritten with its classForm->reltuples, so I fixed it.

Also, I think a partitioned table's changes_since_analyze should be reported
only when running in an autovacuum worker process, so I fixed that too.

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v4_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 4a2b6f0dae..88c635f82f 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1310,8 +1310,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..20183a96a4 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -227,7 +227,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -379,7 +379,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1586,13 +1586,12 @@ build_reloptions(Datum reloptions, bool validate,
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aafcf9..161abb6450 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -584,7 +584,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -604,7 +604,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c4420ddd7f..bbfdf75dbe 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,15 +655,15 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  If the table is a
+	 * partition, report changes_since_analyze of its parent because
+	 * autovacuum process for partitioned tables needs it.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
-		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+						  (va_cols == NIL));
 
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6d1f28c327..aabb1903de 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2031,11 +2032,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2058,7 +2059,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2720,6 +2722,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2990,6 +2993,42 @@ relation_needs_vacanalyze(Oid relid,
 	AssertArg(OidIsValid(relid));
 
 	/*
+	 * If the relation is a partitioned table, we must add up children's
+	 * reltuples.
+	 */
+	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *children;
+		ListCell *lc;
+
+		reltuples = 0;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, children)
+		{
+			Oid        childOID = lfirst_oid(lc);
+			HeapTuple  childtuple;
+			Form_pg_class childclass;
+
+			/* Ignore the parent table */
+			if (childOID == relid)
+				continue;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+			/* Skip foreign partitions */
+			if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Sum up the child's reltuples for its parent table */
+			reltuples += childclass->reltuples;
+		}
+	}
+
+	/*
 	 * Determine vacuum/analyze equation parameters.  We have two possible
 	 * sources: the passed reloptions (which could be a main table or a toast
 	 * table), or the autovacuum GUC variables.
@@ -3056,7 +3095,8 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
 		vactuples = tabentry->n_dead_tuples;
 		anltuples = tabentry->changes_since_analyze;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 59dc4f31ab..ced2599050 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -67,6 +68,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 /* ----------
@@ -322,6 +324,7 @@ static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, in
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partanalyze(PgStat_MsgPartAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1463,6 +1466,33 @@ pgstat_report_analyze(Relation rel,
 		deadtuples = Max(deadtuples, 0);
 	}
 
+	/*
+	 * If the table is a leaf partition, tell the stats collector its parent's
+	 * changes_since_analyze for auto analyze
+	 */
+	if (IsAutoVacuumWorkerProcess() &&
+		rel->rd_rel->relispartition &&
+		!(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+	{
+		Oid      parentoid;
+		Relation parentrel;
+		PgStat_StatDBEntry *dbentry;
+		PgStat_StatTabEntry *tabentry;
+
+		/* Get its parent table's Oid and relation */
+		parentoid = get_partition_parent(RelationGetRelid(rel));
+		parentrel = table_open(parentoid, AccessShareLock);
+
+		/* Fetch the pgstat for this table */
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+		tabentry = pgstat_get_tab_entry(dbentry, RelationGetRelid(rel), true);
+
+		/* Report changes_since_analyze to the stats collector */
+		pgstat_report_partanalyze(parentrel, tabentry->changes_since_analyze);
+
+		table_close(parentrel, AccessShareLock);
+	}
+
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
@@ -1475,6 +1505,49 @@ pgstat_report_analyze(Relation rel,
 }
 
 /* --------
+ * pgstat_report_partanalyze() -
+ *
+ *	Tell the collector about the parent table of which partition just analyzed.
+ *
+ * Caller must provide a child's changes_since_analyze as a parents.
+ * --------
+ */
+void
+pgstat_report_partanalyze(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartAnalyze msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	/*
+	 * If the partitioned table is also a partition, tell the stats collector
+	 * its parent's changes_since_analyze for auto analyze
+	 */
+	if (rel->rd_rel->relispartition)
+	{
+		Oid      parentoid;
+		Relation parentrel;
+
+		/* Get its parent table's Oid and relation */
+		parentoid = get_partition_parent(RelationGetRelid(rel));
+		parentrel = table_open(parentoid, AccessShareLock);
+
+		/* Report changes_since_analyze to the stats collector */
+		pgstat_report_partanalyze(parentrel, changed_tuples);
+
+		table_close(parentrel, AccessShareLock);
+	}
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTANALYZE);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1749,6 +1822,7 @@ pgstat_initstats(Relation rel)
 	/* We only count stats for things that have storage */
 	if (!(relkind == RELKIND_RELATION ||
 		  relkind == RELKIND_MATVIEW ||
+		  relkind == RELKIND_PARTITIONED_TABLE ||
 		  relkind == RELKIND_INDEX ||
 		  relkind == RELKIND_TOASTVALUE ||
 		  relkind == RELKIND_SEQUENCE))
@@ -4592,6 +4666,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTANALYZE:
+					pgstat_recv_partanalyze(&msg.msg_partanalyze, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6239,6 +6317,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partanalyze(PgStat_MsgPartAnalyze *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a65a51696..9bb872b171 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -57,6 +57,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTANALYZE,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_FUNCSTAT,
@@ -389,6 +390,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartAnalyze		Sent by the backend or autovacuum daemon
+ *                              after ANALYZE for partitioned tables
+ * ----------
+ */
+typedef struct PgStat_MsgPartAnalyze
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartAnalyze;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -562,6 +575,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartAnalyze msg_partanalyze;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgFuncstat msg_funcstat;
@@ -1267,7 +1281,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partanalyze(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f8256f7..7e9f6de9cb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1789,7 +1789,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2117,7 +2117,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#18Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: yuzuko (#17)
Re: Autovacuum on partitioned table

On Fri, 21 Feb 2020 at 15:14, yuzuko <yuzukohosoya@gmail.com> wrote:

Hello Amit-san,

Thanks for your comments.

* White-space noise in the diff (space used where tab is expected);
please check with git diff --check and fix.

Fixed it.

* Names changes_tuples, m_changes_tuples should be changed_tuples and
m_changed_tuples, respectively?

Yes, I modified it.

* Did you intend to make it so that we now report *all* inherited
stats to the stats collector, not just those for partitioned tables?
IOW, did you intend the new feature to also cover traditional
inheritance parents? I am talking about the following diff:

I modified it as follows so that this feature applies only to declarative partitioning.

- if (!inh)
-  pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-         (va_cols == NIL));
+ if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+        (va_cols == NIL));

Having read the relevant diffs again, I think this could be done
without duplicating code too much. You seem to have added the same
logic in two places: do_autovacuum() and table_recheck_autovac().
More importantly, part of the logic of relation_needs_vacanalyze() is
duplicated in both of the aforementioned places, which I think is
unnecessary and undesirable if you consider maintainability. I think
we could just add the logic to compute reltuples for partitioned
tables at the beginning of relation_needs_vacanalyze() and be done.

Yes, indeed. Partitioned tables don't need to be vacuumed, so I had added a
separate check for partitioned tables outside relation_needs_vacanalyze().
However, a partitioned table's tabentry->n_dead_tuples is always 0, so
dovacuum is always false. So I think that checking both autovacuum
and analyze for partitioned tables doesn't matter. I merged v3_amit_delta.patch
into the new patch and found a minor bug: the partitioned table's reltuples was
overwritten with its classForm->reltuples, so I fixed it.
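
As a rough illustration of the check being discussed (this is not code from the patch), the analyze decision in relation_needs_vacanalyze() compares changes_since_analyze against autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples. The sketch below assumes the parent's reltuples is the sum over its non-foreign children; ChildInfo, sum_child_reltuples() and needs_analyze() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a child's pg_class row. */
typedef struct ChildInfo
{
	char		relkind;		/* 'r' = plain table, 'f' = foreign table */
	double		reltuples;		/* row estimate stored for this child */
} ChildInfo;

/* Sum reltuples over the children, skipping foreign partitions. */
double
sum_child_reltuples(const ChildInfo *children, size_t nchildren)
{
	double		total = 0;

	for (size_t i = 0; i < nchildren; i++)
	{
		if (children[i].relkind == 'f')
			continue;
		total += children[i].reltuples;
	}
	return total;
}

/* Analyze is due when the change count exceeds base + scale * reltuples. */
bool
needs_analyze(double reltuples, double changes_since_analyze,
			  int base_thresh, double scale_factor)
{
	assert(reltuples >= 0 && scale_factor >= 0);

	double		anlthresh = (double) base_thresh + scale_factor * reltuples;

	return changes_since_analyze > anlthresh;
}
```

With this shape, overwriting the summed value with the parent's own classForm->reltuples would shrink the threshold to roughly the base value, which is why the overwrite matters.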

Also, I think partitioned tables' changes_since_analyze should be reported
only from an autovacuum worker process, so I fixed that too.
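
A minimal sketch of the propagation idea, assuming a simplified in-memory model rather than the real stats collector: when a leaf partition is analyzed, its changes_since_analyze is added to every ancestor. TableStats, NO_PARENT and propagate_changes() are hypothetical names; the patch itself walks up the tree recursively through pgstat_report_partanalyze(), while this sketch uses a loop.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PARENT (-1)

/* Hypothetical in-memory model of per-table stats entries. */
typedef struct TableStats
{
	int			parent;			/* index of the parent, or NO_PARENT */
	long		changes_since_analyze;
} TableStats;

/*
 * Add the leaf's changes_since_analyze to each ancestor in turn.
 * The leaf's own counter is left untouched here; resetting it is a
 * separate step in the real code.
 */
void
propagate_changes(TableStats *tables, size_t ntables, size_t leaf)
{
	assert(leaf < ntables);

	long		delta = tables[leaf].changes_since_analyze;

	for (int p = tables[leaf].parent; p != NO_PARENT; p = tables[p].parent)
		tables[p].changes_since_analyze += delta;
}
```

For a tree root -> sub -> leaf, analyzing the leaf with 7 pending changes would leave both sub and root with 7 more changes each under this model.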

Thank you for updating the patch. I tested v4 patch.

After analyze or autoanalyze on a partitioned table, n_live_tup and
n_dead_tup are updated. However, TRUNCATE and VACUUM on the
partitioned table don't change these values until analyze or
autoanalyze is invoked, whereas for normal tables these values are
reset or changed. For example, with your patch:

* Before
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |         11 |          0 |                   0
 c2      |         11 |          0 |                   0
 c3      |         11 |          0 |                   0
 c4      |         11 |          0 |                   0
 c5      |         11 |          0 |                   0
 parent  |         55 |          0 |                   0
(6 rows)

* After 'TRUNCATE parent'
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |          0 |          0 |                   0
 c2      |          0 |          0 |                   0
 c3      |          0 |          0 |                   0
 c4      |          0 |          0 |                   0
 c5      |          0 |          0 |                   0
 parent  |         55 |          0 |                   0
(6 rows)

* Before
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |          0 |         11 |                   0
 c2      |          0 |         11 |                   0
 c3      |          0 |         11 |                   0
 c4      |          0 |         11 |                   0
 c5      |          0 |         11 |                   0
 parent  |          0 |         55 |                   0
(6 rows)

* After 'VACUUM parent'
 relname | n_live_tup | n_dead_tup | n_mod_since_analyze
---------+------------+------------+---------------------
 c1      |          0 |          0 |                   0
 c2      |          0 |          0 |                   0
 c3      |          0 |          0 |                   0
 c4      |          0 |          0 |                   0
 c5      |          0 |          0 |                   0
 parent  |          0 |         55 |                   0
(6 rows)

We can make it work correctly, but I think that as a first step we can
skip updating the statistics of partitioned tables other than
n_mod_since_analyze. If we also supported n_live_tup and n_dead_tup,
users might be confused that other statistics such as seq_scan and
seq_tup_read are still not supported.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19Amit Langote
amitlangote09@gmail.com
In reply to: Masahiko Sawada (#18)
Re: Autovacuum on partitioned table

On Fri, Feb 21, 2020 at 4:47 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

We can make it work correctly, but I think that as a first step we can
skip updating the statistics of partitioned tables other than
n_mod_since_analyze. If we also supported n_live_tup and n_dead_tup,
users might be confused that other statistics such as seq_scan and
seq_tup_read are still not supported.

+1, that makes sense.

Thanks,
Amit

#20yuzuko
yuzukohosoya@gmail.com
In reply to: Amit Langote (#19)
1 attachment(s)
Re: Autovacuum on partitioned table

Hi,

Thanks for reviewing the patch.

We can make it work correctly, but I think that as a first step we can
skip updating the statistics of partitioned tables other than
n_mod_since_analyze. If we also supported n_live_tup and n_dead_tup,
users might be confused that other statistics such as seq_scan and
seq_tup_read are still not supported.

+1, that makes sense.

Yes, indeed. I modified it so that statistics other than
n_mod_since_analyze are not updated.
I've attached the v5 patch. In this patch, pgstat_report_analyze() always reports 0 for
msg.m_live_tuples and msg.m_dead_tuples when the relation is a partitioned table.

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v5_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 4a2b6f0dae..88c635f82f 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1310,8 +1310,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..20183a96a4 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -227,7 +227,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -379,7 +379,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1586,13 +1586,12 @@ build_reloptions(Datum reloptions, bool validate,
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aafcf9..161abb6450 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -584,7 +584,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -604,7 +604,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c4420ddd7f..bbfdf75dbe 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,15 +655,15 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  If the table is a
+	 * partition, report changes_since_analyze of its parent because
+	 * autovacuum process for partitioned tables needs it.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
-		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+						  (va_cols == NIL));
 
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6d1f28c327..aabb1903de 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2031,11 +2032,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2058,7 +2059,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2720,6 +2722,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2990,6 +2993,42 @@ relation_needs_vacanalyze(Oid relid,
 	AssertArg(OidIsValid(relid));
 
 	/*
+	 * If the relation is a partitioned table, we must add up children's
+	 * reltuples.
+	 */
+	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *children;
+		ListCell *lc;
+
+		reltuples = 0;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, children)
+		{
+			Oid        childOID = lfirst_oid(lc);
+			HeapTuple  childtuple;
+			Form_pg_class childclass;
+
+			/* Ignore the parent table */
+			if (childOID == relid)
+				continue;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+			/* Skip foreign partitions */
+			if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Sum up the child's reltuples for its parent table */
+			reltuples += childclass->reltuples;
+		}
+	}
+
+	/*
 	 * Determine vacuum/analyze equation parameters.  We have two possible
 	 * sources: the passed reloptions (which could be a main table or a toast
 	 * table), or the autovacuum GUC variables.
@@ -3056,7 +3095,8 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
 		vactuples = tabentry->n_dead_tuples;
 		anltuples = tabentry->changes_since_analyze;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 59dc4f31ab..3071487c47 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -67,6 +68,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 /* ----------
@@ -322,6 +324,7 @@ static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, in
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partanalyze(PgStat_MsgPartAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1463,17 +1466,91 @@ pgstat_report_analyze(Relation rel,
 		deadtuples = Max(deadtuples, 0);
 	}
 
+	/*
+	 * If the table is a leaf partition, tell the stats collector its parent's
+	 * changes_since_analyze for auto analyze
+	 */
+	if (IsAutoVacuumWorkerProcess() &&
+		rel->rd_rel->relispartition &&
+		!(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+	{
+		Oid      parentoid;
+		Relation parentrel;
+		PgStat_StatDBEntry *dbentry;
+		PgStat_StatTabEntry *tabentry;
+
+		/* Get its parent table's Oid and relation */
+		parentoid = get_partition_parent(RelationGetRelid(rel));
+		parentrel = table_open(parentoid, AccessShareLock);
+
+		/* Fetch the pgstat for this table */
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+		tabentry = pgstat_get_tab_entry(dbentry, RelationGetRelid(rel), true);
+
+		/* Report changes_since_analyze to the stats collector */
+		pgstat_report_partanalyze(parentrel, tabentry->changes_since_analyze);
+
+		table_close(parentrel, AccessShareLock);
+	}
+
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
 	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
 	msg.m_resetcounter = resetcounter;
 	msg.m_analyzetime = GetCurrentTimestamp();
-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0  /* partitioned tables don't have any data, so it's 0 */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* partitioned tables don't have any data, so it's 0 */
+		: deadtuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/* --------
+ * pgstat_report_partanalyze() -
+ *
+ *	Tell the collector about the parent table of which partition just analyzed.
+ *
+ * Caller must provide a child's changes_since_analyze as a parents.
+ * --------
+ */
+void
+pgstat_report_partanalyze(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartAnalyze msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	/*
+	 * If the partitioned table is also a partition, tell the stats collector
+	 * its parent's changes_since_analyze for auto analyze
+	 */
+	if (rel->rd_rel->relispartition)
+	{
+		Oid      parentoid;
+		Relation parentrel;
+
+		/* Get its parent table's Oid and relation */
+		parentoid = get_partition_parent(RelationGetRelid(rel));
+		parentrel = table_open(parentoid, AccessShareLock);
+
+		/* Report changes_since_analyze to the stats collector */
+		pgstat_report_partanalyze(parentrel, changed_tuples);
+
+		table_close(parentrel, AccessShareLock);
+	}
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTANALYZE);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
 	pgstat_send(&msg, sizeof(msg));
 }
 
+
 /* --------
  * pgstat_report_recovery_conflict() -
  *
@@ -1749,6 +1826,7 @@ pgstat_initstats(Relation rel)
 	/* We only count stats for things that have storage */
 	if (!(relkind == RELKIND_RELATION ||
 		  relkind == RELKIND_MATVIEW ||
+		  relkind == RELKIND_PARTITIONED_TABLE ||
 		  relkind == RELKIND_INDEX ||
 		  relkind == RELKIND_TOASTVALUE ||
 		  relkind == RELKIND_SEQUENCE))
@@ -4592,6 +4670,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTANALYZE:
+					pgstat_recv_partanalyze(&msg.msg_partanalyze, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6239,6 +6321,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partanalyze(PgStat_MsgPartAnalyze *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a65a51696..9bb872b171 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -57,6 +57,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTANALYZE,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_FUNCSTAT,
@@ -389,6 +390,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartAnalyze		Sent by the backend or autovacuum daemon
+ *                              after ANALYZE for partitioned tables
+ * ----------
+ */
+typedef struct PgStat_MsgPartAnalyze
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartAnalyze;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -562,6 +575,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartAnalyze msg_partanalyze;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgFuncstat msg_funcstat;
@@ -1267,7 +1281,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partanalyze(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f8256f7..7e9f6de9cb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1789,7 +1789,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2117,7 +2117,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#21Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: yuzuko (#20)
Re: Autovacuum on partitioned table

On Wed, 26 Feb 2020 at 11:33, yuzuko <yuzukohosoya@gmail.com> wrote:

Hi,

Thanks for reviewing the patch.

We can make it work correctly, but I think that as a first step we can
skip updating the statistics of partitioned tables other than
n_mod_since_analyze. If we also supported n_live_tup and n_dead_tup,
users might be confused that other statistics such as seq_scan and
seq_tup_read are still not supported.

+1, that makes sense.

Yes, indeed. I modified it so that statistics other than
n_mod_since_analyze are not updated.
I've attached the v5 patch. In this patch, pgstat_report_analyze() always reports 0 for
msg.m_live_tuples and msg.m_dead_tuples when the relation is a partitioned table.

Thank you for updating the patch. I'll look at it. I'd recommend
registering this patch for the next commitfest so that it isn't forgotten.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: yuzuko (#20)
Re: Autovacuum on partitioned table

Hello Yuzuko,

+	 * Report ANALYZE to the stats collector, too.  If the table is a
+	 * partition, report changes_since_analyze of its parent because
+	 * autovacuum process for partitioned tables needs it.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
*/
-	if (!inh)
-		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+						  (va_cols == NIL));

Hmm, I think the comment has a bug: it says "report ... of its parent"
but the report is of the same rel. (The pgstat_report_analyze line is
mis-indented also).

/*
+	 * If the relation is a partitioned table, we must add up children's
+	 * reltuples.
+	 */
+	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *children;
+		ListCell *lc;
+
+		reltuples = 0;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, children)
+		{
+			Oid        childOID = lfirst_oid(lc);
+			HeapTuple  childtuple;
+			Form_pg_class childclass;
+
+			/* Ignore the parent table */
+			if (childOID == relid)
+				continue;

I think this loop counts partitioned partitions multiple times, because
we add up reltuples for all levels, no? (If I'm wrong, that is, if
a partitioned rel does not have reltuples, then why skip the parent?)

+	/*
+	 * If the table is a leaf partition, tell the stats collector its parent's
+	 * changes_since_analyze for auto analyze
+	 */
+	if (IsAutoVacuumWorkerProcess() &&
+		rel->rd_rel->relispartition &&
+		!(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))

I'm not sure I understand why we do this only on autovac. Why not all
analyzes?

+	{
+		Oid      parentoid;
+		Relation parentrel;
+		PgStat_StatDBEntry *dbentry;
+		PgStat_StatTabEntry *tabentry;
+
+		/* Get its parent table's Oid and relation */
+		parentoid = get_partition_parent(RelationGetRelid(rel));
+		parentrel = table_open(parentoid, AccessShareLock);

Climbing up the partitioning hierarchy acquiring locks on ancestor
relations opens up for deadlocks. It's better to avoid that. (As a
test, you could try what happens if you lock the topmost relation with
access-exclusive and leave a transaction open, then have autoanalyze
run). At the same time, I wonder if it's sensible to move one level up
here, and also have pgstat_report_partanalyze move more levels up.

+ * pgstat_report_partanalyze() -
+ *
+ *	Tell the collector about the parent table of which partition just analyzed.
+ *
+ * Caller must provide a child's changes_since_analyze as a parents.

I'm not sure what the last line is trying to say.

Thanks,

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#23Amit Langote
amitlangote09@gmail.com
In reply to: yuzuko (#20)
Re: Autovacuum on partitioned table

Hosoya-san,

Thanks for the new patch.

On Wed, Feb 26, 2020 at 11:33 AM yuzuko <yuzukohosoya@gmail.com> wrote:

Attached is the v5 patch. In this patch, pgstat_report_analyze() always reports 0 as
msg.m_live_tuples and m_dead_tuples when the relation is partitioned.

Some comments:

+ * PgStat_MsgPartAnalyze        Sent by the backend or autovacuum daemon
+ *                              after ANALYZE for partitioned tables

Looking at the way this message is used, it does not seem to be an
"analyze" message and also it's not sent "after ANALYZE of partitioned
tables", but really after ANALYZE of leaf partitions. Analyze (for
both partitioned tables and leaf partitions) is reported as a
PgStat_MsgAnalyze message as before. It seems that
PgStat_MsgPartAnalyze is only sent to update a leaf partition's
parent's (and recursively any grandparents') changes_since_analyze
counters, so maybe we should find a different name for it. Maybe,
PgStat_MsgPartChanges and accordingly the message type enum value.

     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
-     * auto-analyze to do.
+     * Report ANALYZE to the stats collector, too.  If the table is a
+     * partition, report changes_since_analyze of its parent because
+     * autovacuum process for partitioned tables needs it.  Reset the
+     * changes_since_analyze counter only if we analyzed all columns;
+     * otherwise, there is still work for auto-analyze to do.
      */

The new comment says "partitions", which we typically use to refer to
a child table, but this comment really talks about parent tables. Old
comment says we don't report "inherited stats", presumably because
stats collector lacks the infrastructure to distinguish a table's
inherited stats and own stats, at least in the case of traditional
inheritance. With this patch, we are making an exception for
partitioned tables, because we are also teaching the stats collector
to maintain at least changes_since_analyze for them that accumulates
counts of changed tuples from partitions.
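
As an editorial illustration of why the accumulated counter matters, here is a minimal standalone sketch (not code from the patch) of the documented auto-analyze rule: a table is analyzed once its changes since the last analyze exceed base threshold + scale factor * reltuples. The function name needs_analyze_sketch and its parameters are hypothetical; for a partitioned table, reltuples is assumed to be the sum over its partitions, as the patch computes it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether auto-analyze should fire, using the
 * rule  threshold = base_threshold + scale_factor * reltuples.
 * For a partitioned table, reltuples is assumed to be the summed reltuples
 * of its partitions and changes_since_analyze the accumulated counter.
 */
static bool
needs_analyze_sketch(double changes_since_analyze,
					 double reltuples,
					 double base_threshold,
					 double scale_factor)
{
	double		threshold;

	/* All inputs are counts or fractions, so negative values are invalid. */
	assert(reltuples >= 0 && base_threshold >= 0 && scale_factor >= 0);

	threshold = base_threshold + scale_factor * reltuples;
	return changes_since_analyze > threshold;
}
```

With the default settings (autovacuum_analyze_threshold = 50, autovacuum_analyze_scale_factor = 0.1) and 200 tuples across the partitions, the threshold is 70, so a parent whose counter never accumulates changes from its partitions would never qualify.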

It seems Alvaro already reported some of the other issues I had with
the patch, such as why partanalyze messages are only sent from an
autovacuum worker.

Thanks,
Amit

#24Amit Langote
amitlangote09@gmail.com
In reply to: Alvaro Herrera (#22)
Re: Autovacuum on partitioned table

On Fri, Feb 28, 2020 at 11:25 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

/*
+      * If the relation is a partitioned table, we must add up children's
+      * reltuples.
+      */
+     if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+     {
+             List     *children;
+             ListCell *lc;
+
+             reltuples = 0;
+
+             /* Find all members of inheritance set taking AccessShareLock */
+             children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+             foreach(lc, children)
+             {
+                     Oid        childOID = lfirst_oid(lc);
+                     HeapTuple  childtuple;
+                     Form_pg_class childclass;
+
+                     /* Ignore the parent table */
+                     if (childOID == relid)
+                             continue;

I think this loop counts partitioned partitions multiple times, because
we add up reltuples for all levels, no? (If I'm wrong, that is, if
a partitioned rel does not have reltuples, then why skip the parent?)

+1, no need to skip partitioned tables here as their reltuples is always 0.
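
A minimal standalone sketch of this point (editorial, not PostgreSQL code): if the flattened inheritance set contains every level of the tree, and partitioned tables carry reltuples = 0 as stated above, then summing over all members counts each leaf exactly once. The types RelKindSketch and RelInfoSketch and the function sum_reltuples are hypothetical stand-ins for the pg_class fields the loop reads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relkind values the loop cares about. */
typedef enum
{
	REL_PLAIN,
	REL_PARTITIONED,
	REL_FOREIGN
} RelKindSketch;

/* Hypothetical stand-in for the pg_class fields read per child. */
typedef struct
{
	RelKindSketch kind;
	double		reltuples;
} RelInfoSketch;

/*
 * Sum reltuples over a flattened inheritance set (root, intermediate
 * partitioned tables and leaves alike), skipping only foreign partitions.
 * Partitioned members contribute 0, so no explicit skip is needed for them.
 */
static double
sum_reltuples(const RelInfoSketch *rels, size_t nrels)
{
	double		total = 0;

	assert(rels != NULL || nrels == 0);

	for (size_t i = 0; i < nrels; i++)
	{
		if (rels[i].kind == REL_FOREIGN)
			continue;
		total += rels[i].reltuples;
	}
	return total;
}
```

For a root with one sub-partitioned child, two leaves of 100 and 50 tuples and one foreign partition, the result is 150 regardless of whether the partitioned members are skipped.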

+     /*
+      * If the table is a leaf partition, tell the stats collector its parent's
+      * changes_since_analyze for auto analyze

Maybe write:

For a leaf partition, add its current changes_since_analyze into its
ancestors' counts. This must be done before sending the ANALYZE
message as it resets the partition's changes_since_analyze counter.

+      */
+     if (IsAutoVacuumWorkerProcess() &&
+             rel->rd_rel->relispartition &&
+             !(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))

I'm not sure I understand why we do this only on autovac. Why not all
analyzes?

+1. If there is a reason, it should at least be documented in the
comment above.

+     {
+             Oid      parentoid;
+             Relation parentrel;
+             PgStat_StatDBEntry *dbentry;
+             PgStat_StatTabEntry *tabentry;
+
+             /* Get its parent table's Oid and relation */
+             parentoid = get_partition_parent(RelationGetRelid(rel));
+             parentrel = table_open(parentoid, AccessShareLock);

Climbing up the partitioning hierarchy acquiring locks on ancestor
relations opens up for deadlocks. It's better to avoid that. (As a
test, you could try what happens if you lock the topmost relation with
access-exclusive and leave a transaction open, then have autoanalyze
run). At the same time, I wonder if it's sensible to move one level up
here, and also have pgstat_report_partanalyze move more levels up.

Maybe fetch all ancestors here and process from the top. But as we'd
have locked the leaf partition long before we got here, maybe we
should lock ancestors even before we start analyzing the leaf
partition? AccessShareLock should be enough on the ancestors because
we're not actually analyzing them.

(It appears get_partition_ancestors() returns a list where the root
parent is the last element, so need to be careful with that.)
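
To make the ordering caveat concrete, a small standalone sketch (editorial; OidSketch and ancestors_root_first are hypothetical names, and plain arrays stand in for PostgreSQL's List): given ancestors ordered nearest parent first and root last, it produces the root-first order one would want when locking from the top of the hierarchy down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PostgreSQL's Oid type. */
typedef unsigned int OidSketch;

/*
 * Copy an ancestor list that is ordered nearest-parent-first (root last)
 * into out[] in root-first order.  out[] must have room for n entries and
 * must not overlap ancestors[].
 */
static void
ancestors_root_first(const OidSketch *ancestors, size_t n, OidSketch *out)
{
	assert((ancestors != NULL && out != NULL) || n == 0);

	for (size_t i = 0; i < n; i++)
		out[i] = ancestors[n - 1 - i];
}
```

Processing out[] from index 0 then visits the root first and the immediate parent last.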

Thanks,
Amit

#25yuzuko
yuzukohosoya@gmail.com
In reply to: Amit Langote (#24)
Re: Autovacuum on partitioned table

Hello,

Thank you for reviewing.

+      */
+     if (IsAutoVacuumWorkerProcess() &&
+             rel->rd_rel->relispartition &&
+             !(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))

I'm not sure I understand why we do this only on autovac. Why not all
analyzes?

+1. If there is a reason, it should at least be documented in the
comment above.

When we analyze a partitioned table with the ANALYZE command,
all inheritors, including the partitioned table itself, are analyzed
at the same time. In this case, if we call pgstat_report_partanalyze,
the partitioned table's changes_since_analyze is updated
according to the number of analyzed tuples of its partitions,
as follows. But I think it should be 0.

\d+ p
                           Partitioned table "public.p"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
Partition key: RANGE (i)
Partitions: p_1 FOR VALUES FROM (0) TO (100),
            p_2 FOR VALUES FROM (100) TO (200)

insert into p select * from generate_series(0,199);
INSERT 0 200

(before analyze)
-[ RECORD 1 ]-------+------------------
relname | p
n_mod_since_analyze | 0
-[ RECORD 2 ]-------+------------------
relname | p_1
n_mod_since_analyze | 100
-[ RECORD 3 ]-------+------------------
relname | p_2
n_mod_since_analyze | 100

(after analyze)
-[ RECORD 1 ]-------+------------------
relname | p
n_mod_since_analyze | 200
-[ RECORD 2 ]-------+------------------
relname | p_1
n_mod_since_analyze | 0
-[ RECORD 3 ]-------+------------------
relname | p_2
n_mod_since_analyze | 0

I think this problem can be fixed if we analyze the partition tree
in order from the leaf partitions to the root table.
What do you think about it?
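
An editorial sketch of why the order matters, under a simplified model of the counters (not the actual stats collector): reporting ANALYZE for a leaf first adds its changes_since_analyze to every ancestor and then resets the leaf, while reporting ANALYZE for a partitioned table only resets its own counter. TabStatSketch, report_analyze and analyze_in_order are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PARENT (-1)

/* Hypothetical, simplified per-table stats entry. */
typedef struct
{
	int			parent;		/* index of the parent entry, or NO_PARENT */
	int			is_leaf;	/* nonzero for leaf partitions */
	long		changes_since_analyze;
} TabStatSketch;

/*
 * Model of reporting ANALYZE for tabs[idx]: a leaf first adds its counter
 * to all ancestors, then every table resets its own counter.
 */
static void
report_analyze(TabStatSketch *tabs, size_t ntabs, size_t idx)
{
	assert(idx < ntabs);

	if (tabs[idx].is_leaf)
	{
		for (int a = tabs[idx].parent; a != NO_PARENT; a = tabs[a].parent)
			tabs[a].changes_since_analyze += tabs[idx].changes_since_analyze;
	}
	tabs[idx].changes_since_analyze = 0;
}

/* Analyze the tables in the given order of indexes. */
static void
analyze_in_order(TabStatSketch *tabs, size_t ntabs,
				 const size_t *order, size_t norder)
{
	for (size_t i = 0; i < norder; i++)
		report_analyze(tabs, ntabs, order[i]);
}
```

With p as the root and p_1, p_2 holding 100 changes each, analyzing in the order p, p_1, p_2 leaves p at 200 (the result shown above), whereas the order p_1, p_2, p leaves all three counters at 0.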

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#26yuzuko
yuzukohosoya@gmail.com
In reply to: yuzuko (#25)
1 attachment(s)
Re: Autovacuum on partitioned table

Hello,

+      */
+     if (IsAutoVacuumWorkerProcess() &&
+             rel->rd_rel->relispartition &&
+             !(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))

I'm not sure I understand why we do this only on autovac. Why not all
analyzes?

+1. If there is a reason, it should at least be documented in the
comment above.

When we analyze a partitioned table with the ANALYZE command,
all inheritors, including the partitioned table itself, are analyzed
at the same time. In this case, if we call pgstat_report_partanalyze,
the partitioned table's changes_since_analyze is updated
according to the number of analyzed tuples of its partitions,
as follows. But I think it should be 0.

\d+ p
                           Partitioned table "public.p"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
Partition key: RANGE (i)
Partitions: p_1 FOR VALUES FROM (0) TO (100),
            p_2 FOR VALUES FROM (100) TO (200)

insert into p select * from generate_series(0,199);
INSERT 0 200

(before analyze)
-[ RECORD 1 ]-------+------------------
relname | p
n_mod_since_analyze | 0
-[ RECORD 2 ]-------+------------------
relname | p_1
n_mod_since_analyze | 100
-[ RECORD 3 ]-------+------------------
relname | p_2
n_mod_since_analyze | 100

(after analyze)
-[ RECORD 1 ]-------+------------------
relname | p
n_mod_since_analyze | 200
-[ RECORD 2 ]-------+------------------
relname | p_1
n_mod_since_analyze | 0
-[ RECORD 3 ]-------+------------------
relname | p_2
n_mod_since_analyze | 0

I think this problem can be fixed if we analyze the partition tree
in order from the leaf partitions to the root table.
What do you think about it?

Attached is a new patch that fixes the above problem. This patch also
includes modifications according to all the comments Alvaro and Amit
made earlier in this thread.

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v6_autovacuum_on_partitioned_table.patchapplication/octet-stream; name=v6_autovacuum_on_partitioned_table.patchDownload
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 4a2b6f0dae..88c635f82f 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1310,8 +1310,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..20183a96a4 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -227,7 +227,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -379,7 +379,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1586,13 +1586,12 @@ build_reloptions(Datum reloptions, bool validate,
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aafcf9..161abb6450 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -584,7 +584,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -604,7 +604,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c4420ddd7f..b6cbdf3471 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,13 +655,14 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.  Also,
+	 * if the table is a leaf partition, we add its current
+	 * changes_since_analyze into its ancestors' counts because
+	 * autovacuum process for partitioned tables needs it.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d625d17bf4..546c35652d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -828,9 +828,9 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				 * later.
 				 */
 				oldcontext = MemoryContextSwitchTo(vac_context);
-				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
-															  part_oid,
-															  vrel->va_cols));
+				vacrels = lcons(makeVacuumRelation(NULL, part_oid, vrel->va_cols),
+							    vacrels);
+
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6d1f28c327..eed391f3cd 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2031,11 +2032,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2058,7 +2059,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2720,6 +2722,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2990,6 +2993,38 @@ relation_needs_vacanalyze(Oid relid,
 	AssertArg(OidIsValid(relid));
 
 	/*
+	 * If the relation is a partitioned table, we must add up children's
+	 * reltuples.
+	 */
+	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *children;
+		ListCell *lc;
+
+		reltuples = 0;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, children)
+		{
+			Oid        childOID = lfirst_oid(lc);
+			HeapTuple  childtuple;
+			Form_pg_class childclass;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+			/* Skip foreign partitions */
+			if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Sum up the child's reltuples for its parent table */
+			reltuples += childclass->reltuples;
+		}
+	}
+
+	/*
 	 * Determine vacuum/analyze equation parameters.  We have two possible
 	 * sources: the passed reloptions (which could be a main table or a toast
 	 * table), or the autovacuum GUC variables.
@@ -3056,7 +3091,8 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
 		vactuples = tabentry->n_dead_tuples;
 		anltuples = tabentry->changes_since_analyze;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 59dc4f31ab..d0c9e14403 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -67,6 +68,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 /* ----------
@@ -322,6 +324,7 @@ static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, in
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
@@ -1463,18 +1466,76 @@ pgstat_report_analyze(Relation rel,
 		deadtuples = Max(deadtuples, 0);
 	}
 
+	/*
+	 * For a leaf partition, add its current changes_since_analyze
+	 * into its ancestors' counts.  This must be done before sending
+	 * the ANALYZE message as it resets the partition's changes_since_analyze
+	 * counter.
+	 */
+	if (rel->rd_rel->relispartition &&
+		!(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+
+	{
+		List     *ancestors;
+		ListCell *lc;
+		PgStat_StatDBEntry *dbentry;
+		PgStat_StatTabEntry *tabentry;
+
+		/* Fetch the pgstat for this table */
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+		tabentry = pgstat_get_tab_entry(dbentry, RelationGetRelid(rel), true);
+
+		/* Get its all ancestors */
+		ancestors = get_partition_ancestors(RelationGetRelid(rel));
+		foreach(lc, ancestors)
+		{
+			Oid    parentOid = lfirst_oid(lc);
+			Relation parentrel = table_open(parentOid, AccessShareLock);
+
+			/* Report changes_since_analyze to the stats collector */
+			pgstat_report_partchanges(parentrel, tabentry->changes_since_analyze);
+
+			table_close(parentrel, AccessShareLock);
+		}
+	}
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
 	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
 	msg.m_resetcounter = resetcounter;
 	msg.m_analyzetime = GetCurrentTimestamp();
-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0  /* partitioned tables don't have any data, so it's 0 */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* partitioned tables don't have any data, so it's 0 */
+		: deadtuples;
 	pgstat_send(&msg, sizeof(msg));
 }
 
 /* --------
+ * pgstat_report_partchanges() -
+ *
+ *	Tell the collector about the parent table of which partition just analyzed.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1749,6 +1810,7 @@ pgstat_initstats(Relation rel)
 	/* We only count stats for things that have storage */
 	if (!(relkind == RELKIND_RELATION ||
 		  relkind == RELKIND_MATVIEW ||
+		  relkind == RELKIND_PARTITIONED_TABLE ||
 		  relkind == RELKIND_INDEX ||
 		  relkind == RELKIND_TOASTVALUE ||
 		  relkind == RELKIND_SEQUENCE))
@@ -4592,6 +4654,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6239,6 +6305,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a65a51696..b979ed32d4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -57,6 +57,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_FUNCSTAT,
@@ -389,6 +390,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges		Sent by the autovacuum daemon
+ *                              after ANALYZE of leaf partitions
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -562,6 +575,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgFuncstat msg_funcstat;
@@ -1267,7 +1281,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f8256f7..7e9f6de9cb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1789,7 +1789,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2117,7 +2117,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 0cfe28e63f..abf5be31ca 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -192,9 +192,9 @@ VACUUM (FULL) vacparted;
 VACUUM (FREEZE) vacparted;
 -- check behavior with duplicate column mentions
 VACUUM ANALYZE vacparted(a,b,a);
-ERROR:  column "a" of relation "vacparted" appears more than once
+ERROR:  column "a" of relation "vacparted1" appears more than once
 ANALYZE vacparted(a,b,b);
-ERROR:  column "b" of relation "vacparted" appears more than once
+ERROR:  column "b" of relation "vacparted1" appears more than once
 -- multiple tables specified
 VACUUM vaccluster, vactst;
 VACUUM vacparted, does_not_exist;
@@ -213,7 +213,7 @@ ANALYZE vacparted (b), vactst;
 ANALYZE vactst, does_not_exist, vacparted;
 ERROR:  relation "does_not_exist" does not exist
 ANALYZE vactst (i), vacparted (does_not_exist);
-ERROR:  column "does_not_exist" of relation "vacparted" does not exist
+ERROR:  column "does_not_exist" of relation "vacparted1" does not exist
 ANALYZE vactst, vactst;
 BEGIN;  -- ANALYZE behaves differently inside a transaction block
 ANALYZE vactst, vactst;
@@ -287,24 +287,24 @@ WARNING:  skipping "pg_authid" --- only superuser can vacuum it
 -- independently.
 VACUUM vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can vacuum it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
 ANALYZE vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can analyze it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
 VACUUM (ANALYZE) vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can vacuum it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part2;
@@ -357,22 +357,22 @@ ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
 ALTER TABLE vacowned_part1 OWNER TO CURRENT_USER;
 SET ROLE regress_vacuum;
 VACUUM vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
 ANALYZE vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
 VACUUM (ANALYZE) vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part2;
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: yuzuko (#26)
Re: Autovacuum on partitioned table

On 2020-Mar-18, yuzuko wrote:

I think if we analyze partition tree in order from leaf partitions
to root table, this problem can be fixed.
What do you think about it?

The attached new patch fixes the above problem.

Thanks for the new version.

I'm confused about some error messages in the regression test when a
column is mentioned twice, that changed from mentioning the table named
in the vacuum command, to mentioning the first partition. Is that
because you changed an lappend() to lcons()? I think you do this so
that the counters accumulate for the topmost parent that will be
processed at the end. I'm not sure I like that too much ... I think
that needs more thought.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#28Justin Pryzby
pryzby@telsasoft.com
In reply to: yuzuko (#26)
Re: Autovacuum on partitioned table (autoanalyze)

Regarding this patch:

+ * the ANALYZE message as it resets the partition's changes_since_analze
=> analyze

+ * If the relation is a partitioned table, we must add up children's
childrens'

The approach in general:

I see an issue for timeseries data, where only the most recent partition is
being inserted into, and the histogram endpoint is being continuously extended
(this is our use-case). The most recent partition will be analyzed pretty
often, and I think it'll be problematic if its parent doesn't get similar
treatment. Let's say there are 12 historic, monthly children with 1e6 tuples
each, and the 13th child has 2e5 tuples (6 days into the month). It's analyzed
when it grows by 20% (1.2 days), but at that point the parent has only grown by
12x less (~2%) and won't be analyzed until 12x further into the future (14
days). Its histogram is 12x longer (geometrically), but the histogram changed
by just as much (arithmetically). That's an issue for a query over "the last
few days"; if that's past the end of the histogram bound, the query planner
will estimate about ~0 tuples, and tend to give cascades of nested loops. I'm
biased, but I'm guessing that's too common a use case to answer that the proper
fix is to set the parent's analyze_scale_factor=0.0005. I think that suggests
that the parent might sometimes need to be analyzed every time any of its
children are. In other cases (like probably any hash partitioning), that'd be
excessive, and maybe the default settings shouldn't do that, but I think that
behavior ought to be possible, and I think this patch doesn't allow that.

In the past, I think there was talk that maybe someone would invent a clever
way to dynamically combine all the partitions' statistics, so analyzing the
parent wasn't needed. I think that's easy enough for reltuples, MCV, and I
think histogram, but ISTM that ndistinct is simultaneously important to get
right and hard to do so. It depends on whether it's the partition key, which
now can be an arbitrary expression. Extended stats further complicates it,
even if we didn't aim to dynamically compute extended stats for a parent.

While writing this, it occurred to me that we could use "CREATE STATISTICS" as a
way to mark a partitioned table (or certain columns) as needing to be handled
by analyze. I understand "CREATE STATISTICS" was intended to (eventually) allow
implementing stats on expressions without using "create index" as a hack. So
if it's excessive to automatically analyze a parent table when any of its
children are analyzed, maybe it's less excessive to only do that for parents
with a stats object, and only on the given columns. I realize this patch is
a lot less useful if it requires doing anything extra/nondefault, and it's
desirable to work without creating a stats object at all. Also, using CREATE
STATISTICS would reduce the CPU cost of re-analyzing the entire hierarchy, but
doesn't help to reduce the I/O cost, which is significant.

--
Justin

#29yuzuko
yuzukohosoya@gmail.com
In reply to: Alvaro Herrera (#27)
1 attachment(s)
Re: Autovacuum on partitioned table

Hi Alvaro,
Thank you for your comments.

I'm confused about some error messages in the regression test when a
column is mentioned twice, that changed from mentioning the table named
in the vacuum command, to mentioning the first partition. Is that
because you changed an lappend() to lcons()? I think you do this so
that the counters accumulate for the topmost parent that will be
processed at the end. I'm not sure I like that too much ... I think
that needs more thought.

I couldn't come up with a solution that counts changes_since_analyze
precisely when analyzing partition trees with the ANALYZE command based on
this approach (updating all ancestors' changes_since_analyze according to the
number of analyzed tuples of leaf partitions).

So I tried another approach to run autovacuum on partitioned tables.
In this approach, all ancestors' changed_tuples are updated when committing
transactions (at AtEOXact_PgStat) according to the number of inserted/updated/
deleted tuples of leaf partitions.
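
To make the idea concrete, here is a minimal, self-contained sketch of the
commit-time propagation described above. It is not the patch's code:
TableCounter, find_counter, propagate_to_ancestors and INVALID_ID are
hypothetical names invented for illustration, and the flat array only stands
in for the backend's per-table stats entries.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_ID 0u            /* hypothetical terminator, playing the role of InvalidOid */

/* Hypothetical stand-in for a per-table stats entry. */
typedef struct TableCounter
{
    unsigned    id;              /* table identifier */
    long        changed_tuples;  /* changes accumulated since the last analyze */
} TableCounter;

/* Look up a counter by table id; returns NULL when the id is unknown. */
TableCounter *
find_counter(TableCounter *counters, size_t n, unsigned id)
{
    for (size_t i = 0; i < n; i++)
        if (counters[i].id == id)
            return &counters[i];
    return NULL;
}

/*
 * At commit, add the leaf partition's delta (inserted + updated + deleted)
 * to every ancestor listed in the INVALID_ID-terminated array.
 */
void
propagate_to_ancestors(TableCounter *counters, size_t n,
                       const unsigned *ancestors, long delta)
{
    assert(ancestors != NULL);

    for (size_t i = 0; ancestors[i] != INVALID_ID; i++)
    {
        TableCounter *entry = find_counter(counters, n, ancestors[i]);

        if (entry != NULL)
            entry->changed_tuples += delta;
    }
}
```

In this sketch only the changes made by the committing transaction are added
to each ancestor, so unrelated tables and the leaf's own counter are left
alone.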

The latest patch is attached. What do you think?
--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v7_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 155866c7c8..046a397ac3 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1322,8 +1322,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 813ea8bfc3..ffda2f45bf 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -585,7 +585,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -605,7 +605,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..de7f1c3bb1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,13 +655,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3a89f8fe1e..8b3cf85389 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -831,6 +831,7 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 															  part_oid,
 															  vrel->va_cols));
+
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..47905380fb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2033,11 +2034,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2060,7 +2061,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2723,6 +2725,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2997,6 +3000,38 @@ relation_needs_vacanalyze(Oid relid,
 	AssertArg(OidIsValid(relid));
 
 	/*
+	 * If the relation is a partitioned table, we must add up children's
+	 * reltuples.
+	 */
+	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *children;
+		ListCell *lc;
+
+		reltuples = 0;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, children)
+		{
+			Oid        childOID = lfirst_oid(lc);
+			HeapTuple  childtuple;
+			Form_pg_class childclass;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+			/* Skip foreign partitions */
+			if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Sum up the child's reltuples for its parent table */
+			reltuples += childclass->reltuples;
+		}
+	}
+
+	/*
 	 * Determine vacuum/analyze equation parameters.  We have two possible
 	 * sources: the passed reloptions (which could be a main table or a toast
 	 * table), or the autovacuum GUC variables.
@@ -3072,7 +3107,8 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9ebde47dea..1740b50352 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -1523,8 +1524,12 @@ pgstat_report_analyze(Relation rel,
 	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
 	msg.m_resetcounter = resetcounter;
 	msg.m_analyzetime = GetCurrentTimestamp();
-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0  /* partitioned tables don't have any data, so it's 0 */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* partitioned tables don't have any data, so it's 0 */
+		: deadtuples;
 	pgstat_send(&msg, sizeof(msg));
 }
 
@@ -1803,6 +1808,7 @@ pgstat_initstats(Relation rel)
 	/* We only count stats for things that have storage */
 	if (!(relkind == RELKIND_RELATION ||
 		  relkind == RELKIND_MATVIEW ||
+		  relkind == RELKIND_PARTITIONED_TABLE ||
 		  relkind == RELKIND_INDEX ||
 		  relkind == RELKIND_TOASTVALUE ||
 		  relkind == RELKIND_SEQUENCE))
@@ -2001,6 +2007,28 @@ pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2022,6 +2050,28 @@ pgstat_count_heap_update(Relation rel, bool hot)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2047,6 +2097,28 @@ pgstat_count_heap_delete(Relation rel)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2201,6 +2273,29 @@ AtEOXact_PgStat(bool isCommit, bool parallel)
 				tabstat->t_counts.t_changed_tuples +=
 					trans->tuples_inserted + trans->tuples_updated +
 					trans->tuples_deleted;
+
+				/*
+				 * If this relation is partitioned, propagate its own
+				 * changed_tuples to their all ancestors.
+				 */
+				if (tabstat->ancestors != NULL)
+				{
+					int i = 0;
+
+					for(;;)
+					{
+						PgStat_TableStatus *entry;
+						Oid                relid = tabstat->ancestors[i];
+
+						if(relid == InvalidOid)
+							break;
+
+						entry = get_tabstat_entry(relid, false);
+						entry->t_counts.t_changed_tuples +=
+							tabstat->t_counts.t_changed_tuples;
+						++i;
+					}
+				}
 			}
 			else
 			{
@@ -6355,7 +6450,6 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
-
 /* ----------
  * pgstat_recv_archiver() -
  *
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..1165d4fb2c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -157,6 +157,7 @@ typedef enum PgStat_Single_Reset_Type
 typedef struct PgStat_TableStatus
 {
 	Oid			t_id;			/* table's OID */
+	Oid        *ancestors;      /* all ancestors */
 	bool		t_shared;		/* is it a shared catalog? */
 	struct PgStat_TableXactStatus *trans;	/* lowest subxact's counts */
 	PgStat_TableCounts t_counts;	/* event counts to be sent */
@@ -404,7 +405,6 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
-
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
  * ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6eec8ec568..c32745a0a2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1790,7 +1790,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2148,7 +2148,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#30Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#28)
Re: Autovacuum on partitioned table (autoanalyze)

Not sure if you saw my earlier message?

I think it ought to be possible to configure this feature such that an
auto-analyze on any child partition would trigger analyze of the parent. I
think that would be important for maintaining accurate stats of the partition
key column for many cases involving RANGE-partitioned tables, which are likely
to rely on histogram rather than MCVs.

On Wed, Mar 18, 2020 at 11:30:39AM -0500, Justin Pryzby wrote:

Regarding this patch:

+ * the ANALYZE message as it resets the partition's changes_since_analze
=> analyze

+ * If the relation is a partitioned table, we must add up children's
childrens'

The approach in general:

I see an issue for timeseries data, where only the most recent partition is
being inserted into, and the histogram endpoint is being continuously extended
(this is our use-case). The most recent partition will be analyzed pretty
often, and I think it'll be problematic if its parent doesn't get similar
treatment. Let's say there are 12 historic, monthly children with 1e6 tuples
each, and the 13th child has 2e5 tuples (6 days into the month). It's analyzed
when it grows by 20% (1.2 days), but at that point the parent has only grown by
12x less (~2%) and won't be analyzed until 12x further into the future (14
days). Its histogram is 12x longer (geometrically), but the histogram changed
by just as much (arithmetically). That's an issue for a query over "the last
few days"; if that's past the end of the histogram bound, the query planner
will estimate about ~0 tuples, and tend to give cascades of nested loops. I'm
biased, but I'm guessing that's too common a use case to answer that the proper
fix is to set the parent's analyze_scale_factor=0.0005. I think that suggests
that the parent might sometimes need to be analyzed every time any of its
children are. In other cases (like probably any hash partitioning), that'd be
excessive, and maybe the default settings shouldn't do that, but I think that
behavior ought to be possible, and I think this patch doesn't allow that.

In the past, I think there was talk that maybe someone would invent a clever
way to dynamically combine all the partitions' statistics, so analyzing the
parent wasn't needed. I think that's easy enough for reltuples, MCV, and I
think histogram, but ISTM that ndistinct is simultaneously important to get
right and hard to do so. It depends on whether it's the partition key, which
now can be an arbitrary expression. Extended stats further complicates it,
even if we didn't aim to dynamically compute extended stats for a parent.

While writing this, it occurred to me that we could use "CREATE STATISTICS" as a
way to mark a partitioned table (or certain columns) as needing to be handled
by analyze. I understand "CREATE STATISTICS" was intended to (eventually) allow
implementing stats on expressions without using "create index" as a hack. So
if it's excessive to automatically analyze a parent table when any of its
children are analyzed, maybe it's less excessive to only do that for parents
with a stats object, and only on the given columns. I realize this patch is
a lot less useful if it requires doing anything extra/nondefault, and it's
desirable to work without creating a stats object at all. Also, using CREATE
STATISTICS would reduce the CPU cost of re-analyzing the entire hierarchy, but
doesn't help to reduce the I/O cost, which is significant.

--
Justin

--
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581

#31yuzuko
yuzukohosoya@gmail.com
In reply to: Justin Pryzby (#30)
Re: Autovacuum on partitioned table (autoanalyze)

Hi Justin,

Thank you for your comments.

On Tue, Apr 7, 2020 at 12:32 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Not sure if you saw my earlier message?

I'm sorry, I didn't notice for a while.

I think it ought to be possible to configure this feature such that an
auto-analyze on any child partition would trigger analyze of the parent. I
think that would be important for maintaining accurate stats of the partition
key column for many cases involving RANGE-partitioned tables, which are likely
to rely on histogram rather than MCVs.

I read your previous email and understand that it would be necessary to analyze
partitioned tables automatically when any of their children are analyzed. In my
first patch, auto-analyze on partitioned tables worked like this, but there were
some comments about the performance of autovacuum, especially when partitioned
tables have a lot of children.

The latest patch lets users set a different autovacuum configuration for
each partitioned table, like this:
create table p3(i int) partition by range(i) with
(autovacuum_analyze_scale_factor=0.0005, autovacuum_analyze_threshold=100);
so users can configure those parameters according to partitioning strategies
and other requirements.

So I think this patch can solve the problem you mentioned.
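
As a rough illustration of what those settings mean, the sketch below applies
the documented auto-analyze formula (analyze threshold + analyze scale factor
* number of tuples). The function name analyze_trigger_point is made up for
this example, and the 12.2 million tuple parent is only an assumed size, with
reltuples for a partitioned table taken as the sum over its children, as in
the patch.

```c
#include <assert.h>

/*
 * Number of changed tuples beyond which auto-analyze would fire:
 *     base_threshold + scale_factor * reltuples
 * Hypothetical helper for illustration; not a PostgreSQL function.
 */
double
analyze_trigger_point(double base_threshold, double scale_factor,
                      double reltuples)
{
    assert(base_threshold >= 0.0);
    assert(scale_factor >= 0.0);
    assert(reltuples >= 0.0);

    return base_threshold + scale_factor * reltuples;
}
```

With the settings above and a parent of 12.2 million tuples, the trigger point
works out to 100 + 0.0005 * 12,200,000 = 6,200 changed tuples; with the stock
defaults (50 and 0.1) it would be 1,220,050.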

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#32Justin Pryzby
pryzby@telsasoft.com
In reply to: yuzuko (#31)
Re: Autovacuum on partitioned table (autoanalyze)

On Thu, Apr 16, 2020 at 06:16:45PM +0900, yuzuko wrote:

I think it ought to be possible to configure this feature such that an
auto-analyze on any child partition would trigger analyze of the parent. I
think that would be important for maintaining accurate stats of the partition
key column for many cases involving RANGE-partitioned tables, which are likely
to rely on histogram rather than MCVs.

I read your previous email and understand that it would be necessary to analyze
partitioned tables automatically when any of their children are analyzed. In my
first patch, auto-analyze on partitioned tables worked like this, but there were
some comments about the performance of autovacuum, especially when partitioned
tables have a lot of children.

I reread that part. There was also confusion between autovacuum vacuum and
autovacuum analyze.

I agree that it *might* be a problem to analyze the parent every time any child
is analyzed.

But it might also be what's needed for this feature to be useful.

The latest patch lets users set a different autovacuum configuration for
each partitioned table, like this:
create table p3(i int) partition by range(i) with
(autovacuum_analyze_scale_factor=0.0005, autovacuum_analyze_threshold=100);
so users can configure those parameters according to partitioning strategies
and other requirements.

So I think this patch can solve the problem you mentioned.

I don't think that adequately allows what's needed.

I think it out to be possible to get the "analyze parent whenever a child is
analyzed" behavior easily, without having to compute new thershold parameters
every time one adds partitions, detaches partitions, loades 10x more data into
one of the partitions, load only 10% as much data into the latest partition,
etc.

For example, say a new customer has a bunch of partitioned tables which each
currently have only one partition (for the current month), and that's expected
to grow to at least 20+ partitions (2+ years of history). How does one set the
partitioned table's auto-analyze parameters to analyze whenever any child is
analyzed? I don't think it should be necessary to update it every month after
computing sum(child tuples).
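
A small sketch of why the value would need recomputing, under simplifying
assumptions: base thresholds are ignored, only the newest partition receives
changes, and the function name is hypothetical. It computes the scale factor a
parent would need in order to reach its trigger point at the same moment as
its newest child.

```c
#include <assert.h>

/*
 * The child triggers after child_scale_factor * newest_child_tuples changes.
 * The parent triggers after parent_scale_factor * parent_tuples changes.
 * Setting the two equal and solving for the parent's scale factor gives the
 * expression returned here.  Hypothetical helper, for illustration only.
 */
double
parent_scale_factor_to_match(double child_scale_factor,
                             double newest_child_tuples,
                             double parent_tuples)
{
    assert(parent_tuples > 0.0);
    assert(newest_child_tuples >= 0.0);
    assert(newest_child_tuples <= parent_tuples);

    return child_scale_factor * newest_child_tuples / parent_tuples;
}
```

With a single 1e6-tuple partition the parent needs the child's own 0.1; once
there are twenty such partitions it needs 0.005, and the value keeps shrinking
as history accumulates.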

Possibly you could allow that behavior for some special values of the
threshold. Like if autovacuum_analyze_threshold=-2, then analyze the parent
whenever any of its children are analyzed.

I think that use case and that need would be common, but I'd like to hear what
others think.

--
Justin

#33Amit Langote
amitlangote09@gmail.com
In reply to: Justin Pryzby (#32)
Re: Autovacuum on partitioned table (autoanalyze)

On Thu, Apr 16, 2020 at 11:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Thu, Apr 16, 2020 at 06:16:45PM +0900, yuzuko wrote:

The latest patch lets users set a different autovacuum configuration for
each partitioned table, like this:
create table p3(i int) partition by range(i) with
(autovacuum_analyze_scale_factor=0.0005, autovacuum_analyze_threshold=100);
so users can configure those parameters according to partitioning strategies
and other requirements.

So I think this patch can solve the problem you mentioned.

I don't think that adequately allows what's needed.

I think it out to be possible to get the "analyze parent whenever a child is
analyzed" behavior easily, without having to compute new thershold parameters
every time one adds partitions, detaches partitions, loades 10x more data into
one of the partitions, load only 10% as much data into the latest partition,
etc.

For example, say a new customer has a bunch of partitioned tables which each
currently have only one partition (for the current month), and that's expected
to grow to at least 20+ partitions (2+ years of history). How does one set the
partitioned table's auto-analyze parameters to analyze whenever any child is
analyzed? I don't think it should be necessary to update it every month after
computing sum(child tuples).

Possibly you could allow that behavior for some special values of the
threshold. Like if autovacuum_analyze_threshold=-2, then analyze the parent
whenever any of its children are analyzed.

I think that use case and that need would be common, but I'd like to hear what
others think.

Having to constantly pay attention to whether a parent's
analyze_threshold/scale_factor is working as intended would surely be
an annoyance, so I tend to agree that we might need more than just the
ability to set analyze_threshold/scale_factor on parent tables.
However, I think we can at least start with being able to do
*something* here. :) Maybe others think that this shouldn't be
considered committable until we figure out a good analyze threshold
calculation formula to apply to parent tables.

For the cases in which parent's tuple count grows at about the same
rate as partitions (hash mainly), I guess the existing formula more or
less works. That is, we can set the parent's threshold/scale_factor
same as partitions' and the autovacuum's existing formula will ensure
that the parent is auto-analyzed in time and not more than needed. For
time-series partitioning, the same formula won't work, as you have
detailed in your comments. Is there any other partitioning pattern for
which the current formula won't work?
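
To make the time-series case concrete, here is a minimal sketch of the arithmetic; it is not code from the patch. It assumes the parent's reltuples is the sum over equally sized partitions, and the helper names are made up for illustration.

```c
#include <assert.h>

/*
 * The existing trigger: a table is analyzed once the changes since the last
 * analyze exceed base_thresh + scale_factor * reltuples.
 */
double
analyze_threshold(double base_thresh, double scale_factor, double reltuples)
{
    assert(reltuples >= 0);
    return base_thresh + scale_factor * reltuples;
}

/*
 * Hypothetical helper: the same formula applied to a parent whose reltuples
 * is the sum over n_parts partitions of tuples_per_part rows each.
 */
double
parent_analyze_threshold(double base_thresh, double scale_factor,
                         int n_parts, double tuples_per_part)
{
    assert(n_parts >= 0);
    return analyze_threshold(base_thresh, scale_factor,
                             n_parts * tuples_per_part);
}
```

With a base threshold of 50 and a scale factor of 0.1, parent_analyze_threshold(50, 0.1, 1, 1000000) is about 100050 changes, but with 24 such partitions it is about 2400050. If only the newest partition receives rows, that is more than two whole partitions' worth of inserts before the parent is analyzed again.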

Considering that, how about having, say, an
autovacuum_analyze_partition_parent_frequency, with string values
'default', 'partition'? -- 'default' assumes the same formula as
regular tables, whereas with 'partition', parent is analyzed as soon
as a partition is.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

#34Justin Pryzby
pryzby@telsasoft.com
In reply to: Amit Langote (#33)
Re: Autovacuum on partitioned table (autoanalyze)

On Fri, Apr 17, 2020 at 10:09:07PM +0900, Amit Langote wrote:

On Thu, Apr 16, 2020 at 11:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Thu, Apr 16, 2020 at 06:16:45PM +0900, yuzuko wrote:
I don't think that adequately allows what's needed.

...(paragraph with my typos elided)...

For example, say a new customer has a bunch of partitioned tables which each
currently have only one partition (for the current month), and that's expected
to grow to at least 20+ partitions (2+ years of history). How does one set the
partitioned table's auto-analyze parameters to analyze whenever any child is
analyzed? I don't think it should be necessary to update it every month after
computing sum(child tuples).

Possibly you could allow that behavior for some special values of the
threshold. Like if autovacuum_analyze_threshold=-2, then analyze the parent
whenever any of its children are analyzed.

I think that use case and that need would be common, but I'd like to hear what
others think.

Having to constantly pay attention to whether a parent's
analyze_threshold/scale_factor is working as intended would surely be
an annoyance, so I tend to agree that we might need more than just the
ability to set analyze_threshold/scale_factor on parent tables.
However, I think we can at least start with being able to do
*something* here. :) Maybe others think that this shouldn't be
considered committable until we figure out a good analyze threshold
calculation formula to apply to parent tables.

Considering that, how about having, say, an
autovacuum_analyze_partition_parent_frequency, with string values
'default', 'partition'? -- 'default' assumes the same formula as
regular tables, whereas with 'partition', parent is analyzed as soon
as a partition is.

I assume you mean a reloption to be applied only to partitioned tables,

Your "partition" setting would mean that the scale/threshold values would have
no effect, which seems kind of unfortunate.

I think it should be called something else, and done differently, like maybe:
autovacuum_analyze_mode = {off,sum,max,...}

The threshold would be threshold + scale*tuples, as always, but would be
compared with f(changes) as determined by the relopt.

sum(changes) would do what you called "default", comparing the sum(changes)
across all partitions to the threshold, which is itself computed using
sum(reltuples) AS reltuples.

max(changes) would compute max(changes) compared to the threshold, and the
threshold would be computed separately for each partition's reltuples:
threshold_N = parent_threshold + parent_scale * part_N_tuples. If *any*
partition exceeds that threshold, the partition itself is analyzed. This
allows what I want for time-series. Maybe this would have an alias called
"any".

I'm not sure if there's any other useful modes, like avg(changes)? I guess we
can add them later if someone thinks of a good use case.
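
A rough sketch of how the sum and max modes described above could differ. The enum, struct and function names are hypothetical and nothing here is from the patch. It also assumes that, in max mode, a partition exceeding its threshold_N is what triggers an analyze of the parent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values for the proposed autovacuum_analyze_mode reloption. */
typedef enum
{
    ANALYZE_MODE_OFF,
    ANALYZE_MODE_SUM,
    ANALYZE_MODE_MAX
} AnalyzeMode;

/* Per-partition inputs; field names are illustrative only. */
typedef struct
{
    double reltuples;   /* estimated rows in the partition */
    double changes;     /* changes since the last analyze */
} PartStats;

bool
parent_needs_analyze(AnalyzeMode mode, double parent_threshold,
                     double parent_scale, const PartStats *parts, int nparts)
{
    double sum_tuples = 0;
    double sum_changes = 0;
    int    i;

    assert(nparts >= 0);
    if (mode == ANALYZE_MODE_OFF)
        return false;

    for (i = 0; i < nparts; i++)
    {
        /* max: threshold_N is computed from this partition's reltuples */
        if (mode == ANALYZE_MODE_MAX &&
            parts[i].changes >
            parent_threshold + parent_scale * parts[i].reltuples)
            return true;

        sum_tuples += parts[i].reltuples;
        sum_changes += parts[i].changes;
    }

    if (mode == ANALYZE_MODE_MAX)
        return false;

    /* sum: total changes compared against a threshold on total reltuples */
    return sum_changes > parent_threshold + parent_scale * sum_tuples;
}
```

For three partitions of 1000000 rows where only the newest has 150000 changes, sum mode does not fire with threshold 50 and scale 0.1 (150000 is below roughly 300050), while max mode does (150000 is above roughly 100050).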

Also, for me, the v7 patch warns:
|src/backend/postmaster/autovacuum.c:3117:70: warning: ‘reltuples’ may be used uninitialized in this function [-Wmaybe-uninitialized]
| vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * reltuples;
..which seems to be a false positive, but easily avoided.

This patch includes partitioned tables in pg_stat_*_tables, which is great; I
complained awhile ago that they were missing [0]. It might be useful if that
part was split out into a separate 0001 patch (?).

Thanks,
--
Justin

[0]: /messages/by-id/20180601221428.GU5164@telsasoft.com

#35yuzuko
yuzukohosoya@gmail.com
In reply to: Justin Pryzby (#34)
Re: Autovacuum on partitioned table (autoanalyze)

Hello,

On Sat, Apr 18, 2020 at 2:08 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Fri, Apr 17, 2020 at 10:09:07PM +0900, Amit Langote wrote:

On Thu, Apr 16, 2020 at 11:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Thu, Apr 16, 2020 at 06:16:45PM +0900, yuzuko wrote:
I don't think that adequately allows what's needed.

...(paragraph with my typos elided)...

For example, say a new customer has a bunch of partitioned tables which each
currently have only one partition (for the current month), and that's expected
to grow to at least 20+ partitions (2+ years of history). How does one set the
partitioned table's auto-analyze parameters to analyze whenever any child is
analyzed? I don't think it should be necessary to update it every month after
computing sum(child tuples).

Possibly you could allow that behavior for some special values of the
threshold. Like if autovacuum_analyze_threshold=-2, then analyze the parent
whenever any of its children are analyzed.

I think that use case and that need would be common, but I'd like to hear what
others think.

Having to constantly pay attention to whether a parent's
analyze_threshold/scale_factor is working as intended would surely be
an annoyance, so I tend to agree that we might need more than just the
ability to set analyze_threshold/scale_factor on parent tables.
However, I think we can at least start with being able to do
*something* here. :) Maybe others think that this shouldn't be
considered committable until we figure out a good analyze threshold
calculation formula to apply to parent tables.

Considering that, how about having, say, an
autovacuum_analyze_partition_parent_frequency, with string values
'default', 'partition'? -- 'default' assumes the same formula as
regular tables, whereas with 'partition', parent is analyzed as soon
as a partition is.

I assume you mean a reloption to be applied only to partitioned tables,

Your "partition" setting would mean that the scale/threshold values would have
no effect, which seems kind of unfortunate.

I think it should be called something else, and done differently, like maybe:
autovacuum_analyze_mode = {off,sum,max,...}

Would the reloption you suggested above apply to all tables?
Users might not use it for partitions, so I think we should add "parent"
to the reloption's name, as in Amit's suggestion.

The threshold would be threshold + scale*tuples, as always, but would be
compared with f(changes) as determined by the relopt.

sum(changes) would do what you called "default", comparing the sum(changes)
across all partitions to the threshold, which is itself computed using
sum(reltuples) AS reltuples.

max(changes) would compute max(changes) compared to the threshold, and the
threshold would be computed separately for each partition's reltuples:
threshold_N = parent_threshold + parent_scale * part_N_tuples. If *any*
partition exceeds that threshold, the partition itself is analyzed. This
allows what I want for time-series. Maybe this would have an alias called
"any".

I may be wrong, but I think the formula

threshold_N = parent_threshold + parent_scale * part_N_tuples

would use the ordinary table's threshold, not the parent's. If it uses
parent_threshold, the parent might not be analyzed even when one of its
partitions is analyzed, if parent_threshold is larger than the normal
threshold. I'm worried about whether this case meets the requirements for
time-series.

I'm not sure if there's any other useful modes, like avg(changes)? I guess we
can add them later if someone thinks of a good use case.

Also, for me, the v7 patch warns:
|src/backend/postmaster/autovacuum.c:3117:70: warning: ‘reltuples’ may be used uninitialized in this function [-Wmaybe-uninitialized]
| vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * reltuples;
..which seems to be a false positive, but easily avoided.

Thank you for testing the patch.
I got it. I'll update the patch soon.

This patch includes partitioned tables in pg_stat_*_tables, which is great; I
complained awhile ago that they were missing [0]. It might be useful if that
part was split out into a separate 0001 patch (?).

If the partitioned table's statistics are used for other purposes, I think
it would be better to split the patch. Does anyone have an opinion?

---
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#36Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#28)
Re: Autovacuum on partitioned table (autoanalyze)

On Wed, Mar 18, 2020 at 11:30:39AM -0500, Justin Pryzby wrote:

In the past, I think there was talk that maybe someone would invent a clever
way to dynamically combine all the partitions' statistics, so analyzing the
parent wasn't needed. [...]

I happened across the thread I was referring to:
/messages/by-id/7363.1426537103@sss.pgh.pa.us

I'm not opposed to doing things the currently-proposed way (trigger analyze of
partitioned tables based on updates, same as nonpartitioned tables), but we
should think if it's worth doing something totally different, like what Tom
proposed.

Robert had concerns that it would increase planning time. I imagine that
argument is even stronger now, since PG12 has *less* planning time for large
hierarchies (428b260f8) and advertises support for "thousands" of partitions.

Tom said:

we would automatically get statistics that account for
partitions being eliminated by constraint exclusion, because only the
non-eliminated partitions are present in the appendrel. And second,

That's a pretty strong benefit. I don't know if there's a good way to support
both (either) ways of doing things. Like maybe a reloption that allows
triggering autovacuum on partitioned tables, but if no statistics exist on a
partitioned table, then the planner would dynamically determine the selectivity
by descending into child statistics (Tom's way). I think the usual way this
would play out is that someone with a small partition hierarchy would
eventually complain about high planning time and then we'd suggest implementing
a manual ANALYZE job.

I'm not sure it's good to support two ways anyway, since 1) I think Tom's way
gives different (better) statistics (due to excluding stats of excluded
partitions); 2) there's not a good way to put an ANALYZE job in place and then
get rid of parent stats (have to DELETE FROM pg_statistic WHERE
starelid='...'::regclass); 3) if someone implements an ANALYZE job, but they
disable it or it stops working, then they have outdated stats forever.

--
Justin

#37Amit Langote
amitlangote09@gmail.com
In reply to: Justin Pryzby (#36)
Re: Autovacuum on partitioned table (autoanalyze)

On Sat, Apr 25, 2020 at 11:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Wed, Mar 18, 2020 at 11:30:39AM -0500, Justin Pryzby wrote:

In the past, I think there was talk that maybe someone would invent a clever
way to dynamically combine all the partitions' statistics, so analyzing the
parent wasn't needed. [...]

I happened across the thread I was referring to:
/messages/by-id/7363.1426537103@sss.pgh.pa.us

I'm not opposed to doing things the currently-proposed way (trigger analyze of
partitioned tables based on updates, same as nonpartitioned tables), but we
should think if it's worth doing something totally different, like what Tom
proposed.

Robert had concerns that it would increase planning time. I imagine that
argument is even stronger now, since PG12 has *less* planning time for large
hierarchies (428b260f8) and advertises support for "thousands" of partitions.

Tom said:

we would automatically get statistics that account for
partitions being eliminated by constraint exclusion, because only the
non-eliminated partitions are present in the appendrel. And second,

That's a pretty strong benefit. I don't know if there's a good way to support
both (either) ways of doing things. Like maybe a reloption that allows
triggering autovacuum on partitioned tables, but if no statistics exist on a
partitioned table, then the planner would dynamically determine the selectivity
by descending into child statistics (Tom's way). I think the usual way this
would play out is that someone with a small partition hierarchy would
eventually complain about high planning time and then we'd suggest implementing
a manual ANALYZE job.

I'm not sure it's good to support two ways anyway, since 1) I think Tom's way
gives different (better) statistics (due to excluding stats of excluded
partitions); 2) there's not a good way to put an ANALYZE job in place and then
get rid of parent stats (have to DELETE FROM pg_statistic WHERE
starelid='...'::regclass); 3) if someone implements an ANALYZE job, but they
disable it or it stops working, then they have outdated stats forever.

Thanks for sharing that thread, had not seen it before.

I remember discussing with Alvaro and Hosoya-san an approach of
generating the whole-tree pg_statistics entries by combining the
children's entries, not during planning as the linked thread
discusses, but inside autovacuum. The motivation for that design was
the complaint that we scan the children twice with the current method
of generating whole-tree statistics -- first to generate their own
statistics and then again to generate the parent's.
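
As a minimal sketch of what combining children's entries could look like for the simplest per-column values, assuming only a null fraction and an average width are merged: the struct and function below are hypothetical and only loosely modeled on pg_statistic's stanullfrac and stawidth. Values such as the number of distinct entries or histograms cannot be merged by a weighted average like this, which is part of why the real thing would be hard.

```c
#include <assert.h>

/* Illustrative per-column summary for one partition. */
typedef struct
{
    double reltuples;   /* rows in the partition */
    double null_frac;   /* fraction of rows where the column is NULL */
    double avg_width;   /* average width in bytes of non-null values */
} ChildColStats;

/*
 * Combine per-partition values into whole-tree values without rescanning
 * any data.  Returns the total row count and writes the combined null
 * fraction and average width through the output pointers.
 */
double
combine_child_stats(const ChildColStats *children, int nchildren,
                    double *null_frac, double *avg_width)
{
    double total = 0;
    double nulls = 0;
    double nonnull = 0;
    double width_sum = 0;
    int    i;

    assert(nchildren >= 0);
    for (i = 0; i < nchildren; i++)
    {
        double n_null = children[i].reltuples * children[i].null_frac;
        double n_nonnull = children[i].reltuples - n_null;

        total += children[i].reltuples;
        nulls += n_null;
        nonnull += n_nonnull;
        /* weight each child's average width by its non-null row count */
        width_sum += n_nonnull * children[i].avg_width;
    }

    *null_frac = (total > 0) ? nulls / total : 0;
    *avg_width = (nonnull > 0) ? width_sum / nonnull : 0;
    return total;
}
```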

Aside from how hard it would be to actually implement, that approach
also doesn't address the concern about when to generate the whole-tree
statistics. Because the linked thread mentions getting rid of the
whole-tree statistics altogether, there is no such concern if we go
its way. Although I do agree with Robert's assertion on that thread
that making every query on a parent a bit slower would not be a good
compromise.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

#38Daniel Gustafsson
daniel@yesql.se
In reply to: yuzuko (#35)
Re: Autovacuum on partitioned table (autoanalyze)

On 21 Apr 2020, at 18:21, yuzuko <yuzukohosoya@gmail.com> wrote:

I'll update the patch soon.

Do you have an updated version to submit? The previous patch no longer applies
to HEAD, so I'm marking this entry Waiting on Author in the meantime.

cheers ./daniel

#39yuzuko
yuzukohosoya@gmail.com
In reply to: Daniel Gustafsson (#38)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

On Wed, Jul 1, 2020 at 6:26 PM Daniel Gustafsson <daniel@yesql.se> wrote:

On 21 Apr 2020, at 18:21, yuzuko <yuzukohosoya@gmail.com> wrote:

I'll update the patch soon.

Do you have an updated version to submit? The previous patch no longer applies
to HEAD, so I'm marking this entry Waiting on Author in the meantime.

Thank you for letting me know.
I attach the latest patch, which applies to HEAD.

I think there are other approaches, like Tom's idea that Justin previously
referenced, but this patch works the same way as the previous patches:
it tracks updated/inserted/deleted tuples and checks whether the partitioned
table needs auto-analyze, same as for nonpartitioned tables.
I wanted to make it possible for autovacuum to analyze partitioned tables
as a first step, and I think this approach is the simplest way to do it.
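
The change tracking can be pictured with a simplified model like the following. This is only a sketch and not the patch's code: the table array, the parent index and the function name are assumptions made for illustration.

```c
#include <assert.h>

#define NO_PARENT (-1)

/* One entry per table; "parent" is an index into the same array. */
typedef struct
{
    int  parent;            /* parent table's index, or NO_PARENT */
    long changed_tuples;    /* changes since the last analyze */
} TableCounter;

/*
 * At commit, add a transaction's inserted + updated + deleted tuple count
 * ("delta") to the modified leaf partition and to every ancestor above it.
 */
void
propagate_changes(TableCounter *tables, int ntables, int leaf, long delta)
{
    int t = leaf;
    int hops;

    assert(leaf >= 0 && leaf < ntables);

    /* the hop limit guards against a malformed (cyclic) parent chain */
    for (hops = 0; t != NO_PARENT && hops < ntables; hops++)
    {
        assert(t >= 0 && t < ntables);
        tables[t].changed_tuples += delta;
        t = tables[t].parent;
    }
}
```

Each partitioned table then accumulates the changes of everything below it, so the usual threshold check can be applied to it like to any other table.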

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v8_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index dc688c415f..1f212b4d68 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1322,8 +1322,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5314e9348f..78cc3a0e84 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -585,7 +585,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -605,7 +605,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..de7f1c3bb1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,13 +655,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d32de23e62..0c9319467f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -829,6 +829,7 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 															  part_oid,
 															  vrel->va_cols));
+
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60..41ef280646 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2032,11 +2033,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2059,7 +2060,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2722,6 +2724,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -2996,6 +2999,38 @@ relation_needs_vacanalyze(Oid relid,
 	AssertArg(OidIsValid(relid));
 
 	/*
+	 * If the relation is a partitioned table, we must add up childrens'
+	 * reltuples.
+	 */
+	if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *children;
+		ListCell *lc;
+
+		reltuples = 0;
+
+		/* Find all members of inheritance set taking AccessShareLock */
+		children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+		foreach(lc, children)
+		{
+			Oid        childOID = lfirst_oid(lc);
+			HeapTuple  childtuple;
+			Form_pg_class childclass;
+
+			childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+			childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+			/* Skip foreign partitions */
+			if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+				continue;
+
+			/* Sum up the child's reltuples for its parent table */
+			reltuples += childclass->reltuples;
+		}
+	}
+
+	/*
 	 * Determine vacuum/analyze equation parameters.  We have two possible
 	 * sources: the passed reloptions (which could be a main table or a toast
 	 * table), or the autovacuum GUC variables.
@@ -3071,7 +3106,8 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..116d24facb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -1529,8 +1530,12 @@ pgstat_report_analyze(Relation rel,
 	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
 	msg.m_resetcounter = resetcounter;
 	msg.m_analyzetime = GetCurrentTimestamp();
-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0  /* partitioned tables don't have any data, so it's 0 */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* partitioned tables don't have any data, so it's 0 */
+		: deadtuples;
 	pgstat_send(&msg, sizeof(msg));
 }
 
@@ -1807,7 +1812,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) ||
+		relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -2003,6 +2009,28 @@ pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2024,6 +2052,28 @@ pgstat_count_heap_update(Relation rel, bool hot)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2049,6 +2099,28 @@ pgstat_count_heap_delete(Relation rel)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2203,6 +2275,29 @@ AtEOXact_PgStat(bool isCommit, bool parallel)
 				tabstat->t_counts.t_changed_tuples +=
 					trans->tuples_inserted + trans->tuples_updated +
 					trans->tuples_deleted;
+
+				/*
+				 * If this relation is partitioned, propagate its own
+				 * changed_tuples to their all ancestors.
+				 */
+				if (tabstat->ancestors != NULL)
+				{
+					int i = 0;
+
+					for(;;)
+					{
+						PgStat_TableStatus *entry;
+						Oid                relid = tabstat->ancestors[i];
+
+						if(relid == InvalidOid)
+							break;
+
+						entry = get_tabstat_entry(relid, false);
+						entry->t_counts.t_changed_tuples +=
+							tabstat->t_counts.t_changed_tuples;
+						++i;
+					}
+				}
 			}
 			else
 			{
@@ -6354,7 +6449,6 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
-
 /* ----------
  * pgstat_recv_archiver() -
  *
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..239e7e688a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -156,6 +156,7 @@ typedef enum PgStat_Single_Reset_Type
 typedef struct PgStat_TableStatus
 {
 	Oid			t_id;			/* table's OID */
+	Oid        *ancestors;      /* all ancestors */
 	bool		t_shared;		/* is it a shared catalog? */
 	struct PgStat_TableXactStatus *trans;	/* lowest subxact's counts */
 	PgStat_TableCounts t_counts;	/* event counts to be sent */
@@ -403,7 +404,6 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
-
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
  * ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..8c31310f9c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1792,7 +1792,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2151,7 +2151,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,

#40Daniel Gustafsson
daniel@yesql.se
In reply to: yuzuko (#39)
Re: Autovacuum on partitioned table (autoanalyze)

On 6 Jul 2020, at 12:35, yuzuko <yuzukohosoya@gmail.com> wrote:

On Wed, Jul 1, 2020 at 6:26 PM Daniel Gustafsson <daniel@yesql.se> wrote:

On 21 Apr 2020, at 18:21, yuzuko <yuzukohosoya@gmail.com> wrote:

I'll update the patch soon.

Do you have an updated version to submit? The previous patch no longer applies
to HEAD, so I'm marking this entry Waiting on Author in the meantime.

Thank you for letting me know.
I attach the latest patch, which applies to HEAD.

This version seems to fail under -Werror, which is used in the Travis builds:

autovacuum.c: In function ‘relation_needs_vacanalyze’:
autovacuum.c:3117:59: error: ‘reltuples’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
^
autovacuum.c:2972:9: note: ‘reltuples’ was declared here
float4 reltuples; /* pg_class.reltuples */
^

I've moved this patch to the next commitfest, but kept the status as Waiting on
Author. Please submit a new version of the patch.
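
For readers unfamiliar with this class of warning: GCC reports it when a variable is assigned only on some code paths and then read unconditionally. The following is a minimal sketch of the usual remedy, not code from the patch; analyze_threshold() and its parameters are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the threshold computation quoted in the
 * compiler output above.  The point is only that reltuples receives a
 * value on every path before it is read.
 */
float
analyze_threshold(bool have_stats, float base_thresh, float scale_factor,
				  float table_reltuples)
{
	float		reltuples = 0;	/* initialised up front, so no path reads garbage */

	if (have_stats)
		reltuples = table_reltuples;

	return base_thresh + scale_factor * reltuples;
}
```

Initialising at the declaration is the simplest fix; assigning in every branch of the conditional, as the v9 patch does, achieves the same thing.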

cheers ./daniel

#41yuzuko
yuzukohosoya@gmail.com
In reply to: Daniel Gustafsson (#40)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

I'm sorry for the late reply.

This version seems to fail under -Werror, which is used in the Travis builds:

autovacuum.c: In function ‘relation_needs_vacanalyze’:
autovacuum.c:3117:59: error: ‘reltuples’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
^
autovacuum.c:2972:9: note: ‘reltuples’ was declared here
float4 reltuples; /* pg_class.reltuples */
^

I attach the latest patch that fixes the above -Werror failure.
Could you please check it again?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v9_autovacuum_on_partitioned_table.patchapplication/octet-stream; name=v9_autovacuum_on_partitioned_table.patchDownload
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index dc688c415f..1f212b4d68 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1322,8 +1322,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8625cbeab6..60f311f4f2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -585,7 +585,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -605,7 +605,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..4c91e48a21 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,13 +644,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 23eb605d4c..d274852a78 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -829,6 +829,7 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 															  part_oid,
 															  vrel->va_cols));
+
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c6ec657a93..3154d8fd4d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2036,11 +2037,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2063,7 +2064,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2726,6 +2728,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3075,7 +3078,40 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up childrens'
+			 * reltuples.
+			 */
+			List     *children;
+			ListCell *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip foreign partitions */
+				if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+					continue;
+
+				/* Sum up the child's reltuples for its parent table */
+				reltuples += childclass->reltuples;
+			}
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..5730c418c9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -1529,8 +1530,12 @@ pgstat_report_analyze(Relation rel,
 	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
 	msg.m_resetcounter = resetcounter;
 	msg.m_analyzetime = GetCurrentTimestamp();
-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0  /* partitioned tables don't have any data, so it's 0 */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* partitioned tables don't have any data, so it's 0 */
+		: deadtuples;
 	pgstat_send(&msg, sizeof(msg));
 }
 
@@ -1807,7 +1812,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) ||
+		relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -2003,6 +2009,28 @@ pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2024,6 +2052,28 @@ pgstat_count_heap_update(Relation rel, bool hot)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2049,6 +2099,28 @@ pgstat_count_heap_delete(Relation rel)
 		/* We have to log the effect at the proper transactional level */
 		int			nest_level = GetCurrentTransactionNestLevel();
 
+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)
+		{
+			List      *ancestors;
+			ListCell  *lc;
+			int       i = 0;
+
+			ancestors = get_partition_ancestors(rel->rd_rel->oid);
+			pgstat_info->ancestors =
+				(Oid *) MemoryContextAllocZero(TopTransactionContext,
+											   sizeof(Oid) * (ancestors->length + 1));
+			foreach(lc, ancestors)
+			{
+				pgstat_info->ancestors[i] = lfirst_oid(lc);
+				++i;
+			}
+		}
+
 		if (pgstat_info->trans == NULL ||
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
@@ -2203,6 +2275,29 @@ AtEOXact_PgStat(bool isCommit, bool parallel)
 				tabstat->t_counts.t_changed_tuples +=
 					trans->tuples_inserted + trans->tuples_updated +
 					trans->tuples_deleted;
+
+				/*
+				 * If this relation is partitioned, propagate its own
+				 * changed_tuples to their all ancestors.
+				 */
+				if (tabstat->ancestors != NULL)
+				{
+					int i = 0;
+
+					for(;;)
+					{
+						PgStat_TableStatus *entry;
+						Oid                relid = tabstat->ancestors[i];
+
+						if(relid == InvalidOid)
+							break;
+
+						entry = get_tabstat_entry(relid, false);
+						entry->t_counts.t_changed_tuples +=
+							tabstat->t_counts.t_changed_tuples;
+						++i;
+					}
+				}
 			}
 			else
 			{
@@ -6358,7 +6453,6 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
-
 /* ----------
  * pgstat_recv_archiver() -
  *
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..239e7e688a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -156,6 +156,7 @@ typedef enum PgStat_Single_Reset_Type
 typedef struct PgStat_TableStatus
 {
 	Oid			t_id;			/* table's OID */
+	Oid        *ancestors;      /* all ancestors */
 	bool		t_shared;		/* is it a shared catalog? */
 	struct PgStat_TableXactStatus *trans;	/* lowest subxact's counts */
 	PgStat_TableCounts t_counts;	/* event counts to be sent */
@@ -403,7 +404,6 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
-
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
  * ----------
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 601734a6f1..14d8af91c1 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1794,7 +1794,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2150,7 +2150,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#42Daniel Gustafsson
daniel@yesql.se
In reply to: yuzuko (#41)
Re: Autovacuum on partitioned table (autoanalyze)

On 17 Aug 2020, at 08:11, yuzuko <yuzukohosoya@gmail.com> wrote:

I'm sorry for the late reply.

This version seems to fail under -Werror, which is used in the Travis builds:

autovacuum.c: In function ‘relation_needs_vacanalyze’:
autovacuum.c:3117:59: error: ‘reltuples’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
^
autovacuum.c:2972:9: note: ‘reltuples’ was declared here
float4 reltuples; /* pg_class.reltuples */
^

I attach the latest patch that fixes the above -Werror failure.
Could you please check it again?

This version now passes the tests in the Travis pipeline as can be seen in the
link below, and is ready to be reviewed in the upcoming commitfest:

http://cfbot.cputube.org/yuzuko-hosoya.html

cheers ./daniel

#43Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Gustafsson (#40)
Re: Autovacuum on partitioned table (autoanalyze)

At Tue, 25 Aug 2020 14:28:20 +0200, Daniel Gustafsson <daniel@yesql.se> wrote in

I attach the latest patch that fixes the above -Werror failure.
Could you please check it again?

This version now passes the tests in the Travis pipeline as can be seen in the
link below, and is ready to be reviewed in the upcoming commitfest:

http://cfbot.cputube.org/yuzuko-hosoya.html

At Mon, 6 Jul 2020 19:35:37 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in

I think there are other approaches, like Tom's idea that Justin previously
referenced, but this patch works the same way as the previous patches:
it tracks updated/inserted/deleted tuples and checks whether the partitioned
table needs auto-analyze, the same as for nonpartitioned tables.
I wanted to make it possible to analyze partitioned tables by autovacuum
as a first step, and I think this approach is the simplest way to do it.
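
The check described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patch itself: ChildRel, sum_child_reltuples() and parent_needs_analyze() are hypothetical names, and the real code reads reltuples from pg_class and the change counter from the stats collector.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one child of a partitioned table. */
typedef struct ChildRel
{
	bool		is_foreign;		/* foreign partitions are skipped */
	float		reltuples;		/* stand-in for pg_class.reltuples */
} ChildRel;

/* A partitioned table has no rows of its own, so add up its children. */
float
sum_child_reltuples(const ChildRel *children, size_t nchildren)
{
	float		total = 0;

	for (size_t i = 0; i < nchildren; i++)
	{
		if (children[i].is_foreign)
			continue;
		total += children[i].reltuples;
	}
	return total;
}

/* Same threshold rule as for a plain table, using the summed reltuples. */
bool
parent_needs_analyze(float changes_since_analyze, float base_thresh,
					 float scale_factor,
					 const ChildRel *children, size_t nchildren)
{
	float		anlthresh = base_thresh +
		scale_factor * sum_child_reltuples(children, nchildren);

	return changes_since_analyze > anlthresh;
}
```

With children of 100 and 300 tuples plus one foreign partition, a base threshold of 50 and a scale factor of 0.5, the threshold is 250, so the parent is analyzed only once more than 250 changes have accumulated.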

I'm not sure whether anything bad happens if the parent and children do not
agree on statistics.

The requirement suggested here seems to be:

- We want to update the parent's stats when any of its children gets its
stats updated. This is crucial especially for time-series
partitioning.

- However, we don't want to analyze the whole tree every time any of the
children is analyzed.

To achieve both, stats-merging seems to be the optimal solution.

Putting that aside, I had a brief look at the latest patch.

 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) ||
+		relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;

RELKIND_HAS_STORAGE(RELKIND_PARTITIONED_TABLE) is already false.
Maybe you wanted to do "&& relkind !=" instead :p
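
To make the difference concrete, here is a small sketch comparing the two conditions. The RELKIND_HAS_STORAGE definition below is a simplified stand-in written for this example (it is meant to mirror the real macro, but treat it as an assumption), and the two function names are hypothetical.

```c
#include <stdbool.h>

/* Relkind letters as stored in pg_class.relkind. */
#define RELKIND_RELATION			'r'
#define RELKIND_INDEX				'i'
#define RELKIND_SEQUENCE			'S'
#define RELKIND_TOASTVALUE			't'
#define RELKIND_MATVIEW				'm'
#define RELKIND_VIEW				'v'
#define RELKIND_PARTITIONED_TABLE	'p'

/* Simplified stand-in: partitioned tables are not among the storage kinds. */
#define RELKIND_HAS_STORAGE(relkind) \
	((relkind) == RELKIND_RELATION || \
	 (relkind) == RELKIND_INDEX || \
	 (relkind) == RELKIND_SEQUENCE || \
	 (relkind) == RELKIND_TOASTVALUE || \
	 (relkind) == RELKIND_MATVIEW)

/* Condition as posted: the second operand never changes the result. */
bool
skips_stats_as_posted(char relkind)
{
	return !RELKIND_HAS_STORAGE(relkind) ||
		relkind == RELKIND_PARTITIONED_TABLE;
}

/* Condition as suggested: partitioned tables are exempted from the skip. */
bool
skips_stats_as_suggested(char relkind)
{
	return !RELKIND_HAS_STORAGE(relkind) &&
		relkind != RELKIND_PARTITIONED_TABLE;
}
```

Under these definitions the posted condition still skips partitioned tables, while the suggested one lets them get a stats entry and keeps skipping views and other storage-less relations.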

+		/*
+		 * If this relation is partitioned, we store all ancestors' oid
+		 * to propagate its changed_tuples to their parents when this
+		 * transaction is committed.
+		 */
+		if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)

If the relation is detached and then attached to a different parent within
a transaction, the ancestor list becomes stale and subsequent
modifications to the relation are propagated to the wrong ancestors.

I think vacuum time is a more appropriate point to modify the ancestors'
stats. It seems to me that what Alvaro pointed out is the
list-order-susceptible manner of collecting children's modified tuples.

+ ? 0 /* partitioned tables don't have any data, so it's 0 */

If the comment is true, we shouldn't have non-zero t_changed_tuples
either. I think the reason for the lines is something different.

# Oops! Time's up now.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#44yuzuko
yuzukohosoya@gmail.com
In reply to: Kyotaro Horiguchi (#43)
Re: Autovacuum on partitioned table (autoanalyze)

Horiguchi-san,

Thank you for reviewing.

On Tue, Sep 15, 2020 at 7:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 25 Aug 2020 14:28:20 +0200, Daniel Gustafsson <daniel@yesql.se> wrote in

I attach the latest patch that fixes the above -Werror failure.
Could you please check it again?

This version now passes the tests in the Travis pipeline as can be seen in the
link below, and is ready to be reviewed in the upcoming commitfest:

http://cfbot.cputube.org/yuzuko-hosoya.html

At Mon, 6 Jul 2020 19:35:37 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in

I think there are other approaches, like Tom's idea that Justin previously
referenced, but this patch works the same way as the previous patches:
it tracks updated/inserted/deleted tuples and checks whether the partitioned
table needs auto-analyze, the same as for nonpartitioned tables.
I wanted to make it possible to analyze partitioned tables by autovacuum
as a first step, and I think this approach is the simplest way to do it.

I'm not sure whether anything bad happens if the parent and children do not
agree on statistics.

The requirement suggested here seems to be:

- We want to update the parent's stats when any of its children gets its
stats updated. This is crucial especially for time-series
partitioning.

- However, we don't want to analyze the whole tree every time any of the
children is analyzed.

To achieve both, stats-merging seems to be the optimal solution.

Putting that aside, I had a brief look at the latest patch.

 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) ||
+		relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;

RELKIND_HAS_STORAGE(RELKIND_PARTITIONED_TABLE) is already false.
Maybe you wanted to do "&& relkind !=" instead :p

Oh, indeed. I'll fix it.

+               /*
+                * If this relation is partitioned, we store all ancestors' oid
+                * to propagate its changed_tuples to their parents when this
+                * transaction is committed.
+                */
+               if (rel->rd_rel->relispartition && pgstat_info->ancestors == NULL)

If the relation is detached and then attached to a different parent within
a transaction, the ancestor list becomes stale and subsequent
modifications to the relation are propagated to the wrong ancestors.

I think vacuum time is a more appropriate point to modify the ancestors'
stats. It seems to me that what Alvaro pointed out is the
list-order-susceptible manner of collecting children's modified tuples.

I previously proposed a patch that modified the ancestors' stats at vacuum time.
At that time, following comments from Alvaro and Amit, I tried to update the
parents' changes_since_analyze in every ANALYZE. However, in that case
the problem mentioned in [1] occurred, and I could not find a way to avoid it.
I think it can be solved by updating the parents' changes_since_analyze
only in the case of autoanalyze. What do you think?

+ ? 0 /* partitioned tables don't have any data, so it's 0 */

If the comment is true, we shouldn't have non-zero t_changed_tuples
either. I think the reason for the lines is something different.

Yes, surely. I think updating the values of live_tuples and dead_tuples
is confusing for users. I'll consider another comment.

[1]: /messages/by-id/CAKkQ50-bwFEDMBGb1JmDXffXsiU8xk-hN6kJK9CKjdBa7r=Hdw@mail.gmail.com
--
Best regards,
Yuzuko Hosoya

NTT Open Source Software Center

#45yuzuko
yuzukohosoya@gmail.com
In reply to: yuzuko (#44)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Hello,

I reconsidered the approach, basing it on the v5 patch in line with
Horiguchi-san's comment.

This approach is as follows:
- Whether a partitioned table needs analyze is checked in
relation_needs_vacanalyze(), just as for a plain table. To do this, we must
store the partitioned table's stats (changes_since_analyze).
- A partitioned table's changes_since_analyze is updated when a leaf
partition is analyzed, by propagating the leaf's changes_since_analyze to it.
At the next scheduled analyze, that value is used in the check above.
That is, the partitioned table is analyzed after its leaf partitions.
- The propagation differs between autoanalyze and plain analyze.
In autoanalyze, a leaf partition's changes_since_analyze is propagated
to *all* ancestors, whereas in a plain analyze on an inheritance tree it is
propagated only to ancestors not included in that tree, to avoid needless
counting.
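
The propagation rule in the last item can be sketched as below. This is only an illustration of the rule as described, not the patch's code: first_ancestor_to_update() is a hypothetical helper, the Oid typedef is a local stand-in, and the sketch assumes the ancestor array is ordered from the nearest parent up to the root.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;		/* local stand-in for the backend's Oid */
#define InvalidOid ((Oid) 0)

/*
 * Return the index of the first ancestor that should receive the leaf's
 * changes_since_analyze; ancestors[result .. nancestors-1] are updated.
 *
 * ancestors[] is assumed to run from the nearest parent to the root.
 * toprel_oid is the relation named in a plain ANALYZE (InvalidOid if none).
 */
size_t
first_ancestor_to_update(const Oid *ancestors, size_t nancestors,
						 Oid toprel_oid, bool is_autoanalyze)
{
	/* Autoanalyze: every ancestor gets the counter. */
	if (is_autoanalyze || toprel_oid == InvalidOid)
		return 0;

	/*
	 * Plain ANALYZE on a tree: ancestors up to and including the top
	 * relation are analyzed by the same command, so skip them.
	 */
	for (size_t i = 0; i < nancestors; i++)
	{
		if (ancestors[i] == toprel_oid)
			return i + 1;
	}

	/* The top relation is the leaf itself: all ancestors are outside. */
	return 0;
}
```

For a leaf whose ancestors are {20, 10, 1}, a plain ANALYZE of relation 10 yields index 2, so only the root (1) receives the counter, whereas autoanalyze yields index 0 and all three ancestors are updated.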

I attach the latest patch to this email.
Could you please check it again?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v10_autovacuum_on_partitioned_table.patchapplication/octet-stream; name=v10_autovacuum_on_partitioned_table.patchDownload
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index fd6777ae01..e66e3c269b 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1322,8 +1322,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 85cd147e21..35e4618fb3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -591,7 +591,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -611,7 +611,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..98907137aa 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -88,7 +88,7 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel,
 						   VacuumParams *params, List *va_cols,
 						   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
-						   bool inh, bool in_outer_xact, int elevel);
+						   bool inh, Oid toprel_oid, bool in_outer_xact, int elevel);
 static void compute_index_stats(Relation onerel, double totalrows,
 								AnlIndexData *indexdata, int nindexes,
 								HeapTuple *rows, int numrows,
@@ -117,7 +117,8 @@ static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
  */
 void
 analyze_rel(Oid relid, RangeVar *relation,
-			VacuumParams *params, List *va_cols, bool in_outer_xact,
+			VacuumParams *params, List *va_cols,
+			Oid toprel_oid, bool in_outer_xact,
 			BufferAccessStrategy bstrategy)
 {
 	Relation	onerel;
@@ -258,14 +259,14 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 */
 	if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
 		do_analyze_rel(onerel, params, va_cols, acquirefunc,
-					   relpages, false, in_outer_xact, elevel);
+					   relpages, false, toprel_oid, in_outer_xact, elevel);
 
 	/*
 	 * If there are child tables, do recursive ANALYZE.
 	 */
 	if (onerel->rd_rel->relhassubclass)
 		do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
-					   true, in_outer_xact, elevel);
+					   true, toprel_oid, in_outer_xact, elevel);
 
 	/*
 	 * Close source relation now, but keep lock so that no one deletes it
@@ -288,8 +289,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 static void
 do_analyze_rel(Relation onerel, VacuumParams *params,
 			   List *va_cols, AcquireSampleRowsFunc acquirefunc,
-			   BlockNumber relpages, bool inh, bool in_outer_xact,
-			   int elevel)
+			   BlockNumber relpages, bool inh, Oid toprel_oid,
+			   bool in_outer_xact, int elevel)
 {
 	int			attr_cnt,
 				tcnt,
@@ -644,15 +645,13 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
+							  (va_cols == NIL), toprel_oid);
 
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddeec870d8..8b956127aa 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -464,7 +464,7 @@ vacuum(List *relations, VacuumParams *params,
 				}
 
 				analyze_rel(vrel->oid, vrel->relation, params,
-							vrel->va_cols, in_outer_xact, vac_strategy);
+							vrel->va_cols, vrel->toprel_oid, in_outer_xact, vac_strategy);
 
 				if (use_own_xacts)
 				{
@@ -791,7 +791,7 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 			oldcontext = MemoryContextSwitchTo(vac_context);
 			vacrels = lappend(vacrels, makeVacuumRelation(vrel->relation,
 														  relid,
-														  vrel->va_cols));
+														  vrel->va_cols, relid));
 			MemoryContextSwitchTo(oldcontext);
 		}
 
@@ -828,7 +828,9 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				oldcontext = MemoryContextSwitchTo(vac_context);
 				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 															  part_oid,
-															  vrel->va_cols));
+															  vrel->va_cols,
+															  relid));
+
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
@@ -894,7 +896,8 @@ get_all_vacuum_rels(int options)
 		oldcontext = MemoryContextSwitchTo(vac_context);
 		vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 													  relid,
-													  NIL));
+													  NIL,
+													  InvalidOid));
 		MemoryContextSwitchTo(oldcontext);
 	}
 
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 49de285f01..8338755659 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -805,12 +805,13 @@ makeGroupingSet(GroupingSetKind kind, List *content, int location)
  *	  create a VacuumRelation node
  */
 VacuumRelation *
-makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols)
+makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols, Oid toprel_oid)
 {
 	VacuumRelation *v = makeNode(VacuumRelation);
 
 	v->relation = relation;
 	v->oid = oid;
 	v->va_cols = va_cols;
+	v->toprel_oid = toprel_oid;
 	return v;
 }
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 480d168346..1d94934c80 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10576,7 +10576,7 @@ opt_name_list:
 vacuum_relation:
 			qualified_name opt_name_list
 				{
-					$$ = (Node *) makeVacuumRelation($1, InvalidOid, $2);
+					$$ = (Node *) makeVacuumRelation($1, InvalidOid, $2, InvalidOid);
 				}
 		;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..1aaddbf823 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2052,11 +2053,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2079,7 +2080,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2742,6 +2744,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3091,7 +3094,40 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up childrens'
+			 * reltuples.
+			 */
+			List     *children;
+			ListCell *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip foreign partitions */
+				if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+					continue;
+
+				/* Sum up the child's reltuples for its parent table */
+				reltuples += childclass->reltuples;
+			}
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
@@ -3155,7 +3191,7 @@ autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy)
 
 	/* Set up one VacuumRelation target, identified by OID, for vacuum() */
 	rangevar = makeRangeVar(tab->at_nspname, tab->at_relname, -1);
-	rel = makeVacuumRelation(rangevar, tab->at_relid, NIL);
+	rel = makeVacuumRelation(rangevar, tab->at_relid, NIL, InvalidOid);
 	rel_list = list_make1(rel);
 
 	vacuum(rel_list, &tab->at_params, bstrategy, true);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 822f0ebc62..aedebffa1e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -360,6 +361,7 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1556,12 +1558,14 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * Support only changes_since_analyze for partitioned tables because
+ * autoanalyze on them requires that counter.
  * --------
  */
 void
 pgstat_report_analyze(Relation rel,
 					  PgStat_Counter livetuples, PgStat_Counter deadtuples,
-					  bool resetcounter)
+					  bool resetcounter, Oid toprel_oid)
 {
 	PgStat_MsgAnalyze msg;
 
@@ -1594,18 +1598,87 @@ pgstat_report_analyze(Relation rel,
 		deadtuples = Max(deadtuples, 0);
 	}
 
+	/*
+	 * If this rel is a leaf partition, add its current changes_since_analyze
+	 * into its ancestors' counts.  This must be done before sending the ANALYZE
+	 * message as it resets the partition's changes_since_analyze counter.
+	 */
+	if (rel->rd_rel->relispartition &&
+		!(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+	{
+		List     *ancestors;
+		ListCell *lc;
+		PgStat_StatDBEntry  *dbentry;
+		PgStat_StatTabEntry *tabentry;
+
+		/* Fetch the pgstat for this table */
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+		tabentry = pgstat_get_tab_entry(dbentry, RelationGetRelid(rel), true);
+
+		/*
+		 * Get its all ancestors to propagate changes_since_analyze count.
+		 * However, when ANALYZE inheritance tree, we get ancestors of
+		 * toprel_oid to avoid needless counting.
+		 */
+		if (!OidIsValid(toprel_oid))
+			ancestors = get_partition_ancestors(RelationGetRelid(rel));
+		else
+			ancestors = get_partition_ancestors(toprel_oid);
+
+		foreach(lc, ancestors)
+		{
+			Oid     parentOid = lfirst_oid(lc);
+			Relation parentrel;
+
+			parentrel = table_open(parentOid, AccessShareLock);
+
+			/* Report changes_since_analyze to the stats collector */
+			pgstat_report_partchanges(parentrel, tabentry->changes_since_analyze);
+
+			table_close(parentrel, AccessShareLock);
+		}
+	}
+
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
 	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
 	msg.m_resetcounter = resetcounter;
 	msg.m_analyzetime = GetCurrentTimestamp();
-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0  /* if this is a partitioned table, skip modifying */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* if this is a partitioned table, skip modifying */
+		: deadtuples;
 	pgstat_send(&msg, sizeof(msg));
 }
 
 /* --------
+ * pgstat_report_partchanges() -
+ *
+ *
+ *  Called when a leaf partition is analyzed to tell the collector about
+ *  its parent's changed_tuples.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges  msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1918,7 +1991,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -4830,6 +4904,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6698,6 +6776,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry   *dbentry;
+	PgStat_StatTabEntry  *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d9475c9989..81af33aedf 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -282,8 +282,8 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 
 /* in commands/analyze.c */
 extern void analyze_rel(Oid relid, RangeVar *relation,
-						VacuumParams *params, List *va_cols, bool in_outer_xact,
-						BufferAccessStrategy bstrategy);
+						VacuumParams *params, List *va_cols, Oid top_parent,
+						bool in_outer_xact, BufferAccessStrategy bstrategy);
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 31d9aedeeb..c57322a56d 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -103,6 +103,6 @@ extern DefElem *makeDefElemExtended(char *nameSpace, char *name, Node *arg,
 
 extern GroupingSet *makeGroupingSet(GroupingSetKind kind, List *content, int location);
 
-extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols);
+extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols, Oid toprel_oid);
 
 #endif							/* MAKEFUNC_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 60c2f45466..c0123a3bfb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3242,6 +3242,7 @@ typedef struct VacuumRelation
 	RangeVar   *relation;		/* table name to process, or NULL */
 	Oid			oid;			/* table's OID; InvalidOid if not looked up */
 	List	   *va_cols;		/* list of column names, or NIL for all */
+	Oid         toprel_oid;     /* top level table's OID if ANALYZE inheritance tree */
 } VacuumRelation;
 
 /* ----------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a821ff4f15..6bdae2469f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -60,6 +60,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -419,6 +420,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges			Sent by the autovacuum deamon
+ *                                  after ANALYZE of leaf partitions
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid           m_databaseid;
+	Oid           m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -637,6 +650,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -1380,8 +1394,8 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 								 PgStat_Counter livetuples, PgStat_Counter deadtuples);
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
-								  bool resetcounter);
-
+								  bool resetcounter, Oid top_parent);
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 492cdcf74c..509231cb32 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1804,7 +1804,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2169,7 +2169,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#46Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: yuzuko (#45)
Re: Autovacuum on partitioned table (autoanalyze)

Thank you for the new version.

At Fri, 23 Oct 2020 15:12:51 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in

Hello,

I reconsidered a way based on the v5 patch in line with
Horiguchi-san's comment.

This approach is as follows:
- A partitioned table is checked in relation_needs_vacanalyze() to see whether
it needs analyze, just like a plain table. To do this, we should store the
partitioned table's stats (changes_since_analyze).
- The partitioned table's changes_since_analyze is updated when a leaf
partition is analyzed, by propagating the leaf's changes_since_analyze.
At the next scheduled analyze time, it is used in the above check.
That is, the partitioned table is analyzed one cycle behind its leaf partitions.
- The propagation process differs between autoanalyze and plain analyze.
In autoanalyze, a leaf partition's changes_since_analyze is propagated
to *all* ancestors, whereas in plain analyze on an inheritance tree it is
propagated only to ancestors not included in the tree, to avoid needless counting.
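
For readers following the thread, the propagation rules above can be modelled with a small standalone sketch. This is not code from the patch: relations are plain array indexes, and `PartTree`, `NO_REL` and `analyze_leaf` are hypothetical names used only for illustration. It assumes a full-column ANALYZE, so the leaf's counter is reset afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REL (-1)		/* stands in for InvalidOid in this toy model */

/* Toy model: relations are array indexes; parent[i] is NO_REL for a root. */
typedef struct PartTree
{
	const int  *parent;					/* parent[i] = index of i's parent */
	long	   *changes_since_analyze;	/* one counter per relation */
	size_t		nrels;
} PartTree;

/*
 * Model of analyzing one leaf partition.  The leaf's counter is added to
 * every ancestor of "start", where start is the leaf itself (autoanalyze,
 * toprel == NO_REL) or the top relation named in a manual ANALYZE.  The
 * leaf's own counter is then reset.
 */
void
analyze_leaf(PartTree *tree, int leaf, int toprel)
{
	int			start = (toprel == NO_REL) ? leaf : toprel;
	long		delta;
	int			rel;

	assert(leaf >= 0 && (size_t) leaf < tree->nrels);
	assert(start >= 0 && (size_t) start < tree->nrels);

	delta = tree->changes_since_analyze[leaf];

	/* Walk upwards from start's parent; start itself is never touched. */
	for (rel = tree->parent[start]; rel != NO_REL; rel = tree->parent[rel])
		tree->changes_since_analyze[rel] += delta;

	tree->changes_since_analyze[leaf] = 0;
}
```

With a three-level tree, autoanalyze of a leaf bumps both its parent and the root, while a manual ANALYZE of the intermediate table bumps only the root, because the intermediate table is being analyzed in the same command anyway.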

I attach the latest patch to this email.
Could you check it again please?

+		/*
+		 * Get its all ancestors to propagate changes_since_analyze count.
+		 * However, when ANALYZE inheritance tree, we get ancestors of
+		 * toprel_oid to avoid needless counting.
+		 */
+		if (!OidIsValid(toprel_oid))
+			ancestors = get_partition_ancestors(RelationGetRelid(rel));
+		else
+			ancestors = get_partition_ancestors(toprel_oid);

This comment explains what the code does, not what it intends.

The reason for the difference is that if we have a valid toprel_oid,
we analyze all descendants of the relation this time, and if we
propagate the number to the descendants of the top relation, the next
analyze on the relations could happen sooner than expected.

-	msg.m_live_tuples = livetuples;
-	msg.m_dead_tuples = deadtuples;
+	msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0  /* if this is a partitioned table, skip modifying */
+		: livetuples;
+	msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		? 0 /* if this is a partitioned table, skip modifying */
+		: deadtuples;

Two successive branches on the same condition look odd. And we
need an explanation of *why* we don't set the values for partitioned
tables.
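
As an illustration of the point above, the two ternaries can be folded into one branch that also carries the explanation. This is only a sketch, not the patch's code: `TupleCounts` and `counts_to_report` are hypothetical names, and the stated reason (a partitioned table holds no rows of its own) is an assumption about the intent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two counters carried by the ANALYZE message. */
typedef struct TupleCounts
{
	long		live_tuples;
	long		dead_tuples;
} TupleCounts;

/*
 * A partitioned table holds no rows of its own, so both counters are
 * reported as zero.  Any other relation reports its estimates, clamped
 * so that they never go negative.
 */
TupleCounts
counts_to_report(bool is_partitioned, long livetuples, long deadtuples)
{
	TupleCounts counts = {0, 0};

	if (!is_partitioned)
	{
		counts.live_tuples = (livetuples > 0) ? livetuples : 0;
		counts.dead_tuples = (deadtuples > 0) ? deadtuples : 0;
	}

	assert(counts.live_tuples >= 0 && counts.dead_tuples >= 0);
	return counts;
}
```

Testing the relkind once keeps the reason for the special case in a single comment next to the single branch.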

+		foreach(lc, ancestors)
+		{
+			Oid     parentOid = lfirst_oid(lc);
+			Relation parentrel;
+
+			parentrel = table_open(parentOid, AccessShareLock);

I'm not sure, but not every ancestor is a parent (in other words,
a parent of my parent is not my parent). Isn't just rel sufficient?

-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,

This still rejects traditional inheritance non-leaf relations. But if
we remove the description about that completely from the comment above,
we should support traditional inheritance parents here. I think we
can do that as long as we need to propagate only per-tuple stats (that
means not per-attribute) like changes_since_analyze.

Whichever way we take, do we need a description of the behavior
in the documentation?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#47Justin Pryzby
pryzby@telsasoft.com
In reply to: yuzuko (#45)
Re: Autovacuum on partitioned table (autoanalyze)

On Fri, Oct 23, 2020 at 03:12:51PM +0900, yuzuko wrote:

Hello,

I reconsidered a way based on the v5 patch in line with
Horiguchi-san's comment.

This approach is as follows:
- A partitioned table is checked in relation_needs_vacanalyze() to see whether
it needs analyze, just like a plain table. To do this, we should store the
partitioned table's stats (changes_since_analyze).
- The partitioned table's changes_since_analyze is updated when a leaf
partition is analyzed, by propagating the leaf's changes_since_analyze.
At the next scheduled analyze time, it is used in the above check.
That is, the partitioned table is analyzed one cycle behind its leaf partitions.
- The propagation process differs between autoanalyze and plain analyze.
In autoanalyze, a leaf partition's changes_since_analyze is propagated
to *all* ancestors, whereas in plain analyze on an inheritance tree it is
propagated only to ancestors not included in the tree, to avoid needless counting.

+                * Get its all ancestors to propagate changes_since_analyze count.
+                * However, when ANALYZE inheritance tree, we get ancestors of
+                * toprel_oid to avoid needless counting.

=> I don't understand that comment.

+                       /* Find all members of inheritance set taking AccessShareLock */
+                       children = find_all_inheritors(relid, AccessShareLock, NULL);

=> Do you know that this returns the table itself? And in pg14dev, each
partitioned table has reltuples = -1, not zero...

+                               /* Skip foreign partitions */
+                               if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+                                       continue;

=> Michael's suggestion is to use RELKIND_HAS_STORAGE to skip both foreign and
partitioned tables.

Also, you called SearchSysCacheCopy1, but didn't free the tuple. I don't think
you need to copy it anyway - just call ReleaseSysCache().
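
To make the review points above concrete, here is a standalone sketch of the summation with those concerns applied, followed by the usual analyze threshold test (base threshold plus scale factor times reltuples). It is a model only, not backend code: `ChildInfo`, `kind_has_storage`, `sum_child_reltuples` and `needs_analyze` are hypothetical names, and `kind_has_storage` is a simplified stand-in for RELKIND_HAS_STORAGE that covers only plain tables, TOAST tables and materialized views.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a pg_class row. */
typedef struct ChildInfo
{
	char		relkind;	/* 'r' plain, 'p' partitioned, 'f' foreign, ... */
	double		reltuples;	/* negative means "not yet known" */
} ChildInfo;

/* Simplified stand-in for RELKIND_HAS_STORAGE (assumption, see lead-in). */
bool
kind_has_storage(char relkind)
{
	return relkind == 'r' || relkind == 't' || relkind == 'm';
}

/*
 * Sum reltuples over an inheritance set.  The set may include the
 * partitioned parent itself and foreign partitions; both are skipped
 * because they have no local storage.  Negative (unknown) values are
 * skipped too, so they cannot drag the total down.
 */
double
sum_child_reltuples(const ChildInfo *children, size_t nchildren)
{
	double		total = 0.0;
	size_t		i;

	assert(children != NULL || nchildren == 0);

	for (i = 0; i < nchildren; i++)
	{
		if (!kind_has_storage(children[i].relkind))
			continue;
		if (children[i].reltuples < 0)
			continue;
		total += children[i].reltuples;
	}
	return total;
}

/* analyze threshold = base threshold + scale factor * number of tuples */
bool
needs_analyze(double changes_since_analyze, double base_threshold,
			  double scale_factor, double reltuples)
{
	double		threshold = base_threshold + scale_factor * reltuples;

	return changes_since_analyze > threshold;
}
```

For example, with children holding 1000 and 250 tuples, a base threshold of 50 and a scale factor of 0.1, the threshold comes out at roughly 175 changed tuples. In the backend the catalog lookup would additionally need ReleaseSysCache() on each fetched tuple, which this model does not represent.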

Regarding the counters in pg_stat_all_tables: maybe some of these should be
null rather than zero? Or else you should make a 0001 patch to fully
implement this view, with all relevant counters, not just n_mod_since_analyze,
last_*analyze, and *analyze_count. These are specifically misleading:

last_vacuum |
last_autovacuum |
n_ins_since_vacuum | 0
vacuum_count | 0
autovacuum_count | 0

--
Justin

#48yuzuko
yuzukohosoya@gmail.com
In reply to: Kyotaro Horiguchi (#46)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Horiguchi-san,

Thank you for your comments.

On Fri, Oct 23, 2020 at 8:23 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Thank you for the new version.

At Fri, 23 Oct 2020 15:12:51 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in

Hello,

I reconsidered a way based on the v5 patch in line with
Horiguchi-san's comment.

This approach is as follows:
- A partitioned table is checked in relation_needs_vacanalyze() to see whether
it needs analyze, just like a plain table. To do this, we should store the
partitioned table's stats (changes_since_analyze).
- The partitioned table's changes_since_analyze is updated when a leaf
partition is analyzed, by propagating the leaf's changes_since_analyze.
At the next scheduled analyze time, it is used in the above check.
That is, the partitioned table is analyzed one cycle behind its leaf partitions.
- The propagation process differs between autoanalyze and plain analyze.
In autoanalyze, a leaf partition's changes_since_analyze is propagated
to *all* ancestors, whereas in plain analyze on an inheritance tree it is
propagated only to ancestors not included in the tree, to avoid needless counting.

I attach the latest patch to this email.
Could you check it again please?

+               /*
+                * Get its all ancestors to propagate changes_since_analyze count.
+                * However, when ANALYZE inheritance tree, we get ancestors of
+                * toprel_oid to avoid needless counting.
+                */
+               if (!OidIsValid(toprel_oid))
+                       ancestors = get_partition_ancestors(RelationGetRelid(rel));
+               else
+                       ancestors = get_partition_ancestors(toprel_oid);

This comment explains what the code does, not what it intends.

The reason for the difference is that if we have a valid toprel_oid,
we analyze all descendants of the relation this time, and if we
propagate the number to the descendants of the top relation, the next
analyze on the relations could happen sooner than expected.

I modified this comment according to your advice.

-       msg.m_live_tuples = livetuples;
-       msg.m_dead_tuples = deadtuples;
+       msg.m_live_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+               ? 0  /* if this is a partitioned table, skip modifying */
+               : livetuples;
+       msg.m_dead_tuples = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+               ? 0 /* if this is a partitioned table, skip modifying */
+               : deadtuples;

Two successive branches on the same condition look odd. And we
need an explanation of *why* we don't set the values for partitioned
tables.

I moved this part into the previous block, where livetuples and deadtuples are set.
I think the reason those counters are set to 0 when the given
relation is a partitioned table is that such a table doesn't have any data.
As for the changes_since_analyze counter, we should support it as an exception
in order to check whether partitioned tables need auto analyze.
I added this description to the comment of this function.

+               foreach(lc, ancestors)
+               {
+                       Oid     parentOid = lfirst_oid(lc);
+                       Relation parentrel;
+
+                       parentrel = table_open(parentOid, AccessShareLock);

I'm not sure, but not every ancestor is a parent (in other words,
a parent of my parent is not my parent). Isn't just rel sufficient?

I changed 'parentrel' to 'rel'.

-        * Report ANALYZE to the stats collector, too.  However, if doing
-        * inherited stats we shouldn't report, because the stats collector only
-        * tracks per-table stats.  Reset the changes_since_analyze counter only
-        * if we analyzed all columns; otherwise, there is still work for
-        * auto-analyze to do.
+        * Report ANALYZE to the stats collector, too.  Reset the
+        * changes_since_analyze counter only if we analyzed all columns;
+        * otherwise, there is still work for auto-analyze to do.
*/
-       if (!inh)
+       if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
pgstat_report_analyze(onerel, totalrows, totaldeadrows,

This still rejects traditional inheritance non-leaf relations. But if
we remove the description about that completely from the comment above,
we should support traditional inheritance parents here. I think we
can do that as long as we need to propagate only per-tuple stats (that
means not per-attribute) like changes_since_analyze.

Regarding manual ANALYZE, not auto ANALYZE: when analyzing a declaratively
partitioned table, all children are also analyzed at the same time. However,
in the case of traditional inheritance, we need to run that command on
each child table individually. That is, they are not analyzed all together
by ANALYZE. So I tried to support auto analyze for declarative
partitioning only, for now.
I added a note to that comment saying that we only support declarative partitioning.

Whichever way we take, do we need a description of the behavior
in the documentation?

I added a description of this patch's behavior to the documentation.

I attach the latest patch to this email.
Could you please check it again?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v11_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..50b55f5d01 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -817,6 +817,18 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
    </para>
 
    <para>
+    For declaratively partitioned tables, only analyze is supported.
+    The same <quote>analyze threshold</quote> defined above is used,
+    but the number of tuples is sum of their childrens'
+    <structname>pg_class</structname>.<structfield>reltuples</structfield>.
+    Also, total number of tuples inserted, updated, or deleted since the last
+    <command>ANALYZE</command> compared with the threshold is calculated by adding up
+    childrens' number of tuples analyzed in the previous <command>ANALYZE</command>.
+    This is because partitioned tables don't have any data.  So analyze on partitioned
+    tables are one lap behind their children.
+   </para>
+
+   <para>
     Temporary tables cannot be accessed by autovacuum.  Therefore,
     appropriate vacuum and analyze operations should be performed via
     session SQL commands.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index bc59a2d77d..d94caa4b7e 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1323,8 +1323,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2e4aa1c4b6..f1982d0f77 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -591,7 +591,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -611,7 +611,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..44ff01adf5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -88,7 +88,7 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel,
 						   VacuumParams *params, List *va_cols,
 						   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
-						   bool inh, bool in_outer_xact, int elevel);
+						   bool inh, Oid toprel_oid, bool in_outer_xact, int elevel);
 static void compute_index_stats(Relation onerel, double totalrows,
 								AnlIndexData *indexdata, int nindexes,
 								HeapTuple *rows, int numrows,
@@ -117,7 +117,8 @@ static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
  */
 void
 analyze_rel(Oid relid, RangeVar *relation,
-			VacuumParams *params, List *va_cols, bool in_outer_xact,
+			VacuumParams *params, List *va_cols,
+			Oid toprel_oid, bool in_outer_xact,
 			BufferAccessStrategy bstrategy)
 {
 	Relation	onerel;
@@ -258,14 +259,14 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 */
 	if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
 		do_analyze_rel(onerel, params, va_cols, acquirefunc,
-					   relpages, false, in_outer_xact, elevel);
+					   relpages, false, toprel_oid, in_outer_xact, elevel);
 
 	/*
 	 * If there are child tables, do recursive ANALYZE.
 	 */
 	if (onerel->rd_rel->relhassubclass)
 		do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
-					   true, in_outer_xact, elevel);
+					   true, toprel_oid, in_outer_xact, elevel);
 
 	/*
 	 * Close source relation now, but keep lock so that no one deletes it
@@ -288,8 +289,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 static void
 do_analyze_rel(Relation onerel, VacuumParams *params,
 			   List *va_cols, AcquireSampleRowsFunc acquirefunc,
-			   BlockNumber relpages, bool inh, bool in_outer_xact,
-			   int elevel)
+			   BlockNumber relpages, bool inh, Oid toprel_oid,
+			   bool in_outer_xact, int elevel)
 {
 	int			attr_cnt,
 				tcnt,
@@ -644,15 +645,14 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Regarding inherited stats,
+	 * we report only in the case of declarative partitioning.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
+							  (va_cols == NIL), toprel_oid);
 
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b6717f727..f5770afa9a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -464,7 +464,7 @@ vacuum(List *relations, VacuumParams *params,
 				}
 
 				analyze_rel(vrel->oid, vrel->relation, params,
-							vrel->va_cols, in_outer_xact, vac_strategy);
+							vrel->va_cols, vrel->toprel_oid, in_outer_xact, vac_strategy);
 
 				if (use_own_xacts)
 				{
@@ -791,7 +791,7 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 			oldcontext = MemoryContextSwitchTo(vac_context);
 			vacrels = lappend(vacrels, makeVacuumRelation(vrel->relation,
 														  relid,
-														  vrel->va_cols));
+														  vrel->va_cols, relid));
 			MemoryContextSwitchTo(oldcontext);
 		}
 
@@ -828,7 +828,9 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				oldcontext = MemoryContextSwitchTo(vac_context);
 				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 															  part_oid,
-															  vrel->va_cols));
+															  vrel->va_cols,
+															  relid));
+
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
@@ -894,7 +896,8 @@ get_all_vacuum_rels(int options)
 		oldcontext = MemoryContextSwitchTo(vac_context);
 		vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 													  relid,
-													  NIL));
+													  NIL,
+													  InvalidOid));
 		MemoryContextSwitchTo(oldcontext);
 	}
 
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index ee033ae779..770a83c0ae 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -806,12 +806,13 @@ makeGroupingSet(GroupingSetKind kind, List *content, int location)
  *	  create a VacuumRelation node
  */
 VacuumRelation *
-makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols)
+makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols, Oid toprel_oid)
 {
 	VacuumRelation *v = makeNode(VacuumRelation);
 
 	v->relation = relation;
 	v->oid = oid;
 	v->va_cols = va_cols;
+	v->toprel_oid = toprel_oid;
 	return v;
 }
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 95e256883b..b038a25ecc 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10568,7 +10568,7 @@ opt_name_list:
 vacuum_relation:
 			qualified_name opt_name_list
 				{
-					$$ = (Node *) makeVacuumRelation($1, InvalidOid, $2);
+					$$ = (Node *) makeVacuumRelation($1, InvalidOid, $2, InvalidOid);
 				}
 		;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..9b46ad20e4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2052,11 +2053,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2079,7 +2080,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2742,6 +2744,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3091,7 +3094,41 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up childrens'
+			 * reltuples.
+			 */
+			List     *children;
+			ListCell *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (!RELKIND_HAS_STORAGE(childclass->relkind))
+					continue;
+
+				/* Sum up the child's reltuples for its parent table */
+				reltuples += childclass->reltuples;
+				ReleaseSysCache(childtuple);
+			}
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
@@ -3155,7 +3192,7 @@ autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy)
 
 	/* Set up one VacuumRelation target, identified by OID, for vacuum() */
 	rangevar = makeRangeVar(tab->at_nspname, tab->at_relname, -1);
-	rel = makeVacuumRelation(rangevar, tab->at_relid, NIL);
+	rel = makeVacuumRelation(rangevar, tab->at_relid, NIL, InvalidOid);
 	rel_list = list_make1(rel);
 
 	vacuum(rel_list, &tab->at_params, bstrategy, true);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f1dca2f25b..4c2571d2e2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -360,6 +361,7 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1556,12 +1558,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * Exceptional support only changes_since_analyze for partitioned tables,
+ * though they don't have any data.  This counter will tell us whether
+ * partitioned tables need autoanalyze or not.
  * --------
  */
 void
 pgstat_report_analyze(Relation rel,
 					  PgStat_Counter livetuples, PgStat_Counter deadtuples,
-					  bool resetcounter)
+					  bool resetcounter, Oid toprel_oid)
 {
 	PgStat_MsgAnalyze msg;
 
@@ -1576,22 +1581,72 @@ pgstat_report_analyze(Relation rel,
 	 * off these counts from what we send to the collector now, else they'll
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
-	 * committing.)
+	 * committing.)  However, for partitioned tables, we will not report both
+	 * livetuples and deadtuples because those tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
+		}
+	}
+
+	/*
+	 * If this rel is a leaf partition, add its current changes_since_analyze
+	 * into its ancestors' counts.  This must be done before sending the ANALYZE
+	 * message as it resets the partition's changes_since_analyze counter.
+	 */
+	if (rel->rd_rel->relispartition &&
+		!(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+	{
+		List     *ancestors;
+		ListCell *lc;
+		PgStat_StatDBEntry  *dbentry;
+		PgStat_StatTabEntry *tabentry;
+
+		/* Fetch the pgstat for this table */
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+		tabentry = pgstat_get_tab_entry(dbentry, RelationGetRelid(rel), true);
+
+		/*
+		 * Get its all ancestors to propagate changes_since_analyze count.
+		 * However, when we have a valid toprel_oid, that is ANALYZE inheritance
+		 * tree, if we propagate the number to all ancestors, the next analyze
+		 * on partitioned tables in the tree could happen shortly expected.
+		 * So we get ancestors of toprel_oid which are not analyzed this time.
+		 */
+		if (!OidIsValid(toprel_oid))
+			ancestors = get_partition_ancestors(RelationGetRelid(rel));
+		else
+			ancestors = get_partition_ancestors(toprel_oid);
+
+		foreach(lc, ancestors)
+		{
+			Oid     reloid = lfirst_oid(lc);
+			Relation rel;
+
+			rel = table_open(reloid, AccessShareLock);
+
+			/* Report changes_since_analyze to the stats collector */
+			pgstat_report_partchanges(rel, tabentry->changes_since_analyze);
+
+			table_close(rel, AccessShareLock);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
 	}
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
@@ -1606,6 +1661,30 @@ pgstat_report_analyze(Relation rel,
 }
 
 /* --------
+ * pgstat_report_partchanges() -
+ *
+ *
+ *  Called when a leaf partition is analyzed to tell the collector about
+ *  its parent's changed_tuples.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges  msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1921,7 +2000,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -4833,6 +4913,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6701,6 +6785,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry   *dbentry;
+	PgStat_StatTabEntry  *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..ae09ab6e6a 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -281,8 +281,8 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 
 /* in commands/analyze.c */
 extern void analyze_rel(Oid relid, RangeVar *relation,
-						VacuumParams *params, List *va_cols, bool in_outer_xact,
-						BufferAccessStrategy bstrategy);
+						VacuumParams *params, List *va_cols, Oid top_parent,
+						bool in_outer_xact, BufferAccessStrategy bstrategy);
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 7ebd794713..7f1e647596 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -104,6 +104,6 @@ extern DefElem *makeDefElemExtended(char *nameSpace, char *name, Node *arg,
 
 extern GroupingSet *makeGroupingSet(GroupingSetKind kind, List *content, int location);
 
-extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols);
+extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols, Oid toprel_oid);
 
 #endif							/* MAKEFUNC_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7ef9b0eac0..4b29e8c012 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3238,6 +3238,7 @@ typedef struct VacuumRelation
 	RangeVar   *relation;		/* table name to process, or NULL */
 	Oid			oid;			/* table's OID; InvalidOid if not looked up */
 	List	   *va_cols;		/* list of column names, or NIL for all */
+	Oid         toprel_oid;     /* top level table's OID if ANALYZE inheritance tree */
 } VacuumRelation;
 
 /* ----------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..31276ac1bc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -60,6 +60,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -419,6 +420,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges			Sent by the autovacuum deamon
+ *                                  after ANALYZE of leaf partitions
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid           m_databaseid;
+	Oid           m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -640,6 +653,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -1386,8 +1400,8 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 								 PgStat_Counter livetuples, PgStat_Counter deadtuples);
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
-								  bool resetcounter);
-
+								  bool resetcounter, Oid top_parent);
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 097ff5d111..e19e510245 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1804,7 +1804,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2172,7 +2172,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#49yuzuko
yuzukohosoya@gmail.com
In reply to: Justin Pryzby (#47)
Re: Autovacuum on partitioned table (autoanalyze)

Hi Justin,

Thank you for your comments.
I attached the latest patch (v11) to the previous email.

+                * Get its all ancestors to propagate changes_since_analyze count.
+                * However, when ANALYZE inheritance tree, we get ancestors of
+                * toprel_oid to avoid needless counting.

=> I don't understand that comment.

I fixed that comment.

+                       /* Find all members of inheritance set taking AccessShareLock */
+                       children = find_all_inheritors(relid, AccessShareLock, NULL);

=> Do you know that it returns the table itself? And in pg14dev, each
partitioned table has reltuples = -1, not zero...
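
To illustrate why the -1 matters, here is a minimal, self-contained sketch of the threshold arithmetic (analyze threshold = base threshold + scale factor * number of tuples). It is not code from the patch: the function name analyze_threshold() is hypothetical, and clamping a negative reltuples to zero is an assumption of this sketch about how a "never analyzed" marker should be treated.

```c
#include <assert.h>

/*
 * Hypothetical helper, not a PostgreSQL function.
 *
 * Computes "base threshold + scale factor * number of tuples".
 * A negative reltuples is taken to mean "never vacuumed or analyzed"
 * and is clamped to zero (an assumption of this sketch), so that it
 * cannot pull the threshold below the base value.
 */
double
analyze_threshold(double base_thresh, double scale_factor, double reltuples)
{
	assert(base_thresh >= 0 && scale_factor >= 0);

	if (reltuples < 0)
		reltuples = 0;

	return base_thresh + scale_factor * reltuples;
}
```

Without the clamp, a parent that contributes reltuples = -1 to the sum would silently lower the threshold by one scale factor's worth of a tuple, which is small but wrong.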

+                               /* Skip foreign partitions */
+                               if (childclass->relkind == RELKIND_FOREIGN_TABLE)
+                                       continue;

=> Michael's suggestion is to use RELKIND_HAS_STORAGE to skip both foreign and
partitioned tables.

I overlooked that. Revised that according to your comments.

Also, you called SearchSysCacheCopy1, but didn't free the tuple. I don't think
you need to copy it anyway - just call ReleaseSysCache().

Fixed it.

Regarding the counters in pg_stat_all_tables: maybe some of these should be
null rather than zero? Or else you should make a 0001 patch to fully
implement this view, with all relevant counters, not just n_mod_since_analyze,
last_*analyze, and *analyze_count. These are specifically misleading:

last_vacuum |
last_autovacuum |
n_ins_since_vacuum | 0
vacuum_count | 0
autovacuum_count | 0

I haven't modified this part yet, but do you mean that we should set the
vacuum-related counters to null because partitioned tables are not vacuumed?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#50Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: yuzuko (#49)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

At Thu, 5 Nov 2020 16:03:12 +0900, yuzuko <yuzukohosoya@gmail.com> wrote in

Hi Justin,

Thank you for your comments.
I attached the latest patch (v11) to the previous email.

+                * Get its all ancestors to propagate changes_since_analyze count.
+                * However, when ANALYZE inheritance tree, we get ancestors of
+                * toprel_oid to avoid needless counting.

=> I don't understand that comment.

I fixed that comment.

+		 * Get its all ancestors to propagate changes_since_analyze count.
+		 * However, when we have a valid toprel_oid, that is ANALYZE inheritance
+		 * tree, if we propagate the number to all ancestors, the next analyze
+		 * on partitioned tables in the tree could happen shortly expected.
+		 * So we get ancestors of toprel_oid which are not analyzed this time.

On second thought about the reason for "toprel_oid": it is perhaps
there to avoid "wrongly" propagating values to ancestors after a manual
ANALYZE on a partitioned table. But the same thing happens during an
autoanalyze iteration if some of the ancestors of a leaf relation are
analyzed before the leaf relation in that iteration. That can trigger
an unnecessary analyze of some of the ancestors. So we need to do a
similar thing for autovacuum. However...

[1(root):analyze]-[2:DONT analyze]-[3:analyze]-[leaf]

In this case toprel_oid is invalid (since it's autoanalyze), but we
should avoid propagating the count to 1 and 3 if they are processed
*before* the leaf, and should still propagate it to 2. toprel_oid
doesn't work in that case.

So, to propagate the count properly, we need to analyze relations in
leaf-to-root order, or propagate the counter only to ancestors that
haven't been processed in the current iteration. It seems a bit too
complex to sort the relations to analyze in that order. The latter would
be relatively simple. See the attached patch for what it looks like.
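
As a rough, self-contained illustration of the second option (this is a sketch, not the attached patch, which tracks analyzed OIDs in a simplehash table): the RelCounter struct and propagate_to_unprocessed() below are hypothetical names, and a per-relation flag stands in for the "already analyzed in this iteration" lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not PostgreSQL's real types. */
typedef unsigned int Oid;

typedef struct RelCounter
{
	Oid		oid;					/* relation OID */
	long	changes_since_analyze;	/* pending change count */
	bool	analyzed_this_round;	/* already analyzed in this iteration? */
} RelCounter;

/*
 * Add "delta" (the leaf's changes_since_analyze) to every ancestor that
 * has NOT been analyzed in the current iteration.  Ancestors that were
 * already analyzed have fresh stats, so bumping their counters would
 * only trigger an unnecessary analyze later.
 *
 * Returns the number of ancestors whose counter was updated.
 */
int
propagate_to_unprocessed(RelCounter *ancestors, size_t nancestors, long delta)
{
	int		updated = 0;

	assert(delta >= 0);

	for (size_t i = 0; i < nancestors; i++)
	{
		if (ancestors[i].analyzed_this_round)
			continue;

		ancestors[i].changes_since_analyze += delta;
		updated++;
	}

	return updated;
}
```

With the chain from the example above (1 and 3 analyzed, 2 not), only relation 2 receives the leaf's count.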

Anyway, whichever way we take, it is not pgstat.c's responsibility to
do that, since the former relies heavily on what analyze does, and the
latter needs to know what analyze is doing.

Also, you called SearchSysCacheCopy1, but didn't free the tuple. I don't think
you need to copy it anyway - just call ReleaseSysCache().

Fixed it.

Mmm. Unfortunately, that fix leaks a cache reference when
!RELKIND_HAS_STORAGE (the loop hits "continue" before
ReleaseSysCache() is called).
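
A minimal sketch of a leak-free loop shape, using a mock reference-counted cache so that it compiles on its own. Everything here (CacheEntry, cache_acquire(), cache_release(), has_storage()) is a hypothetical stand-in for the syscache calls and the RELKIND_HAS_STORAGE macro, and the relkind check is deliberately simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock of a reference-counted catalog cache entry. */
typedef struct CacheEntry
{
	char	relkind;	/* 'r' table, 'p' partitioned table, 'f' foreign */
	double	reltuples;
	int		refcount;	/* outstanding references to this entry */
} CacheEntry;

/* Stand-in for SearchSysCache1(): takes a reference. */
CacheEntry *
cache_acquire(CacheEntry *entry)
{
	entry->refcount++;
	return entry;
}

/* Stand-in for ReleaseSysCache(): drops a reference. */
void
cache_release(CacheEntry *entry)
{
	assert(entry->refcount > 0);
	entry->refcount--;
}

/* Simplified stand-in for RELKIND_HAS_STORAGE (assumption of this sketch). */
bool
has_storage(char relkind)
{
	return relkind == 'r' || relkind == 't' || relkind == 'm';
}

/*
 * Sum reltuples over children that have storage.  The release is placed
 * so that it runs on every path through the loop body, including the
 * one that skips partitioned and foreign tables.
 */
double
sum_child_reltuples(CacheEntry *children, size_t nchildren)
{
	double	total = 0;

	for (size_t i = 0; i < nchildren; i++)
	{
		CacheEntry *entry = cache_acquire(&children[i]);

		if (has_storage(entry->relkind))
			total += entry->reltuples;

		cache_release(entry);
	}

	return total;
}
```

Folding the skip into an "if" rather than a "continue" is one way to make sure that no path bypasses the release; calling the release before "continue" would work equally well.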

Regarding the counters in pg_stat_all_tables: maybe some of these should be
null rather than zero? Or else you should make a 0001 patch to fully
implement this view, with all relevant counters, not just n_mod_since_analyze,
last_*analyze, and *analyze_count. These are specifically misleading:

last_vacuum |
last_autovacuum |
n_ins_since_vacuum | 0
vacuum_count | 0
autovacuum_count | 0

I haven't modified this part yet, but do you mean that we should set the
vacuum-related counters to null because partitioned tables are not vacuumed?

Perhaps because partitioned tables *cannot* be vacuumed. I'm not sure
what the best way is here. Showing null seems reasonable, but I'm not
sure that it doesn't break anything.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v11_autovacuum_on_partitioned_table_mod.patchtext/x-patch; charset=us-asciiDownload
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..50b55f5d01 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -816,6 +816,18 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
     since the last <command>ANALYZE</command>.
    </para>
 
+   <para>
+    For declaratively partitioned tables, only analyze is supported.
+    The same <quote>analyze threshold</quote> defined above is used,
+    but the number of tuples is sum of their childrens'
+    <structname>pg_class</structname>.<structfield>reltuples</structfield>.
+    Also, total number of tuples inserted, updated, or deleted since the last
+    <command>ANALYZE</command> compared with the threshold is calculated by adding up
+    childrens' number of tuples analyzed in the previous <command>ANALYZE</command>.
+    This is because partitioned tables don't have any data.  So analyze on partitioned
+    tables are one lap behind their children.
+   </para>
+
    <para>
     Temporary tables cannot be accessed by autovacuum.  Therefore,
     appropriate vacuum and analyze operations should be performed via
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index bc59a2d77d..d94caa4b7e 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1323,8 +1323,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2e4aa1c4b6..f1982d0f77 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -591,7 +591,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -611,7 +611,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..8758a24955 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -30,6 +30,7 @@
 #include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "catalog/indexing.h"
+#include "catalog/partition.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_namespace.h"
@@ -38,6 +39,7 @@
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
+#include "common/hashfn.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
@@ -107,6 +109,45 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+typedef struct analyze_oident
+{
+	Oid oid;
+	char status;
+} analyze_oident;
+
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX analyze_oids
+#define SH_ELEMENT_TYPE analyze_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+#define ANALYZED_OIDS_HASH_SIZE 128
+analyze_oids_hash *analyzed_reloids = NULL;
+
+void
+analyze_init_status(void)
+{
+	if (analyzed_reloids)
+		analyze_oids_destroy(analyzed_reloids);
+
+	analyzed_reloids = analyze_oids_create(CurrentMemoryContext,
+										   ANALYZED_OIDS_HASH_SIZE, NULL);
+}
+
+void
+analyze_destroy_status(void)
+{
+	if (analyzed_reloids)
+		analyze_oids_destroy(analyzed_reloids);
+
+	analyzed_reloids = NULL;
+}
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -312,6 +353,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	bool		found;
 
 	if (inh)
 		ereport(elevel,
@@ -644,16 +686,67 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Regarding inherited stats,
+	 * we report only in the case of declarative partitioning.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *ancestors;
+		ListCell *lc;
+		Datum	oiddatum = ObjectIdGetDatum(RelationGetRelid(onerel));
+		Datum	countdatum;
+		int64	change_count;
+
+		/*
+		 * Read current value of n_mod_since_analyze of this relation.  This
+		 * might be a bit stale but we don't need such correctness here.
+		 */
+		countdatum =
+			DirectFunctionCall1(pg_stat_get_mod_since_analyze, oiddatum);
+		change_count = DatumGetInt64(countdatum);
+
+		/* collect all ancestors of this relation */
+		ancestors = get_partition_ancestors(RelationGetRelid(onerel));
+
+		/*
+		 * To let partitioned relations be analyzed, we need to update
+		 * changes_since_analyze also for partitioned relations, which don't
+		 * have storage.  We move the count of leaf-relations to ancestors
+		 * before resetting.  We could instead bump up the counter of all
+		 * ancestors every time leaf relations are updated but that is too
+		 * complex.
+		 */
+		foreach (lc, ancestors)
+		{
+			Oid toreloid = lfirst_oid(lc);
+
+			/*
+			 * Don't propagate the count to ancestors that have already been
+			 * analyzed in this analyze command or this iteration of
+			 * autoanalyze.
+			 */
+			if (analyze_oids_lookup(analyzed_reloids, toreloid) == NULL)
+			{
+				Relation rel;
+
+				rel = table_open(toreloid, AccessShareLock);
+				pgstat_report_partchanges(rel, change_count);
+				table_close(rel, AccessShareLock);
+			}
+
+		}
+
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
 
+		list_free(ancestors);
+	}
+
+	/* Record this relation as "analyzed" */
+	analyze_oids_insert(analyzed_reloids, onerel->rd_id, &found);
+
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
 	{
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b6717f727..9cc7c1bb4f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -437,6 +437,7 @@ vacuum(List *relations, VacuumParams *params,
 		VacuumSharedCostBalance = NULL;
 		VacuumActiveNWorkers = NULL;
 
+		analyze_init_status();
 		/*
 		 * Loop to process each selected relation.
 		 */
@@ -487,6 +488,7 @@ vacuum(List *relations, VacuumParams *params,
 	{
 		in_vacuum = false;
 		VacuumCostActive = false;
+		analyze_destroy_status();
 	}
 	PG_END_TRY();
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..bb568f68b5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2052,11 +2053,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2079,7 +2080,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2742,6 +2744,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3091,7 +3094,42 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up children's
+			 * reltuples.
+			 */
+			List     *children;
+			ListCell *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (RELKIND_HAS_STORAGE(childclass->relkind))
+				{
+					/* Sum up the child's reltuples for its parent table */
+					reltuples += childclass->reltuples;
+				}
+
+				ReleaseSysCache(childtuple);
+			}
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b..633c5743fb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -360,6 +360,7 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1556,6 +1557,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * As an exception, partitioned tables support only changes_since_analyze,
+ * even though they don't have any data.  This counter tells us whether
+ * partitioned tables need autoanalyze or not.
  * --------
  */
 void
@@ -1576,22 +1580,29 @@ pgstat_report_analyze(Relation rel,
 	 * off these counts from what we send to the collector now, else they'll
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
-	 * committing.)
+	 * committing.)  However, for partitioned tables, we will not report both
+	 * livetuples and deadtuples because those tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
 	}
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
@@ -1605,6 +1616,30 @@ pgstat_report_analyze(Relation rel,
 	pgstat_send(&msg, sizeof(msg));
 }
 
+/* --------
+ * pgstat_report_partchanges() -
+ *
+ *
+ *  Called when a leaf partition is analyzed to tell the collector about
+ *  its parent's changed_tuples.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges  msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* --------
  * pgstat_report_recovery_conflict() -
  *
@@ -1921,7 +1956,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -4833,6 +4869,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6701,6 +6741,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry   *dbentry;
+	PgStat_StatTabEntry  *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..c6dcf23898 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -280,6 +280,8 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 									 int options, bool verbose, LOCKMODE lmode);
 
 /* in commands/analyze.c */
+extern void analyze_init_status(void);
+extern void analyze_destroy_status(void);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..dd7faf9861 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -60,6 +60,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -419,6 +420,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges			Sent by the autovacuum daemon
+ *                                  after ANALYZE of leaf partitions
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid           m_databaseid;
+	Oid           m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -640,6 +653,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -1387,7 +1401,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 097ff5d111..e19e510245 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1804,7 +1804,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2172,7 +2172,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#51Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Kyotaro Horiguchi (#50)
Re: Autovacuum on partitioned table (autoanalyze)

I looked at both Yuzuko Hosoya's patch and Kyotaro Horiguchi's, and
think we're doing things in a quite complicated manner, which perhaps
could be done more easily.

Hosoya's patch has pgstat_report_analyze call pgstat_get_tab_entry() for
the table being vacuumed, then obtains the list of ancestors, and then
sends for each ancestor a new message containing the partition's
changes_since_analyze for that ancestor. When the stats collector receives
that message, it adds the number to the ancestor's m_changed_tuples.

Horiguchi's doing a similar thing, only differently: it is do_analyze_rel
that reads the count from the collector (this time by calling SQL
function pg_stat_get_mod_since_analyze) and then sends the number back to
the collector for each ancestor.

I suggest that a better way to do this is to forget about the new
"partchanges" message completely. Instead, let's add an array of
ancestors to the analyze message (borrowing from PgStat_MsgFuncstat).
Something like this:

#define PGSTAT_NUM_ANCESTORENTRIES \
    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(Oid) - sizeof(bool) - \
      sizeof(bool) - sizeof(TimestampTz) - sizeof(PgStat_Counter) - \
      sizeof(PgStat_Counter) - sizeof(int)) / sizeof(Oid))
typedef struct PgStat_MsgAnalyze
{
    PgStat_MsgHdr   m_hdr;
    Oid             m_databaseid;
    Oid             m_tableoid;
    bool            m_autovacuum;
    bool            m_resetcounter;
    TimestampTz     m_analyzetime;
    PgStat_Counter  m_live_tuples;
    PgStat_Counter  m_dead_tuples;
    int             m_nancestors;
    Oid             m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
} PgStat_MsgAnalyze;

For non-partitions, m_nancestors would be 0, so the message would be
handled as today. For partitions, the array carries the OID of all
ancestors. When the collector receives this message, first it looks up
the pgstat entries for each ancestor in the array, and it adds the
partition's current changes_since_analyze to the ancestor's
changes_since_analyze. Then it does things as currently, including
resetting the changes_since_analyze counter for the partition.

Key point in this is that we don't need to read the number from
collector into the backend executing analyze. We just *send* the data
about ancestors, and the collector knows what to do with it.
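
A minimal sketch of the collector-side handling described above, using toy stand-ins rather than the real pgstat structures. ToyTabEntry, ToyAnalyzeMsg, toy_get_tab_entry and toy_recv_analyze are hypothetical names invented for illustration, and the fixed-size arrays are an assumption made to keep the example self-contained. Only the order of operations (add the partition's counter to each listed ancestor, then reset the partition's counter) comes from the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;               /* stand-in for PostgreSQL's Oid */

#define TOY_MAX_TABLES    16        /* hypothetical size of the toy stats table */
#define TOY_MAX_ANCESTORS 8         /* hypothetical cap, far below the real limit */

/* Toy per-table entry; the real collector entry has many more fields. */
typedef struct ToyTabEntry
{
	Oid      tableoid;
	int64_t  changes_since_analyze;
} ToyTabEntry;

/* Toy analyze message that carries the OIDs of the analyzed table's ancestors. */
typedef struct ToyAnalyzeMsg
{
	Oid  tableoid;
	int  resetcounter;
	int  nancestors;                /* 0 for non-partitions */
	Oid  ancestors[TOY_MAX_ANCESTORS];
} ToyAnalyzeMsg;

static ToyTabEntry toy_stats[TOY_MAX_TABLES];
static int         toy_nstats = 0;

/* Look up the entry for an OID, creating a zeroed one if it does not exist. */
static ToyTabEntry *
toy_get_tab_entry(Oid oid)
{
	for (int i = 0; i < toy_nstats; i++)
		if (toy_stats[i].tableoid == oid)
			return &toy_stats[i];

	assert(toy_nstats < TOY_MAX_TABLES);
	toy_stats[toy_nstats].tableoid = oid;
	toy_stats[toy_nstats].changes_since_analyze = 0;
	return &toy_stats[toy_nstats++];
}

/* Handle one analyze message: propagate first, then reset the sender's counter. */
static void
toy_recv_analyze(const ToyAnalyzeMsg *msg)
{
	ToyTabEntry *tab = toy_get_tab_entry(msg->tableoid);

	assert(msg->nancestors >= 0 && msg->nancestors <= TOY_MAX_ANCESTORS);

	/* Add the partition's pending change count to every listed ancestor. */
	for (int i = 0; i < msg->nancestors; i++)
	{
		ToyTabEntry *anc = toy_get_tab_entry(msg->ancestors[i]);

		anc->changes_since_analyze += tab->changes_since_analyze;
	}

	/* Then behave as today: reset the counter of the analyzed table. */
	if (msg->resetcounter)
		tab->changes_since_analyze = 0;
}
```

With a leaf that has 300 pending changes and two ancestors listed in the message, both ancestors end up with 300 and the leaf with 0. The backend never reads the leaf's counter; it only names the ancestors.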

One possible complaint is: what if there are more ancestors than fit in
the message? I propose that this problem can be ignored, since in order
to hit this, you'd need to have (1000-8-4-4-1-1-8-8-8-4)/4 = 238
ancestors (if my math is right). I doubt we'll hit the need to use 238
levels of partitioning before a stat collector rewrite occurs ...

(It is possible to remove that restriction by doing more complicated
things such as sending the list of ancestors in a new type of message
that can be sent several times, prior to the analyze message itself, but
I don't think this is worth the trouble.)
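
The capacity estimate above can be checked with a small helper. This is only a sketch of the arithmetic: toy_ancestor_capacity is a hypothetical name, and the byte counts are the ones quoted in the message, not values taken from sizeof() on a real build, where alignment padding may change the figure slightly.

```c
#include <stddef.h>

/*
 * Number of OIDs that fit in a message of msg_size bytes once fixed_bytes
 * are taken up by the other fields.  Integer division rounds down, which
 * matches how a trailing array in a fixed-size message would be sized.
 */
static size_t
toy_ancestor_capacity(size_t msg_size, size_t fixed_bytes, size_t oid_size)
{
	if (oid_size == 0 || fixed_bytes > msg_size)
		return 0;
	return (msg_size - fixed_bytes) / oid_size;
}
```

Calling toy_ancestor_capacity(1000, 8 + 4 + 4 + 1 + 1 + 8 + 8 + 8 + 4, 4) reproduces the figure in the message: the fixed fields sum to 46 bytes, leaving 954 bytes, and 954 / 4 rounds down to 238.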

#52Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Kyotaro Horiguchi (#50)
Re: Autovacuum on partitioned table (autoanalyze)

On 2020-Nov-10, Kyotaro Horiguchi wrote:

On second thought about the reason for the "toprel_oid": it is perhaps
to avoid "wrongly" propagating values to ancestors after a manual
ANALYZE on a partitioned table. But the same happens after an
autoanalyze iteration if some of the ancestors of a leaf relation are
analyzed before the leaf relation in that iteration. That
can trigger an unnecessary analyze of some of the ancestors.

I'm not sure I understand this point. I think we should only trigger
this on analyzes of *leaf* partitions, not intermediate partitioned
relations. That way you would never get these unnecessary analyzes.
Am I missing something?

(So with my proposal in the previous email, we would send the list of
ancestor relations after analyzing a leaf partition. Whenever we
analyze a non-leaf, then the list of ancestors is sent as an empty
list.)

Regarding the counters in pg_stat_all_tables: maybe some of these should be
null rather than zero? Or else you should make a 0001 patch to fully
implement this view, with all relevant counters, not just n_mod_since_analyze,
last_*analyze, and *analyze_count. These are specifically misleading:

last_vacuum |
last_autovacuum |
n_ins_since_vacuum | 0
vacuum_count | 0
autovacuum_count | 0

I haven't modified this part yet, but did you mean that we should set the
vacuum-related counters to null because partitioned tables are not vacuumed?

Perhaps because partitioned tables *cannot* be vacuumed. I'm not sure
what the best way is here. Showing null seems reasonable but I'm not
sure that doesn't break anything.

I agree that showing NULLs for the vacuum columns is better. Perhaps
the most reasonable way to do this is to use -1 as an indicator that NULL
ought to be returned from pg_stat_get_vacuum_count() et al, and add a
boolean in PgStat_TableCounts next to t_truncated, maybe "t_nullvacuum"
that says to store -1 instead of 0 in pgstat_recv_tabstat.
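
The -1 convention suggested above could look roughly like the following. This is a sketch under stated assumptions, not the actual pgstat code: ToyVacEntry, toy_init_entry and toy_get_vacuum_count are hypothetical names, and the bool out-parameter interface stands in for however the real SQL function would return NULL.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_VACUUM_NOT_APPLICABLE (-1)   /* hypothetical sentinel name */

/* Toy per-table entry holding only the counter of interest. */
typedef struct ToyVacEntry
{
	int64_t vacuum_count;
} ToyVacEntry;

/*
 * Initialize the entry.  When nullvacuum is set (as it would be for a
 * partitioned table), store the sentinel instead of 0.
 */
static void
toy_init_entry(ToyVacEntry *entry, bool nullvacuum)
{
	entry->vacuum_count = nullvacuum ? TOY_VACUUM_NOT_APPLICABLE : 0;
}

/*
 * Read the counter.  Returns false when the caller should report SQL NULL,
 * otherwise returns true and stores the count in *value.
 */
static bool
toy_get_vacuum_count(const ToyVacEntry *entry, int64_t *value)
{
	if (entry->vacuum_count == TOY_VACUUM_NOT_APPLICABLE)
		return false;
	*value = entry->vacuum_count;
	return true;
}
```

The point of the sentinel is that "never vacuumed" (0) and "cannot be vacuumed" (NULL) stay distinguishable without adding a separate flag to every stored entry.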

#53yuzuko
yuzukohosoya@gmail.com
In reply to: Alvaro Herrera (#52)
Re: Autovacuum on partitioned table (autoanalyze)

Hello Alvaro,

Thank you for your comments.

On second thought about the reason for the "toprel_oid": it is perhaps
to avoid "wrongly" propagating values to ancestors after a manual
ANALYZE on a partitioned table. But the same happens after an
autoanalyze iteration if some of the ancestors of a leaf relation are
analyzed before the leaf relation in that iteration. That
can trigger an unnecessary analyze of some of the ancestors.

I'm not sure I understand this point. I think we should only trigger
this on analyzes of *leaf* partitions, not intermediate partitioned
relations. That way you would never get these unnecessary analyzes.
Am I missing something?

(So with my proposal in the previous email, we would send the list of
ancestor relations after analyzing a leaf partition. Whenever we
analyze a non-leaf, then the list of ancestors is sent as an empty
list.)

The problem Horiguchi-san mentioned is as follows:

create table p1 (i int) partition by range(i);
create table p1_1 partition of p1 for values from (0) to (500)
partition by range(i);
create table p1_1_1 partition of p1_1 for values from (0) to (300);
insert into p1 select generate_series(0,299);

-- After auto analyze (first time)
postgres=# select relname, n_mod_since_analyze, last_autoanalyze from
pg_stat_all_tables where relname in ('p1','p1_1','p1_1_1');
 relname | n_mod_since_analyze |       last_autoanalyze
---------+---------------------+-------------------------------
 p1      |                 300 |
 p1_1    |                 300 |
 p1_1_1  |                   0 | 2020-12-02 22:24:18.753574+09
(3 rows)

-- Insert more rows
postgres=# insert into p1 select generate_series(0,199);
postgres=# select relname, n_mod_since_analyze, last_autoanalyze from
pg_stat_all_tables where relname in ('p1','p1_1','p1_1_1');
 relname | n_mod_since_analyze |       last_autoanalyze
---------+---------------------+-------------------------------
 p1      |                 300 |
 p1_1    |                 300 |
 p1_1_1  |                 200 | 2020-12-02 22:24:18.753574+09
(3 rows)

-- After auto analyze (second time)
postgres=# select relname, n_mod_since_analyze, last_autoanalyze from
pg_stat_all_tables where relname in ('p1','p1_1','p1_1_1');
 relname | n_mod_since_analyze |       last_autoanalyze
---------+---------------------+-------------------------------
 p1      |                   0 | 2020-12-02 22:25:18.857248+09
 p1_1    |                 200 | 2020-12-02 22:25:18.661932+09
 p1_1_1  |                   0 | 2020-12-02 22:25:18.792078+09

After the 2nd auto analyze, all relations' n_mod_since_analyze should be 0,
but p1_1's is not. This is because p1_1 was analyzed before p1_1_1,
so p1_1 will be analyzed again in the 3rd auto analyze.
That is, propagating changes_since_analyze to *all ancestors* may cause
unnecessary analyzes even if we do this only when analyzing leaf partitions.
So I think we should track ancestors which are not analyzed in the current
iteration, as Horiguchi-san proposed.
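
The ordering effect can be replayed with a toy model of the three relations. This is a sketch only: ToySim, toy_analyze and toy_second_pass are hypothetical names, the analyze order (p1_1, then p1_1_1, then p1) is read off the last_autoanalyze timestamps above, and the model assumes that only leaf partitions propagate their counter.

```c
#include <stdbool.h>

enum { P1 = 0, P1_1 = 1, P1_1_1 = 2, NRELS = 3 };

/* toy_parent[i] is the index of i's parent, or -1 for the root. */
static const int  toy_parent[NRELS]  = { -1, P1, P1_1 };
static const bool toy_is_leaf[NRELS] = { false, false, true };

typedef struct ToySim
{
	long changes[NRELS];    /* models n_mod_since_analyze */
	bool analyzed[NRELS];   /* analyzed in the current iteration? */
} ToySim;

/*
 * Analyze one relation.  A leaf adds its counter to its ancestors; with
 * skip_analyzed set, ancestors already analyzed in this iteration are
 * left alone.  The relation's own counter is then reset.
 */
static void
toy_analyze(ToySim *s, int rel, bool skip_analyzed)
{
	if (toy_is_leaf[rel])
	{
		for (int a = toy_parent[rel]; a != -1; a = toy_parent[a])
		{
			if (!skip_analyzed || !s->analyzed[a])
				s->changes[a] += s->changes[rel];
		}
	}
	s->changes[rel] = 0;
	s->analyzed[rel] = true;
}

/* Replay the second autoanalyze pass, starting from the state shown above. */
static void
toy_second_pass(ToySim *s, bool skip_analyzed)
{
	s->changes[P1] = 300;
	s->changes[P1_1] = 300;
	s->changes[P1_1_1] = 200;
	for (int i = 0; i < NRELS; i++)
		s->analyzed[i] = false;

	toy_analyze(s, P1_1, skip_analyzed);
	toy_analyze(s, P1_1_1, skip_analyzed);
	toy_analyze(s, P1, skip_analyzed);
}
```

Running toy_second_pass with skip_analyzed set to false leaves p1_1 at 200, matching the psql output. With skip_analyzed set to true, all three counters end at 0, which is the behaviour that tracking analyzed ancestors is meant to give.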

Regarding your idea:

typedef struct PgStat_MsgAnalyze
{
    PgStat_MsgHdr   m_hdr;
    Oid             m_databaseid;
    Oid             m_tableoid;
    bool            m_autovacuum;
    bool            m_resetcounter;
    TimestampTz     m_analyzetime;
    PgStat_Counter  m_live_tuples;
    PgStat_Counter  m_dead_tuples;
    int             m_nancestors;
    Oid             m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
} PgStat_MsgAnalyze;

I'm not sure, but how about storing in m_ancestors[PGSTAT_NUM_ANCESTORENTRIES]
only the ancestors that haven't been analyzed in the current iteration?

Regarding the counters in pg_stat_all_tables: maybe some of these should be
null rather than zero? Or else you should make a 0001 patch to fully
implement this view, with all relevant counters, not just n_mod_since_analyze,
last_*analyze, and *analyze_count. These are specifically misleading:

last_vacuum |
last_autovacuum |
n_ins_since_vacuum | 0
vacuum_count | 0
autovacuum_count | 0

I haven't modified this part yet, but did you mean that we should set the
vacuum-related counters to null because partitioned tables are not vacuumed?

Perhaps because partitioned tables *cannot* be vacuumed. I'm not sure
what the best way is here. Showing null seems reasonable but I'm not
sure that doesn't break anything.

I agree that showing NULLs for the vacuum columns is better. Perhaps
the most reasonable way to do this is to use -1 as an indicator that NULL
ought to be returned from pg_stat_get_vacuum_count() et al, and add a
boolean in PgStat_TableCounts next to t_truncated, maybe "t_nullvacuum"
that says to store -1 instead of 0 in pgstat_recv_tabstat.

Thank you for the advice. I'll fix it based on this idea.

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#54Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: yuzuko (#53)
Re: Autovacuum on partitioned table (autoanalyze)

Hello Yuzuko,

On 2020-Dec-02, yuzuko wrote:

The problem Horiguchi-san mentioned is as follows:
[explanation]

Hmm, I see. So the problem is that if some ancestor is analyzed first,
then analyze of one of its partitions will cause a redundant analyze of
the ancestor, because the number of tuples that is propagated from the
partition represents a set that had already been included in the
ancestor's analysis.

If the problem was just that, then I think it would be very simple to
solve: just make sure to sort the tables to vacuum so that all leaves
are vacuumed first, and then all ancestors, sorted from the bottom up.
Problem solved.
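
A minimal sketch of such an ordering, assuming each candidate relation comes with its depth in the partition tree. ToyRel, its depth field and toy_sort_leaf_to_root are hypothetical; nothing like them appears in the patch. Sorting deepest-first is used here in place of a strict "all leaves, then ancestors" order, since it already guarantees that every partition comes before each of its ancestors.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy work-list entry: a relation OID and its depth (root = 0). */
typedef struct ToyRel
{
	unsigned int oid;
	int          depth;
} ToyRel;

/* qsort comparator: deeper relations first, ties broken by OID. */
static int
toy_cmp_deepest_first(const void *a, const void *b)
{
	const ToyRel *ra = (const ToyRel *) a;
	const ToyRel *rb = (const ToyRel *) b;

	if (ra->depth != rb->depth)
		return (ra->depth > rb->depth) ? -1 : 1;
	return (ra->oid > rb->oid) - (ra->oid < rb->oid);
}

/* Reorder the work list so that descendants precede their ancestors. */
static void
toy_sort_leaf_to_root(ToyRel *rels, size_t nrels)
{
	qsort(rels, nrels, sizeof(ToyRel), toy_cmp_deepest_first);
}
```

For the p1 / p1_1 / p1_1_1 example, depths 0, 1 and 2 would yield the order p1_1_1, p1_1, p1 within a single worker's list.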

But I'm not sure that that's the whole story, for two reasons: one, two
workers can run simultaneously, where one analyzes the partition and the
other analyzes the ancestor. Then the order is not guaranteed (and
each process will get no effect from remembering whether it did that one
or not). Second, manual analyzes can occur in any order.

Maybe it's more useful to think about this in terms of remembering that
partition P had changed_tuples set to N when we analyzed ancestor A.
Then, when we analyze partition P, we send the message listing A as
ancestor; on receipt of that message, we see M+N changed tuples in P,
but we know that we had already seen N, so we only record M.

I'm not sure how to implement this idea however, since on analyze of
ancestor A we don't have the list of partitions, so we can't know the N
for each partition.

#55yuzuko
yuzukohosoya@gmail.com
In reply to: Alvaro Herrera (#54)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Hello Alvaro,

On Thu, Dec 3, 2020 at 10:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hello Yuzuko,

On 2020-Dec-02, yuzuko wrote:

The problem Horiguchi-san mentioned is as follows:
[explanation]

Hmm, I see. So the problem is that if some ancestor is analyzed first,
then analyze of one of its partitions will cause a redundant analyze of
the ancestor, because the number of tuples that is propagated from the
partition represents a set that had already been included in the
ancestor's analysis.

If the problem was just that, then I think it would be very simple to
solve: just make sure to sort the tables to vacuum so that all leaves
are vacuumed first, and then all ancestors, sorted from the bottom up.
Problem solved.

Indeed. When I discussed this with Horiguchi-san before, he mentioned
the same approach:

So, to propagate the count properly, we need to analyze relations
leaf-to-root order, or propagate the counter only to ancestors that
haven't been processed in the current iteration. It seems a bit too
complex to sort analyze relations in that order.

but we didn't select it because of its complexity as you also said.

But I'm not sure that that's the whole story, for two reasons: one, two
workers can run simultaneously, where one analyzes the partition and the
other analyzes the ancestor. Then the order is not guaranteed (and
each process will get no effect from remembering whether it did that one
or not). Second, manual analyzes can occur in any order.

Maybe it's more useful to think about this in terms of remembering that
partition P had changed_tuples set to N when we analyzed ancestor A.
Then, when we analyze partition P, we send the message listing A as
ancestor; on receipt of that message, we see M+N changed tuples in P,
but we know that we had already seen N, so we only record M.

I'm not sure how to implement this idea however, since on analyze of
ancestor A we don't have the list of partitions, so we can't know the N
for each partition.

I thought about it for a while, but I couldn't come up with a way to implement it.
Also, I think the other approach Horiguchi-san suggested in [1] would be a
simpler way to solve the problem we are facing.

I've attached a new patch based on his patch. What do you think?

[1]: /messages/by-id/20201110.203557.1420746510378864931.horikyota.ntt@gmail.com

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v12_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..50b55f5d01 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -817,6 +817,18 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
    </para>
 
    <para>
+    For declaratively partitioned tables, only analyze is supported.
+    The same <quote>analyze threshold</quote> defined above is used,
+    but the number of tuples is the sum of their children's
+    <structname>pg_class</structname>.<structfield>reltuples</structfield>.
+    Also, the total number of tuples inserted, updated, or deleted since the last
+    <command>ANALYZE</command>, which is compared with the threshold, is calculated by
+    adding up the children's number of tuples analyzed in the previous <command>ANALYZE</command>.
+    This is because partitioned tables don't have any data.  So analyze on partitioned
+    tables is one lap behind their children.
+   </para>
+
+   <para>
     Temporary tables cannot be accessed by autovacuum.  Therefore,
     appropriate vacuum and analyze operations should be performed via
     session SQL commands.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 569f4c9da7..b4ad435966 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1337,8 +1337,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8ccc228a8c..35bc2e5bdb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1961,13 +1961,12 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c210bc..d7762aa3eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -591,7 +591,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -611,7 +611,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..9feb21f660 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -30,6 +30,7 @@
 #include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "catalog/indexing.h"
+#include "catalog/partition.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_namespace.h"
@@ -38,6 +39,7 @@
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
+#include "common/hashfn.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
@@ -107,6 +109,45 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+typedef struct analyze_oident
+{
+	Oid oid;
+	char status;
+} analyze_oident;
+
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX analyze_oids
+#define SH_ELEMENT_TYPE analyze_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+#define ANALYZED_OIDS_HASH_SIZE 128
+analyze_oids_hash *analyzed_reloids = NULL;
+
+void
+analyze_init_status(void)
+{
+	if (analyzed_reloids)
+		analyze_oids_destroy(analyzed_reloids);
+
+	analyzed_reloids = analyze_oids_create(CurrentMemoryContext,
+										   ANALYZED_OIDS_HASH_SIZE, NULL);
+}
+
+void
+analyze_destroy_status(void)
+{
+	if (analyzed_reloids)
+		analyze_oids_destroy(analyzed_reloids);
+
+	analyzed_reloids = NULL;
+}
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -312,6 +353,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	bool        found;
 
 	if (inh)
 		ereport(elevel,
@@ -644,15 +686,70 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/*
-	 * Report ANALYZE to the stats collector, too.  However, if doing
-	 * inherited stats we shouldn't report, because the stats collector only
-	 * tracks per-table stats.  Reset the changes_since_analyze counter only
-	 * if we analyzed all columns; otherwise, there is still work for
-	 * auto-analyze to do.
+	 * Report ANALYZE to the stats collector, too.  Regarding inherited stats,
+	 * we report only in the case of declarative partitioning.  Reset the
+	 * changes_since_analyze counter only if we analyzed all columns;
+	 * otherwise, there is still work for auto-analyze to do.
 	 */
-	if (!inh)
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		List     *ancestors;
+		ListCell *lc;
+		Datum	oiddatum = ObjectIdGetDatum(RelationGetRelid(onerel));
+		Datum	countdatum;
+		int64	change_count;
+
+		if (onerel->rd_rel->relispartition &&
+			!(onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
+		{
+
+			/* collect all ancestors of this relation */
+			ancestors = get_partition_ancestors(RelationGetRelid(onerel));
+
+			/*
+			 * Read current value of n_mod_since_analyze of this relation.  This
+			 * might be a bit stale but we don't need such correctness here.
+			 */
+			countdatum =
+				DirectFunctionCall1(pg_stat_get_mod_since_analyze, oiddatum);
+			change_count = DatumGetInt64(countdatum);
+
+			/*
+			 * To let partitioned relations be analyzed, we need to update
+			 * change_since_analyze also for partitioned relations, which don't
+			 * have storage.  We move the count of leaf-relations to ancestors
+			 * before resetting.  We could instead bump up the counter of all
+			 * ancestors every time leaf relations are updated but that is too
+			 * complex.
+			 */
+			foreach (lc, ancestors)
+			{
+				Oid toreloid = lfirst_oid(lc);
+
+				/*
+				 * Don't propagate the count to anscestors that have already been
+				 * analyzed in this analyze command or this iteration of
+				 * autoanalyze.
+				 */
+				if (analyze_oids_lookup(analyzed_reloids, toreloid) == NULL)
+				{
+					Relation rel;
+
+					rel = table_open(toreloid, AccessShareLock);
+					pgstat_report_partchanges(rel, change_count);
+					table_close(rel, AccessShareLock);
+				}
+
+			}
+			list_free(ancestors);
+		}
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
+	}
+
+	/* Record this relation as "analyzed"  */
+	analyze_oids_insert(analyzed_reloids, onerel->rd_id, &found);
+
 
 	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
 	if (!(params->options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 98270a1049..336d9e297a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -437,6 +437,7 @@ vacuum(List *relations, VacuumParams *params,
 		VacuumSharedCostBalance = NULL;
 		VacuumActiveNWorkers = NULL;
 
+		analyze_init_status();
 		/*
 		 * Loop to process each selected relation.
 		 */
@@ -487,6 +488,7 @@ vacuum(List *relations, VacuumParams *params,
 	{
 		in_vacuum = false;
 		VacuumCostActive = false;
+		analyze_destroy_status();
 	}
 	PG_END_TRY();
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e28944d2f..3c18602e76 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -2056,11 +2057,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2083,7 +2084,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2746,6 +2748,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3161,7 +3164,41 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up childrens'
+			 * reltuples.
+			 */
+			List     *children;
+			ListCell *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (RELKIND_HAS_STORAGE(childclass->relkind))
+				{
+					/* Sum up the child's reltuples for its parent table */
+					reltuples += childclass->reltuples;
+				}
+				ReleaseSysCache(childtuple);
+			}
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7c75a25d21..76075007bf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -369,6 +369,7 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1567,6 +1568,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * Exceptional support only changes_since_analyze for partitioned tables,
+ * though they don't have any data.  This counter will tell us whether
+ * partitioned tables need autoanalyze or not.
  * --------
  */
 void
@@ -1587,22 +1591,29 @@ pgstat_report_analyze(Relation rel,
 	 * off these counts from what we send to the collector now, else they'll
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
-	 * committing.)
+	 * committing.)  However, for partitioned tables, we will not report both
+	 * livetuples and deadtuples because those tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
 	}
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
@@ -1617,6 +1628,30 @@ pgstat_report_analyze(Relation rel,
 }
 
 /* --------
+ * pgstat_report_partchanges() -
+ *
+ *
+ *  Called when a leaf partition is analyzed to tell the collector about
+ *  its parent's changed_tuples.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges  msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1932,7 +1967,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -4870,6 +4906,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -6739,6 +6779,18 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry   *dbentry;
+	PgStat_StatTabEntry  *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..c6dcf23898 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -280,6 +280,8 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 									 int options, bool verbose, LOCKMODE lmode);
 
 /* in commands/analyze.c */
+extern void analyze_init_status(void);
+extern void analyze_destroy_status(void);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..d5a0ec6467 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -60,6 +60,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -419,6 +420,18 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges			Sent by the autovacuum deamon
+ *                                  after ANALYZE of leaf partitions
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid           m_databaseid;
+	Oid           m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -643,6 +656,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -1393,7 +1407,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6293ab57bc..e44f82ec73 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1804,7 +1804,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2175,7 +2175,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#56David Steele
david@pgmasters.net
In reply to: yuzuko (#55)
Re: Autovacuum on partitioned table (autoanalyze)

On 12/14/20 8:46 PM, yuzuko wrote:

On Thu, Dec 3, 2020 at 10:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I've attached a new patch based on his patch. What do you think?

Álvaro, Justin, Kyotaro, thoughts on this latest patch?

Regards,
--
-David
david@pgmasters.net

#57Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: David Steele (#56)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

I took a look at this patch. It does not apply because of 5f8727f5a67,
so a rebase is needed. But I want to talk about the approach in
general, so it does not matter.

The thread is fairly long, both in terms of number of messages and time
(started in 2019), so let me restate my understanding of the problem and
what the patch aims to do.

The problem is that autovacuum never analyzes non-leaf relations in
partition hierarchies, because they never get modified and so the value
of changes_since_analyze remains 0. This applies both to partitioning
based on inheritance and the new fancy declarative partitioning. The
consequence is that we never have accurate statistics (MCV, histograms
and so on) for the parent, which may lead to poor query plans in cases
when we don't use the child statistics for some reason.

The patch aims to fix that by propagating the changes_since_analyze to
the parent relations, so that the autovacuum properly considers if those
non-leaf relations need analyze.

I think the goal is right, and propagating the changes_since_analyze is
the right solution, but as coded it has a couple issues that may cause
trouble in practice.

Firstly, the patch propagates the changes_since_analyze values from
do_analyze_rel, i.e. from the worker after it analyzes the relation.
That may easily lead to cases with unnecessary analyzes - consider a
partitioned table with 4 child relations:

p1 [reltuples=1M, changes_since_analyze=400k]
p2 [reltuples=1M, changes_since_analyze=90k]
p3 [reltuples=1M, changes_since_analyze=90k]
p4 [reltuples=1M, changes_since_analyze=90k]

With the default analyze threshold (10%) this means autoanalyze of p1,
and then (in the next round) analyze of the whole partitioned table,
because 400k is 10% of 4M. So far so good - we're now in this state:

p1 [reltuples=1M, changes_since_analyze=0]
p2 [reltuples=1M, changes_since_analyze=90k]
p3 [reltuples=1M, changes_since_analyze=90k]
p4 [reltuples=1M, changes_since_analyze=90k]

Let's do ~310k more modifications to p2:

p1 [reltuples=1M, changes_since_analyze=0]
p2 [reltuples=1M, changes_since_analyze=400k]
p3 [reltuples=1M, changes_since_analyze=90k]
p4 [reltuples=1M, changes_since_analyze=90k]

Now p2 gets analyzed, and the value gets propagated to p, triggering the
analyze. But that's bogus - we've already seen 90k of those rows in the
last analyze, the "actual" changes_since_analyze is just 310k and that
should not have triggered the analyze.
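
The arithmetic of this example can be sketched in a few lines of C. This is only a toy model, not code from the patch: LeafModel, propagated_by_v12(), actually_new_for_parent() and analyze_threshold() are hypothetical names, and the base threshold is set to 0 so that the threshold is exactly 10% of reltuples, as in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the counters of one leaf partition. */
typedef struct LeafModel
{
	int64_t		changes_since_analyze;	/* since the leaf's own last ANALYZE */
	int64_t		changes_seen_by_parent; /* part already covered by the
										 * parent's last ANALYZE */
} LeafModel;

/* What the v12 scheme adds to the parent once the leaf is analyzed. */
static int64_t
propagated_by_v12(const LeafModel *leaf)
{
	return leaf->changes_since_analyze;
}

/* Changes that the parent has really not seen yet. */
static int64_t
actually_new_for_parent(const LeafModel *leaf)
{
	assert(leaf->changes_seen_by_parent <= leaf->changes_since_analyze);
	return leaf->changes_since_analyze - leaf->changes_seen_by_parent;
}

/* analyze threshold = base threshold + scale factor * number of tuples */
static double
analyze_threshold(double base_thresh, double scale_factor, double reltuples)
{
	return base_thresh + scale_factor * reltuples;
}
```

For p2 in the example (400k changes, 90k of them already sampled by the parent's last analyze), the first function yields 400k while the second yields 310k, which stays below the parent's threshold of roughly 400k for 4M tuples.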

I could invent much more extreme examples with more partitions, and/or
with multiple autovacuum workers processing the leaf rels concurrently.

This seems like a quite serious issue, because analyzes on partitioned
tables sample all the partitions, which seems rather expensive. That is
not an issue introduced by this patch, of course, but it's good to keep
that in mind and not make the matters worse.

Note: I do have some ideas about how to improve that, I've started a
separate thread about it [1].

IMHO the primary issue is the patch is trying to report the counts from
the workers, and it's done incrementally, after the fact. It tries to
prevent the issue by not propagating the counts to parents analyzed in
the same round, but that doesn't seem sufficient:

- There's no guarantee how long ago the parent was analyzed. Maybe it
was 1 second ago, or maybe it was 24h ago and there have been many new
modifications since then?

- The hash table is per worker, so who knows what the other
autovacuum workers did?

So not really a good solution, I'm afraid.

I propose a different approach - instead of propagating the counts in
do_analyze_rel for individual leaf tables, let's do that in bulk in
relation_needs_vacanalyze. Before the (existing) first pass over
pg_class, we can add a new one, propagating counts from leaf tables to
parents. I'd imagine something like this:

    while ((tuple = heap_getnext(relScan, ...)) != NULL)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);

        ... find all ancestors for classForm ...

        pgstat_propagate_changes(classForm, ancestors);
    }

The pgstat_propagate_changes() simply informs the pgstat collector that
classForm has certain ancestors, and it propagates the value to all of
them. There's a problem, though - we can't reset the value for the leaf
table, because we need to check if it needs analyze, but we also don't
want to send it again next time. But we can add another counter,
tracking that part of changes_since_analyze we already propagated, and
propagate only the difference. That is, we now have this:

PgStat_Counter changes_since_analyze;
PgStat_Counter changes_since_analyze_reported;

So for example we start with

changes_since_analyze = 10000;
changes_since_analyze_reported = 0;

and we propagate 10k to parents:

changes_since_analyze = 10000;
changes_since_analyze_reported = 10000;

but we don't analyze anything, and we accumulate 5k more changes:

changes_since_analyze = 15000;
changes_since_analyze_reported = 10000;

so now we propagate only the 5k delta. And so on. It's not exactly an
atomic change (we still do this per relation), but it's "bulk" in the
sense that we force the propagation and don't wait until after the leaf
happens to be analyzed.

It might need to reread the stats file I think, to get the incremented
values, but that seems acceptable.

We may need to "sync" the counts for individual relations in a couple
places (e.g. after the worker is done with the leaf, it should propagate
the remaining delta before resetting the values to 0). Maybe multi-level
partitioning needs some additional handling, not sure.
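
The two-counter scheme described above can be sketched as a small stand-alone C model. It is only an illustration under stated assumptions: TabStats, propagate_changes() and leaf_analyzed() are hypothetical names, the counters live in plain structs rather than in the stats collector, and the ancestor list is passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t Counter;		/* stand-in for PgStat_Counter */

/* Hypothetical per-table entry holding only the two counters discussed. */
typedef struct TabStats
{
	Counter		changes_since_analyze;
	Counter		changes_since_analyze_reported;
} TabStats;

/*
 * Add the not-yet-reported part of the leaf's counter to every ancestor,
 * then remember that it has been reported.  Returns the propagated delta.
 */
static Counter
propagate_changes(TabStats *leaf, TabStats **ancestors, int nancestors)
{
	Counter		delta = leaf->changes_since_analyze -
		leaf->changes_since_analyze_reported;

	assert(delta >= 0);

	for (int i = 0; i < nancestors; i++)
		ancestors[i]->changes_since_analyze += delta;

	leaf->changes_since_analyze_reported = leaf->changes_since_analyze;
	return delta;
}

/*
 * "Sync" step after the leaf itself was analyzed: push the remaining delta
 * to the ancestors first, then reset both leaf counters.
 */
static void
leaf_analyzed(TabStats *leaf, TabStats **ancestors, int nancestors)
{
	(void) propagate_changes(leaf, ancestors, nancestors);
	leaf->changes_since_analyze = 0;
	leaf->changes_since_analyze_reported = 0;
}
```

Starting from 10000 unreported changes, the first call propagates 10000; after 5000 more changes the second call propagates only 5000, so the parent ends up with 15000 rather than 25000.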

regards

[1]: https://commitfest.postgresql.org/33/3052/

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#58Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#57)
Re: Autovacuum on partitioned table (autoanalyze)

On 3/30/21 4:09 AM, Tomas Vondra wrote:

Hi,

...

We may need to "sync" the counts for individual relations in a couple
places (e.g. after the worker is done with the leaf, it should propagate
the remaining delta before resetting the values to 0). Maybe multi-level
partitioning needs some additional handling, not sure.

I forgot to mention one additional thing yesterday - I wonder if we need
to do something similar after a partition is attached/detached. That can
also change the parent's statistics significantly, so maybe we should
treat all of the partition's rows as changes_since_analyze? Not necessarily
something this patch has to handle, but might be related.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#59yuzuko
yuzukohosoya@gmail.com
In reply to: Tomas Vondra (#58)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Hi Tomas,

Thank you for reviewing the patch.

Firstly, the patch propagates the changes_since_analyze values from
do_analyze_rel, i.e. from the worker after it analyzes the relation.
That may easily lead to cases with unnecessary analyzes - consider a
partitioned table with 4 child relations:
[ explanation ]

I didn't realize that until now. Indeed, this approach increments the parent's
changes_since_analyze counter by its leaf partition's counter
when the leaf partition is analyzed, so it will cause unnecessary ANALYZE
runs on partitioned tables, as you described.

I propose a different approach - instead of propagating the counts in
do_analyze_rel for individual leaf tables, let's do that in bulk in
relation_needs_vacanalyze. Before the (existing) first pass over
pg_class, we can add a new one, propagating counts from leaf tables to
parents.

Thank you for your suggestion. I think it could solve all the issues
you mentioned. I modified the patch based on this approach:

- Create a new counter, PgStat_Counter changes_since_analyze_reported,
to track the part of changes_since_analyze we have already propagated to
ancestors. It is used only for internal processing, and users may not need
to see it, so this counter is not displayed in the pg_stat_all_tables view
for now.

- Create a new function, pgstat_propagate_changes(), which propagates the
changes_since_analyze counter to all ancestors and saves
changes_since_analyze_reported. This function is called in
do_autovacuum() before relation_needs_vacanalyze().

Note: I do have some ideas about how to improve that, I've started a
separate thread about it [1].

I'm also interested in merging children's statistics for partitioned tables
because it will make ANALYZE on inheritance trees more efficient.
So I'll look at that thread later.

I forgot to mention one additional thing yesterday - I wonder if we need
to do something similar after a partition is attached/detached. That can
also change the parent's statistics significantly, so maybe we should
treat all of the partition's rows as changes_since_analyze? Not necessarily
something this patch has to handle, but might be related.

Regarding attached/detached partitions, I think we should update the statistics
of partitioned tables according to the new inheritance tree. The latest patch
doesn't handle this case yet, but I'll give it a try soon.

I've attached the v13 patch to this email. Could you please check it again?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v13_autovacuum_on_partitioned_table.patch (application/octet-stream)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d897bbe..54eba63 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP| RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1962,12 +1962,11 @@ bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d..fb41b06 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -660,7 +660,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -680,7 +680,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f84616d..35e9a2f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,20 +655,22 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								InvalidMultiXactId,
 								in_outer_xact);
 		}
+	}
 
-		/*
-		 * Now report ANALYZE to the stats collector.
-		 *
-		 * We deliberately don't report to the stats collector when doing
-		 * inherited stats, because the stats collector only tracks per-table
-		 * stats.
-		 *
-		 * Reset the changes_since_analyze counter only if we analyzed all
-		 * columns; otherwise, there is still work for auto-analyze to do.
-		 */
+	/*
+	 * Now report ANALYZE to the stats collector.
+	 *
+	 * Regarding inherited stats, we report only in the case of declarative
+	 * partitioning.  For partitioning based on inheritance, stats collector
+	 * only tracks per-table stats.
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
-	}
+
 
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c..e715ceb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -74,7 +74,9 @@
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -351,6 +353,8 @@ static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
 									const char *nspname, const char *relname);
 static void avl_sigusr2_handler(SIGNAL_ARGS);
 static void autovac_refresh_stats(void);
+static void pgstat_propagate_changes(Form_pg_class classForm, PgStat_StatTabEntry *tabentry,
+									 List *ancestors);
 
 
 
@@ -2055,11 +2059,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is that
+	 * during it we want to use the main relation's pg_class.reloptions entry
+	 * if the TOAST table does not have any, and we cannot obtain it unless we
+	 * know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2080,9 +2084,11 @@ do_autovacuum(void)
 		bool		dovacuum;
 		bool		doanalyze;
 		bool		wraparound;
+		List        *ancestors;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2117,6 +2123,17 @@ do_autovacuum(void)
 		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
 											 shared, dbentry);
 
+		/*
+		 * If this relation is a leaf partition, collect all ancestors
+		 * and propagate changes_since_analyze counts to them.
+		 */
+		if (classForm->relispartition &&
+			!(classForm->relkind == RELKIND_PARTITIONED_TABLE))
+		{
+			ancestors = get_partition_ancestors(relid);
+			pgstat_propagate_changes(classForm, tabentry, ancestors);
+		}
+
 		/* Check if it needs vacuum or analyze */
 		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
 								  effective_multixact_freeze_max_age,
@@ -2745,6 +2762,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3161,7 +3179,41 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up childrens'
+			 * reltuples.
+			 */
+			List     *children;
+			ListCell *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid        childOID = lfirst_oid(lc);
+				HeapTuple  childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (RELKIND_HAS_STORAGE(childclass->relkind))
+				{
+					/* Sum up the child's reltuples for its parent table */
+					reltuples += childclass->reltuples;
+				}
+				ReleaseSysCache(childtuple);
+			}
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
@@ -3488,3 +3540,56 @@ autovac_refresh_stats(void)
 
 	pgstat_clear_snapshot();
 }
+
+/*
+ * pgstat_propagate_changes
+ *
+ *		Propagate the changes_since_analyze counter to all ancestors so
+ *		that partitioned tables are analyzed automatically.
+ *
+ * We decide whether a partitioned table needs auto-analyze based on its
+ * changes_since_analyze counter, which is propagated from all of its leaf
+ * partitions.  To propagate each change only once, every leaf partition
+ * tracks a changes_since_analyze_reported counter in addition to
+ * changes_since_analyze.  While changes_since_analyze counts the tuples
+ * changed in the partition since its last analyze,
+ * changes_since_analyze_reported counts how much of that total has already
+ * been propagated to the ancestors.  We then propagate only the difference
+ * between the two counters to each ancestor.
+ */
+static void
+pgstat_propagate_changes(Form_pg_class classForm, PgStat_StatTabEntry *tabentry,
+						 List *ancestors)
+{
+
+	float4		anltuples,
+				anltuples_reported,
+				change_count;
+	ListCell   *lc;
+	Relation    parentrel,
+				childrel;
+
+	if (!PointerIsValid(tabentry))
+		return;
+
+	anltuples = tabentry->changes_since_analyze;
+	anltuples_reported = tabentry->changes_since_analyze_reported;
+	change_count = anltuples - anltuples_reported;
+
+	/* update changes_since_analyze of ancestors */
+	if (anltuples > 0 && change_count > 0)
+	{
+		foreach (lc, ancestors)
+		{
+			Oid relid = lfirst_oid(lc);
+			parentrel = table_open(relid, AccessShareLock);
+			pgstat_report_partchanges(parentrel, change_count);
+			table_close(parentrel, AccessShareLock);
+		}
+
+		/* update own changes_since_analyze_reported */
+		childrel = table_open(classForm->oid, AccessShareLock);
+		pgstat_report_reportedchanges(childrel, change_count);
+		table_close(childrel, AccessShareLock);
+	}
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4b9bcd2..9d72e6f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -373,6 +373,8 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
+static void pgstat_recv_reportedchanges(PgStat_MsgReportedChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1622,6 +1624,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * As an exception, partitioned tables are supported too, but only for
+ * changes_since_analyze, since they hold no data of their own.  That counter
+ * tells us whether a partitioned table needs autoanalyze.
  * --------
  */
 void
@@ -1642,24 +1647,30 @@ pgstat_report_analyze(Relation rel,
 	 * off these counts from what we send to the collector now, else they'll
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
-	 * committing.)
+	 * committing.)  However, for partitioned tables, we report neither
+	 * livetuples nor deadtuples, because those tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* Partitioned tables hold no data, so report zero tuples */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
 	}
-
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
@@ -1672,6 +1683,50 @@ pgstat_report_analyze(Relation rel,
 }
 
 /* --------
+ * pgstat_report_partchanges() -
+ *
+ *  Propagate changes_since_analyze counter from a leaf partition to its parent.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges  msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/* --------
+ * pgstat_report_reportedchanges() -
+ *
+ *  Tell the collector how much of a leaf partition's changes_since_analyze
+ *  counter has already been propagated to its ancestors.
+ * --------
+ */
+void
+pgstat_report_reportedchanges(Relation rel, PgStat_Counter changed_tuples_reported)
+{
+	PgStat_MsgReportedChanges  msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	elog(DEBUG3, "%s, report reportedchanges", NameStr(rel->rd_rel->relname));
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPORTEDCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples_reported = changed_tuples_reported;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1986,7 +2041,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -5001,6 +5057,14 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
+				case PGSTAT_MTYPE_REPORTEDCHANGES:
+					pgstat_recv_reportedchanges(&msg.msg_reportedchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -5215,6 +5279,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 		result->n_live_tuples = 0;
 		result->n_dead_tuples = 0;
 		result->changes_since_analyze = 0;
+		result->changes_since_analyze_reported = 0;
 		result->inserts_since_vacuum = 0;
 		result->blocks_fetched = 0;
 		result->blocks_hit = 0;
@@ -6477,6 +6542,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported = 0; //tabmsg->t_counts.t_changed_tuples_reported;
 			tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
@@ -6512,6 +6578,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported = 0; //tabmsg->t_counts.t_changed_tuples_reported;
 			tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
@@ -6868,7 +6935,10 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	 * have no good way to estimate how many of those there were.
 	 */
 	if (msg->m_resetcounter)
+	{
 		tabentry->changes_since_analyze = 0;
+		tabentry->changes_since_analyze_reported = 0;
+	}
 
 	if (msg->m_autovacuum)
 	{
@@ -6882,6 +6952,34 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry   *dbentry;
+	PgStat_StatTabEntry  *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
+
+
+static void
+pgstat_recv_reportedchanges(PgStat_MsgReportedChanges *msg, int len)
+{
+	PgStat_StatDBEntry   *dbentry;
+	PgStat_StatTabEntry  *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze_reported += msg->m_changed_tuples_reported;
+}
+
+
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d699502..b0fd957 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -70,6 +70,8 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
+	PGSTAT_MTYPE_REPORTEDCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -127,6 +129,7 @@ typedef struct PgStat_TableCounts
 	PgStat_Counter t_delta_live_tuples;
 	PgStat_Counter t_delta_dead_tuples;
 	PgStat_Counter t_changed_tuples;
+	PgStat_Counter t_changed_tuples_reported;
 
 	PgStat_Counter t_blocks_fetched;
 	PgStat_Counter t_blocks_hit;
@@ -430,6 +433,32 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges			Sent by the autovacuum daemon to propagate
+ *                                  the changed_tuples counter.
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid           m_databaseid;
+	Oid           m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
+
+/* ----------
+ * PgStat_MsgReportedChanges			Sent by the autovacuum daemon to update
+ *                                      changed_tuples_reported.
+ * ----------
+ */
+typedef struct PgStat_MsgReportedChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid           m_databaseid;
+	Oid           m_tableoid;
+	PgStat_Counter m_changed_tuples_reported;
+} PgStat_MsgReportedChanges;
+
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -675,6 +704,8 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
+	PgStat_MsgReportedChanges msg_reportedchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -770,6 +801,7 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter n_live_tuples;
 	PgStat_Counter n_dead_tuples;
 	PgStat_Counter changes_since_analyze;
+	PgStat_Counter changes_since_analyze_reported;
 	PgStat_Counter inserts_since_vacuum;
 
 	PgStat_Counter blocks_fetched;
@@ -1445,7 +1477,8 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples);
+extern void pgstat_report_reportedchanges(Relation rel, PgStat_Counter changed_tuples_reported);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9b59a7b..954afb9 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1806,7 +1806,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2209,7 +2209,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#60Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: yuzuko (#59)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Thanks for the quick rework. I like this design much better and I think
this is pretty close to committable. Here's a rebased copy with some
small cleanups (most notably, avoid calling pgstat_propagate_changes
when the partition doesn't have a tabstat entry; also, free the lists
that are allocated in a couple of places).
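
As a rough illustration of the propagation step (a simplified sketch, not code from the patch): the `TabCounters` struct and the `propagate_changes` function below are hypothetical stand-ins for the collector's per-table entry and for `pgstat_propagate_changes`. They update counters in memory instead of sending messages to the stats collector, and the sketch folds the missing-entry check into the function, whereas the patch does that check in the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-table counters kept by the collector. */
typedef struct TabCounters
{
	int64_t		changes_since_analyze;
	int64_t		changes_since_analyze_reported;
} TabCounters;

/*
 * Add the not-yet-reported part of a leaf's changes_since_analyze to every
 * ancestor, then record on the leaf how much has been reported so far.
 * Returns the delta that was propagated, or 0 if there was nothing new.
 */
static int64_t
propagate_changes(TabCounters *leaf, TabCounters **ancestors, size_t nancestors)
{
	int64_t		delta;

	/* No stats entry for the leaf: nothing to propagate. */
	if (leaf == NULL)
		return 0;

	delta = leaf->changes_since_analyze - leaf->changes_since_analyze_reported;
	if (leaf->changes_since_analyze <= 0 || delta <= 0)
		return 0;

	/* Each ancestor receives only the difference, never the full total. */
	for (size_t i = 0; i < nancestors; i++)
		ancestors[i]->changes_since_analyze += delta;

	leaf->changes_since_analyze_reported += delta;
	return delta;
}
```

Calling it twice in a row on the same leaf without new changes propagates nothing the second time, which is the point of keeping the second counter.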

I didn't actually verify that it works.
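
For anyone who wants to check the behaviour by hand, the condition that should trigger auto-analyze on a partitioned table can be sketched as below. This assumes the usual autovacuum rule (analyze once the changes exceed the base threshold plus the scale factor times reltuples), with reltuples taken as the sum over children that have storage, as the patch does in relation_needs_vacanalyze. `ChildRel`, `sum_child_reltuples` and `needs_analyze` are illustrative names only, not functions in PostgreSQL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one child's pg_class row. */
typedef struct ChildRel
{
	bool		has_storage;	/* false for sub-partitioned or foreign children */
	double		reltuples;
} ChildRel;

/* Sum reltuples over children that have storage; others are skipped. */
static double
sum_child_reltuples(const ChildRel *children, size_t nchildren)
{
	double		total = 0.0;

	for (size_t i = 0; i < nchildren; i++)
	{
		if (children[i].has_storage)
			total += children[i].reltuples;
	}
	return total;
}

/* Analyze when changes exceed base threshold + scale factor * reltuples. */
static bool
needs_analyze(double anl_base_thresh, double anl_scale_factor,
			  double reltuples, double anltuples)
{
	double		anlthresh = anl_base_thresh + anl_scale_factor * reltuples;

	return anltuples > anlthresh;
}
```

With a base threshold of 50 and a scale factor of 0.1, a parent whose storage-bearing children total 1500 tuples would be analyzed once more than 200 changes have been propagated to it.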

--
Álvaro Herrera Valdivia, Chile
"The first law of live demos is: don't try to use the system.
Write a script that touches nothing, so as not to cause damage." (Jakob Nielsen)

Attachments:

v14_autovacuum_on_partitioned_tables.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d897bbec2b..5554275e64 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1962,12 +1962,11 @@ bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d316..fb41b06539 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -660,7 +660,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -680,7 +680,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f84616d3d2..35e9a2fc17 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,20 +655,22 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								InvalidMultiXactId,
 								in_outer_xact);
 		}
+	}
 
-		/*
-		 * Now report ANALYZE to the stats collector.
-		 *
-		 * We deliberately don't report to the stats collector when doing
-		 * inherited stats, because the stats collector only tracks per-table
-		 * stats.
-		 *
-		 * Reset the changes_since_analyze counter only if we analyzed all
-		 * columns; otherwise, there is still work for auto-analyze to do.
-		 */
+	/*
+	 * Now report ANALYZE to the stats collector.
+	 *
+	 * Regarding inherited stats, we report only in the case of declarative
+	 * partitioning.  For partitioning based on inheritance, stats collector
+	 * only tracks per-table stats.
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
-	}
+
 
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13e..7ca074a800 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -74,7 +74,9 @@
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -350,6 +352,8 @@ static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
 									const char *nspname, const char *relname);
 static void avl_sigusr2_handler(SIGNAL_ARGS);
+static void pgstat_propagate_changes(Form_pg_class classForm,
+									 PgStat_StatTabEntry *tabentry);
 static void autovac_refresh_stats(void);
 
 
@@ -2055,11 +2059,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is
+	 * that during it we want to use the main relation's pg_class.reloptions
+	 * entry if the TOAST table does not have any, and we cannot obtain it
+	 * unless we know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2082,7 +2086,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2117,6 +2122,16 @@ do_autovacuum(void)
 		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
 											 shared, dbentry);
 
+		/*
+		 * If this relation is a leaf partition, propagate
+		 * changes_since_analyze counts to all ancestors.
+		 */
+		if (classForm->relispartition && tabentry &&
+			!(classForm->relkind == RELKIND_PARTITIONED_TABLE))
+		{
+			pgstat_propagate_changes(classForm, tabentry);
+		}
+
 		/* Check if it needs vacuum or analyze */
 		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
 								  effective_multixact_freeze_max_age,
@@ -2745,6 +2760,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3161,7 +3177,43 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			reltuples = classForm->reltuples;
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up
+			 * children's reltuples.
+			 */
+			List	   *children;
+			ListCell   *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid			childOID = lfirst_oid(lc);
+				HeapTuple	childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (RELKIND_HAS_STORAGE(childclass->relkind))
+				{
+					/* Sum up the child's reltuples for its parent table */
+					reltuples += childclass->reltuples;
+				}
+				ReleaseSysCache(childtuple);
+			}
+
+			list_free(children);
+		}
+
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
@@ -3312,6 +3364,61 @@ autovac_report_workitem(AutoVacuumWorkItem *workitem,
 	pgstat_report_activity(STATE_RUNNING, activity);
 }
 
+/*
+ * pgstat_propagate_changes
+ *
+ *		Propagate the changes_since_analyze counter to all ancestors so
+ *		that partitioned tables are analyzed automatically.
+ *
+ * We decide whether a partitioned table needs auto-analyze based on its
+ * changes_since_analyze counter, which is propagated from all of its leaf
+ * partitions.  To propagate each change only once, every leaf partition
+ * tracks a changes_since_analyze_reported counter in addition to
+ * changes_since_analyze.  While changes_since_analyze counts the tuples
+ * changed in the partition since its last analyze,
+ * changes_since_analyze_reported counts how much of that total has already
+ * been propagated to the ancestors.  We then propagate only the difference
+ * between the two counters to each ancestor.
+ */
+static void
+pgstat_propagate_changes(Form_pg_class classForm, PgStat_StatTabEntry *tabentry)
+{
+
+	float4		anltuples,
+				anltuples_reported,
+				change_count;
+	List	   *ancestors;
+	ListCell   *lc;
+	Relation	parentrel,
+				childrel;
+
+	ancestors = get_partition_ancestors(classForm->oid);
+
+	anltuples = tabentry->changes_since_analyze;
+	anltuples_reported = tabentry->changes_since_analyze_reported;
+	change_count = anltuples - anltuples_reported;
+
+	/* update changes_since_analyze of ancestors */
+	if (anltuples > 0 && change_count > 0)
+	{
+		foreach(lc, ancestors)
+		{
+			Oid			relid = lfirst_oid(lc);
+
+			parentrel = table_open(relid, AccessShareLock);
+			pgstat_report_partchanges(parentrel, change_count);
+			table_close(parentrel, AccessShareLock);
+		}
+
+		/* update own changes_since_analyze_reported */
+		childrel = table_open(classForm->oid, AccessShareLock);
+		pgstat_report_reportedchanges(childrel, change_count);
+		table_close(childrel, AccessShareLock);
+	}
+
+	list_free(ancestors);
+}
+
 /*
  * AutoVacuumingActive
  *		Check GUC vars and report whether the autovacuum process should be
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4c4b072068..27316f598a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -343,6 +343,8 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
+static void pgstat_recv_reportedchanges(PgStat_MsgReportedChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1592,6 +1594,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * As an exception, only changes_since_analyze is supported for partitioned
+ * tables, even though they don't have any data.  This counter tells us
+ * whether partitioned tables need autoanalyze or not.
  * --------
  */
 void
@@ -1613,23 +1618,31 @@ pgstat_report_analyze(Relation rel,
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
 	 * committing.)
+	 *
+	 * For partitioned tables, we don't report live and dead tuples, because
+	 * such tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
 	}
-
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
@@ -1641,6 +1654,49 @@ pgstat_report_analyze(Relation rel,
 	pgstat_send(&msg, sizeof(msg));
 }
 
+/* --------
+ * pgstat_report_partchanges() -
+ *
+ *  Propagate changes_since_analyze counter from a leaf partition to its parent.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/* --------
+ * pgstat_report_reportedchanges() -
+ *
+ *  Tell the collector changes_since_analyze counter we have already
+ *  propagated to its ancestors.
+ * --------
+ */
+void
+pgstat_report_reportedchanges(Relation rel, PgStat_Counter changed_tuples_reported)
+{
+	PgStat_MsgReportedChanges msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPORTEDCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples_reported = changed_tuples_reported;
+	pgstat_send(&msg, sizeof(msg));
+}
+
 /* --------
  * pgstat_report_recovery_conflict() -
  *
@@ -1958,7 +2014,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -3287,6 +3344,14 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
+				case PGSTAT_MTYPE_REPORTEDCHANGES:
+					pgstat_recv_reportedchanges(&msg.msg_reportedchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -3501,6 +3566,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 		result->n_live_tuples = 0;
 		result->n_dead_tuples = 0;
 		result->changes_since_analyze = 0;
+		result->changes_since_analyze_reported = 0;
 		result->inserts_since_vacuum = 0;
 		result->blocks_fetched = 0;
 		result->blocks_hit = 0;
@@ -4768,6 +4834,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported = 0;
 			tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
@@ -4803,6 +4870,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported = 0;
 			tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
@@ -5159,7 +5227,10 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	 * have no good way to estimate how many of those there were.
 	 */
 	if (msg->m_resetcounter)
+	{
 		tabentry->changes_since_analyze = 0;
+		tabentry->changes_since_analyze_reported = 0;
+	}
 
 	if (msg->m_autovacuum)
 	{
@@ -5173,6 +5244,34 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
+
+
+static void
+pgstat_recv_reportedchanges(PgStat_MsgReportedChanges *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze_reported += msg->m_changed_tuples_reported;
+}
+
+
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 7cd137506e..194003d5d1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -69,6 +69,8 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
+	PGSTAT_MTYPE_REPORTEDCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -126,6 +128,7 @@ typedef struct PgStat_TableCounts
 	PgStat_Counter t_delta_live_tuples;
 	PgStat_Counter t_delta_dead_tuples;
 	PgStat_Counter t_changed_tuples;
+	PgStat_Counter t_changed_tuples_reported;
 
 	PgStat_Counter t_blocks_fetched;
 	PgStat_Counter t_blocks_hit;
@@ -429,6 +432,32 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges			Sent by the autovacuum daemon to propagate
+ *                                  the changed_tuples counter.
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
+
+/* ----------
+ * PgStat_MsgReportedChanges			Sent by the autovacuum daemon to update
+ *                                      changed_tuples_reported.
+ * ----------
+ */
+typedef struct PgStat_MsgReportedChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples_reported;
+} PgStat_MsgReportedChanges;
+
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -674,6 +703,8 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
+	PgStat_MsgReportedChanges msg_reportedchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -769,6 +800,7 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter n_live_tuples;
 	PgStat_Counter n_dead_tuples;
 	PgStat_Counter changes_since_analyze;
+	PgStat_Counter changes_since_analyze_reported;
 	PgStat_Counter inserts_since_vacuum;
 
 	PgStat_Counter blocks_fetched;
@@ -975,7 +1007,8 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changes_tuples);
+extern void pgstat_report_reportedchanges(Relation rel, PgStat_Counter changes_tuples_reported);
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9b59a7b4a5..954afb9a45 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1806,7 +1806,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2209,7 +2209,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#61Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Alvaro Herrera (#60)
Re: Autovacuum on partitioned table (autoanalyze)

On 4/3/21 9:42 PM, Alvaro Herrera wrote:

Thanks for the quick rework. I like this design much better and I think
this is pretty close to committable. Here's a rebased copy with some
small cleanups (most notably, avoid calling pgstat_propagate_changes
when the partition doesn't have a tabstat entry; also, free the lists
that are allocated in a couple of places).

I didn't actually verify that it works.

Yeah, this approach seems much simpler, I think. That being said, I
think there's a couple issues:

1) I still don't understand why inheritance and declarative partitioning
are treated differently. Seems unnecessary and surprising, but maybe
there's a good reason?

2) pgstat_recv_tabstat

Should it really reset changes_since_analyze_reported in both branches?
AFAICS if the "found" branch does this

tabentry->changes_since_analyze_reported = 0;

that means we lose the counter any time tabstats are received, no?
That'd be wrong, because we'd propagate the changes repeatedly.
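
To make the concern in point 2 concrete, here is a minimal sketch of the two counters. It is not the patch's code: the struct and the helper names (LeafStats, propagate_delta, recv_tabstat, parent_after_two_rounds) are hypothetical, and the stats collector is reduced to plain function calls. It only shows that zeroing the reported counter on every tabstat message makes the next propagation send the whole counter again instead of just the new changes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two per-table counters discussed above. */
typedef struct LeafStats
{
	int64_t		changes_since_analyze;
	int64_t		changes_since_analyze_reported;
} LeafStats;

/*
 * Add the not-yet-reported part of the leaf's counter to the parent and
 * remember that it has been reported.  Returns the propagated delta.
 */
int64_t
propagate_delta(LeafStats *leaf, int64_t *parent_changes)
{
	int64_t		delta = leaf->changes_since_analyze -
		leaf->changes_since_analyze_reported;

	assert(delta >= 0);
	*parent_changes += delta;
	leaf->changes_since_analyze_reported += delta;
	return delta;
}

/*
 * Model of receiving a tabstat message that adds new_changes to the leaf.
 * reset_reported mimics the "found" branch zeroing the reported counter.
 */
void
recv_tabstat(LeafStats *leaf, int64_t new_changes, int reset_reported)
{
	leaf->changes_since_analyze += new_changes;
	if (reset_reported)
		leaf->changes_since_analyze_reported = 0;
}

/* 100 changes, propagate, 10 more changes, propagate again. */
int64_t
parent_after_two_rounds(int reset_reported)
{
	LeafStats	leaf = {0, 0};
	int64_t		parent = 0;

	recv_tabstat(&leaf, 100, reset_reported);
	propagate_delta(&leaf, &parent);
	recv_tabstat(&leaf, 10, reset_reported);
	propagate_delta(&leaf, &parent);
	return parent;
}
```

In this model the parent ends up with 110 changes when the reported counter is left alone, and with 210 when it is reset on each tabstat message, because the first 100 changes are counted twice.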

3) pgstat_recv_analyze

Shouldn't it propagate the counters before resetting them? I understand
that for the just-analyzed relation we can't do better, but why not
propagate the counters to parents? (Not necessarily from this place in
the stat collector, maybe the analyze process should do that.)

4) pgstat_recv_reportedchanges

I think this needs to be more careful when updating the value - the
stats collector might have received other messages modifying those
counters (including e.g. PGSTAT_MTYPE_ANALYZE with a reset), so maybe we
can get into situation with

(changes_since_analyze_reported > changes_since_analyze)

if we just blindly increment the value. I'd bet that would lead to funny
stuff. So maybe this should be careful to never exceed this?
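
One way to read "never exceed" in point 4 is a clamp on the receiving side. The sketch below is only an illustration under that assumption, not what the patch does: TabCounters, apply_reported_clamped, apply_analyze_reset and reported_after_race are hypothetical names, and the message ordering is simulated by calling the functions in sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the collector's per-table entry. */
typedef struct TabCounters
{
	int64_t		changes_since_analyze;
	int64_t		changes_since_analyze_reported;
} TabCounters;

/*
 * Apply a "reported changes" increment, but never let the reported counter
 * grow past changes_since_analyze.
 */
void
apply_reported_clamped(TabCounters *tab, int64_t reported_delta)
{
	assert(reported_delta >= 0);
	tab->changes_since_analyze_reported += reported_delta;
	if (tab->changes_since_analyze_reported > tab->changes_since_analyze)
		tab->changes_since_analyze_reported = tab->changes_since_analyze;
}

/* Model of an analyze message that carries the reset flag. */
void
apply_analyze_reset(TabCounters *tab)
{
	tab->changes_since_analyze = 0;
	tab->changes_since_analyze_reported = 0;
}

/*
 * The sender saw 100 unreported changes, but the reset reaches the
 * collector before the reported-changes message does.
 */
int64_t
reported_after_race(void)
{
	TabCounters tab = {100, 0};

	apply_analyze_reset(&tab);			/* reset is processed first */
	apply_reported_clamped(&tab, 100);	/* stale increment arrives later */
	return tab.changes_since_analyze_reported;
}
```

Without the clamp the entry would end up with reported = 100 and changes = 0, which is the inverted state described above; with it, the stale increment is simply absorbed.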

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#62Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#61)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-04, Tomas Vondra wrote:

1) I still don't understand why inheritance and declarative partitioning
are treated differently. Seems unnecessary and surprising, but maybe
there's a good reason?

I suppose the rationale is that for inheritance we have always done it
that way -- similar things have been done that way for inheritance
historically, to avoid messing with long-standing behavior. We have
done that in a bunch of places -- DDL behavior, FKs, etc. Maybe in this
case it's not justified. It *will* change behavior, in the sense that
we are going to capture stats that have never been captured before.
That might or might not affect query plans for designs using regular
inheritance. But it seems reasonable to think that those changes will
be for the good; and if it does break plans for some people and they
want to revert to the original behavior, they can just set
autovacuum_enabled to off for the parent tables.

So, I agree that we should enable this new feature for inheritance
parents too.

I can't comment on the other issues. I hope to give this a closer look
tomorrow my time; with luck Hosoya-san will have commented by then.

--
Álvaro Herrera 39°49'30"S 73°17'W
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)

#63Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Alvaro Herrera (#62)
Re: Autovacuum on partitioned table (autoanalyze)

On 4/4/21 10:05 PM, Alvaro Herrera wrote:

On 2021-Apr-04, Tomas Vondra wrote:

1) I still don't understand why inheritance and declarative partitioning
are treated differently. Seems unnecessary and surprising, but maybe
there's a good reason?

I suppose the rationale is that for inheritance we have always done it
that way -- similar things have been done that way for inheritance
historically, to avoid messing with long-standing behavior. We have
done that in a bunch of places -- DDL behavior, FKs, etc. Maybe in this
case it's not justified. It *will* change behavior, in the sense that
we are going to capture stats that have never been captured before.
That might or might not affect query plans for designs using regular
inheritance. But it seems reasonable to think that those changes will
be for the good; and if it does break plans for some people and they
want to revert to the original behavior, they can just set
autovacuum_enabled to off for the parent tables.

So, I agree that we should enable this new feature for inheritance
parents too.

Not sure. AFAICS the missing stats on parents are an issue both for
inheritance and partitioning. Maybe there is a reason to maintain the
current behavior with inheritance, but I don't see it.

With the other features, I think the reason for not implementing that
for inheritance was that it'd be more complex, compared to declarative
partitioning (which has stricter limitations on the partitions, etc.).
But in this case I think there's no difference in complexity, the same
code can handle both cases.

In fact, one of the first posts in this thread links to this:

/messages/by-id/4823.1262132964@sss.pgh.pa.us

i.e. Tom actually proposed doing something like this back in 2009, so
presumably he thought it was desirable back then.

OTOH he argued against adding another per-table counter and proposed
essentially what the patch did before, i.e. propagating the counter
after analyze. But we know that may trigger analyze too often ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#64Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#63)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-04, Tomas Vondra wrote:

In fact, one of the first posts in this thread links to this:

/messages/by-id/4823.1262132964@sss.pgh.pa.us

i.e. Tom actually proposed doing something like this back in 2009, so
presumably he thought it was desirable back then.

OTOH he argued against adding another per-table counter and proposed
essentially what the patch did before, i.e. propagating the counter
after analyze. But we know that may trigger analyze too often ...

Yeah, I think that's a doomed approach. The reason to avoid another
column is to avoid bloat, which is good but if we end up with an
unworkable design then we know we have to backtrack on it.

I was thinking that we could get away with having a separate pgstat
struct for partitioned tables, to avoid enlarging the struct for all
tables, but if we're going to also include legacy inheritance in the
feature clearly that doesn't work.

--
Álvaro Herrera Valdivia, Chile
"After a quick R of TFM, all I can say is HOLY CR** THAT IS COOL! PostgreSQL was
amazing when I first started using it at 7.2, and I'm continually astounded by
learning new features and techniques made available by the continuing work of
the development team."
Berend Tober, http://archives.postgresql.org/pgsql-hackers/2007-08/msg01009.php

#65Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#61)
Re: Autovacuum on partitioned table (autoanalyze)

On 4/4/21 9:08 PM, Tomas Vondra wrote:

On 4/3/21 9:42 PM, Alvaro Herrera wrote:

Thanks for the quick rework. I like this design much better and I think
this is pretty close to committable. Here's a rebased copy with some
small cleanups (most notably, avoid calling pgstat_propagate_changes
when the partition doesn't have a tabstat entry; also, free the lists
that are allocated in a couple of places).

I didn't actually verify that it works.

...

3) pgstat_recv_analyze

Shouldn't it propagate the counters before resetting them? I understand
that for the just-analyzed relation we can't do better, but why not
propagate the counters to parents? (Not necessarily from this place in
the stat collector, maybe the analyze process should do that.)

FWIW the scenario I had in mind is something like this:

create table t (a int, b int) partition by hash (a);
create table p0 partition of t for values with (modulus 2, remainder 0);
create table p1 partition of t for values with (modulus 2, remainder 1);

insert into t select i, i from generate_series(1,1000000) s(i);

select relname, n_mod_since_analyze from pg_stat_user_tables;

test=# select relname, n_mod_since_analyze from pg_stat_user_tables;
 relname | n_mod_since_analyze
---------+---------------------
 t       |                   0
 p0      |              499375
 p1      |              500625
(3 rows)

test=# analyze p0, p1;
ANALYZE
test=# select relname, n_mod_since_analyze from pg_stat_user_tables;
 relname | n_mod_since_analyze
---------+---------------------
 t       |                   0
 p0      |                   0
 p1      |                   0
(3 rows)

This may seem a bit silly - who would analyze the hash partitions
directly? However, with other partitioning schemes (list, range) it's
quite plausible that people load data directly into a partition. They can
analyze the parent explicitly too, but with multi-level partitioning
that probably requires analyzing all the ancestors.

The other possible scenario is about rows inserted while p0/p1 are being
processed by autoanalyze. That may actually take quite a bit of time,
depending on vacuum cost limit. So I still think we should propagate the
delta after the analyze, before we reset the counters.
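
A minimal sketch of "propagate the delta, then reset", using the numbers from the example above. The names here (PartStats, analyze_and_propagate, parent_changes_after_leaf_analyze) are hypothetical and the collector is modeled as plain structs linked by a parent pointer; it is only meant to show the ordering, not how the patch or the analyze code would actually do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-table counters, linked to the parent partition. */
typedef struct PartStats
{
	int64_t		changes_since_analyze;
	int64_t		changes_since_analyze_reported;
	struct PartStats *parent;	/* NULL for the topmost table */
} PartStats;

/*
 * Model of finishing ANALYZE on one relation: first hand the unreported
 * delta to every ancestor, and only then reset the relation's own counters.
 */
void
analyze_and_propagate(PartStats *rel)
{
	int64_t		delta;
	PartStats  *anc;

	assert(rel != NULL);
	delta = rel->changes_since_analyze - rel->changes_since_analyze_reported;

	if (delta > 0)
	{
		for (anc = rel->parent; anc != NULL; anc = anc->parent)
			anc->changes_since_analyze += delta;
	}

	rel->changes_since_analyze = 0;
	rel->changes_since_analyze_reported = 0;
}

/* The scenario above: "analyze p0, p1" on a two-partition table t. */
int64_t
parent_changes_after_leaf_analyze(void)
{
	PartStats	t = {0, 0, NULL};
	PartStats	p0 = {499375, 0, &t};
	PartStats	p1 = {500625, 0, &t};

	analyze_and_propagate(&p0);
	analyze_and_propagate(&p1);
	return t.changes_since_analyze;
}
```

With this ordering t would show 1000000 changes after the partitions are analyzed, rather than the 0 seen in the output above.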

I also realized relation_needs_vacanalyze is not really doing what I
suggested - it propagates the counts, but does so in the existing loop
which checks which relations need vacuum/analyze.

That means we may skip the parent table in the *current* round, because
it'll see the old (not yet updated) counts. It's likely to be processed
in the next autovacuum round, but that may actually not happen. The
trouble is the reltuples for the parent is calculated using *current*
child reltuples values, but we're comparing it to the *old* value of
changes_since_analyze. So e.g. if enough rows were inserted into the
partitions, it may still be below the analyze threshold.

What I proposed is adding a separate loop that *only* propagates the
counts, then re-reading the current stats (perhaps only if we actually
propagated anything), and only then deciding which relations need analyze.
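
The difference between the two loop structures can be sketched as follows. This is a simplified model, not the autovacuum code: RelEntry and the function names are hypothetical, the tables live in one array with the parent listed before its partitions, and "re-reading the stats" is implicit because everything is in memory. The threshold has the shape autovacuum uses for analyze (base threshold plus scale factor times reltuples); 50 and 0.1 are the default values of autovacuum_analyze_threshold and autovacuum_analyze_scale_factor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one table's stats. */
typedef struct RelEntry
{
	int			parent;			/* array index of the parent, or -1 */
	bool		is_leaf;		/* only leaf partitions propagate */
	double		reltuples;
	int64_t		changes_since_analyze;
	int64_t		changes_since_analyze_reported;
	bool		needs_analyze;
} RelEntry;

/* changes > base threshold + scale factor * reltuples */
bool
exceeds_analyze_threshold(const RelEntry *rel, double base, double scale)
{
	double		anlthresh = base + scale * rel->reltuples;

	return (double) rel->changes_since_analyze > anlthresh;
}

/* Add a leaf's unreported delta to all of its ancestors. */
void
propagate_leaf(RelEntry *rels, int i)
{
	int64_t		delta = rels[i].changes_since_analyze -
		rels[i].changes_since_analyze_reported;
	int			anc;

	if (!rels[i].is_leaf || delta <= 0)
		return;
	for (anc = rels[i].parent; anc >= 0; anc = rels[anc].parent)
		rels[anc].changes_since_analyze += delta;
	rels[i].changes_since_analyze_reported += delta;
}

/* Propagate and decide in the same loop: the parent sees stale counts. */
void
single_loop_check(RelEntry *rels, int nrels, double base, double scale)
{
	assert(rels != NULL);
	for (int i = 0; i < nrels; i++)
	{
		propagate_leaf(rels, i);
		rels[i].needs_analyze = exceeds_analyze_threshold(&rels[i], base, scale);
	}
}

/* Pass 1 only propagates; pass 2 decides using the updated counts. */
void
two_pass_check(RelEntry *rels, int nrels, double base, double scale)
{
	assert(rels != NULL);
	for (int i = 0; i < nrels; i++)
		propagate_leaf(rels, i);
	for (int i = 0; i < nrels; i++)
		rels[i].needs_analyze = exceeds_analyze_threshold(&rels[i], base, scale);
}

/* Is the partitioned table t (index 0) picked up in this round? */
bool
parent_flagged(bool separate_pass)
{
	RelEntry	rels[3] = {
		{-1, false, 1000000.0, 0, 0, false},	/* t */
		{0, true, 500000.0, 499375, 0, false},	/* p0 */
		{0, true, 500000.0, 500625, 0, false},	/* p1 */
	};

	if (separate_pass)
		two_pass_check(rels, 3, 50.0, 0.1);
	else
		single_loop_check(rels, 3, 50.0, 0.1);
	return rels[0].needs_analyze;
}
```

In this model the single loop leaves t unflagged in the current round, because t is checked while its counter is still 0, while the two-pass version flags it (1000000 changes against a threshold of about 100050).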

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#66Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#61)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-04, Tomas Vondra wrote:

1) I still don't understand why inheritance and declarative partitioning
are treated differently. Seems unnecessary and surprising, but maybe
there's a good reason?

I think there is a good reason to treat them the same: pgstat does not
have a provision to keep stats both of the table with children, and the
table without children. It can only have one of those. For
partitioning that doesn't matter: since the table-without-children
doesn't have anything on its own (no scans, no tuples, no nothing) then
we can just use the entry to store the table-with-children data. But
for the inheritance case, the parent can have its own tuples and counts
its own scans and so on; so if we change things, we'll overwrite the
stats. Maybe in the long-term we should allow pgstat to differentiate
those cases, but that seems not in scope for this patch.

I'm working on the code to fix the other issues.

--
Álvaro Herrera 39°49'30"S 73°17'W

#67Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#66)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

On 2021-04-06 16:56:49 -0400, Alvaro Herrera wrote:

I think there is a good reason to treat them the same: pgstat does not
have a provision to keep stats both of the table with children, and the
table without children. It can only have one of those. For
partitioning that doesn't matter: since the table-without-children
doesn't have anything on its own (no scans, no tuples, no nothing) then
we can just use the entry to store the table-with-children data. But
for the inheritance case, the parent can have its own tuples and counts
its own scans and so on; so if we change things, we'll overwrite the
stats. Maybe in the long-term we should allow pgstat to differentiate
those cases, but that seems not in scope for this patch.

FWIW, I think it shouldn't be too hard to do that once the shared memory
stats patch goes in (not 14, unfortunately). The hardest part will be to
avoid exploding the number of interface functions, but I think we can
figure out a way to deal with that.

Greetings,

Andres Freund

#68yuzuko
yuzukohosoya@gmail.com
In reply to: Andres Freund (#67)
Re: Autovacuum on partitioned table (autoanalyze)

Hello,

Thank you for reviewing.
I'm working on fixing the patch according to the comments.
I'll send it as soon as I can.

On 2021-04-06 16:56:49 -0400, Alvaro Herrera wrote:

I think there is a good reason to treat them the same: pgstat does not
have a provision to keep stats both of the table with children, and the
table without children. It can only have one of those. For
partitioning that doesn't matter: since the table-without-children
doesn't have anything on its own (no scans, no tuples, no nothing) then
we can just use the entry to store the table-with-children data. But
for the inheritance case, the parent can have its own tuples and counts
its own scans and so on; so if we change things, we'll overwrite the
stats. Maybe in the long-term we should allow pgstat to differentiate
those cases, but that seems not in scope for this patch.

FWIW, I think it shouldn't be too hard to do that once the shared memory
stats patch goes in (not 14, unfortunately). The hardest part will be to
avoid exploding the number of interface functions, but I think we can
figure out a way to deal with that.

I've been thinking about traditional inheritance, and I realized that we
need additional handling to support it because, unlike declarative
partitioning, parents may have rows of their own in the case of
traditional inheritance, as Alvaro mentioned. So I think we should
support only declarative partitioning in this patch for now. What do you
think? I'm not sure, but if we can solve this at low cost by using the
shared memory stats patch, should we wait for that patch?

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

#69Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: yuzuko (#68)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-07, yuzuko wrote:

I'm working on fixing the patch according to the comments.
I'll send it as soon as I can.

Thanks, I've been giving it a look too.

I've been thinking about traditional inheritance, I realized that we
need additional handling to support them because unlike declarative
partitioning, parents may have some rows in the case of traditional
inheritance as Alvaro mentioned. So I think we should support only
declarative partitioning in this patch for now, but what do you think?

Yeah, not fixable at present I think.

I'm not sure but if we can solve this matter at low cost by using the
shared memory stats patch, should we wait for the patch?

Let's do that for 15.

--
Álvaro Herrera 39°49'30"S 73°17'W
"The problem with the future is that it keeps turning into the present"
(Hobbes)

#70yuzuko
yuzukohosoya@gmail.com
In reply to: Tomas Vondra (#61)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

I fixed the patch according to the following comments.
The latest patch is attached. It is based on the v14 patch Alvaro attached earlier.

On Mon, Apr 5, 2021 at 4:08 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 4/3/21 9:42 PM, Alvaro Herrera wrote:

Thanks for the quick rework. I like this design much better and I think
this is pretty close to committable. Here's a rebased copy with some
small cleanups (most notably, avoid calling pgstat_propagate_changes
when the partition doesn't have a tabstat entry; also, free the lists
that are allocated in a couple of places).

I didn't actually verify that it works.

Yeah, this approach seems much simpler, I think. That being said, I
think there's a couple issues:

1) I still don't understand why inheritance and declarative partitioning
are treated differently. Seems unnecessary and surprising, but maybe
there's a good reason?

As we discussed in this thread, this patch should handle only declarative
partitioning for now.

2) pgstat_recv_tabstat

Should it really reset changes_since_analyze_reported in both branches?
AFAICS if the "found" branch does this

tabentry->changes_since_analyze_reported = 0;

that means we lose the counter any time tabstats are received, no?
That'd be wrong, because we'd propagate the changes repeatedly.

I changed the code so that the changes_since_analyze_reported counter is
no longer reset when tabstats are received.

3) pgstat_recv_analyze

Shouldn't it propagate the counters before resetting them? I understand
that for the just-analyzed relation we can't do better, but why not
propagate the counters to parents? (Not necessarily from this place in
the stat collector, maybe the analyze process should do that.)

Thanks to the examples you offered in another email, I realized that we
should propagate the counters for manual ANALYZE too.
I fixed that for manual ANALYZE.

4) pgstat_recv_reportedchanges

I think this needs to be more careful when updating the value - the
stats collector might have received other messages modifying those
counters (including e.g. PGSTAT_MTYPE_ANALYZE with a reset), so maybe we
can get into situation with

(changes_since_analyze_reported > changes_since_analyze)

if we just blindly increment the value. I'd bet that would lead to funny
stuff. So maybe this should be careful to never exceed this?

pgstat_propagate_changes() calls pgstat_report_reportedchanges()
only if (changes_since_analyze_reported < changes_since_analyze).
So I think we won't get into such a situation

(changes_since_analyze_reported > changes_since_analyze)

but am I missing something?

I also realized relation_needs_vacanalyze is not really doing what I
suggested - it propagates the counts, but does so in the existing loop
which checks which relations need vacuum/analyze.

That means we may skip the parent table in the *current* round, because
it'll see the old (not yet updated) counts. It's likely to be processed
in the next autovacuum round, but that may actually not happen. The
trouble is the reltuples for the parent is calculated using *current*
child reltuples values, but we're comparing it to the *old* value of
changes_since_analyze. So e.g. if enough rows were inserted into the
partitions, it may still be below the analyze threshold.

Indeed, the partitioned table was not analyzed at the same time as
its leaf partitions because of the delay in propagating the counters.
Following your proposal, I added a separate loop that propagates the
counters before collecting the list of relations to vacuum/analyze.

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

Attachments:

v15_autovacuum_on_partitioned_table.patchapplication/octet-stream; name=v15_autovacuum_on_partitioned_table.patchDownload
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d897bbe..5554275 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1962,12 +1962,11 @@ bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d..fb41b06 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -660,7 +660,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -680,7 +680,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f84616d..6e06dc3 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -89,7 +89,7 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel,
 						   VacuumParams *params, List *va_cols,
 						   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
-						   bool inh, bool in_outer_xact, int elevel);
+						   bool inh, Oid toprel_oid, bool in_outer_xact, int elevel);
 static void compute_index_stats(Relation onerel, double totalrows,
 								AnlIndexData *indexdata, int nindexes,
 								HeapTuple *rows, int numrows,
@@ -118,7 +118,8 @@ static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
  */
 void
 analyze_rel(Oid relid, RangeVar *relation,
-			VacuumParams *params, List *va_cols, bool in_outer_xact,
+			VacuumParams *params, List *va_cols,
+			Oid toprel_oid, bool in_outer_xact,
 			BufferAccessStrategy bstrategy)
 {
 	Relation	onerel;
@@ -259,14 +260,14 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 */
 	if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
 		do_analyze_rel(onerel, params, va_cols, acquirefunc,
-					   relpages, false, in_outer_xact, elevel);
+					   relpages, false, toprel_oid, in_outer_xact, elevel);
 
 	/*
 	 * If there are child tables, do recursive ANALYZE.
 	 */
 	if (onerel->rd_rel->relhassubclass)
 		do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
-					   true, in_outer_xact, elevel);
+					   true, toprel_oid, in_outer_xact, elevel);
 
 	/*
 	 * Close source relation now, but keep lock so that no one deletes it
@@ -289,8 +290,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 static void
 do_analyze_rel(Relation onerel, VacuumParams *params,
 			   List *va_cols, AcquireSampleRowsFunc acquirefunc,
-			   BlockNumber relpages, bool inh, bool in_outer_xact,
-			   int elevel)
+			   BlockNumber relpages, bool inh, Oid toprel_oid,
+			   bool in_outer_xact, int elevel)
 {
 	int			attr_cnt,
 				tcnt,
@@ -655,20 +656,22 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								InvalidMultiXactId,
 								in_outer_xact);
 		}
+	}
 
-		/*
-		 * Now report ANALYZE to the stats collector.
-		 *
-		 * We deliberately don't report to the stats collector when doing
-		 * inherited stats, because the stats collector only tracks per-table
-		 * stats.
-		 *
-		 * Reset the changes_since_analyze counter only if we analyzed all
-		 * columns; otherwise, there is still work for auto-analyze to do.
-		 */
+	/*
+	 * Now report ANALYZE to the stats collector.
+	 *
+	 * Regarding inherited stats, we report only in the case of declarative
+	 * partitioning.  For partitioning based on inheritance, stats collector
+	 * only tracks per-table stats.
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
-	}
+							  (va_cols == NIL), toprel_oid);
+
 
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 662aff0..d0dcd0c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -474,7 +474,7 @@ vacuum(List *relations, VacuumParams *params,
 				}
 
 				analyze_rel(vrel->oid, vrel->relation, params,
-							vrel->va_cols, in_outer_xact, vac_strategy);
+							vrel->va_cols, vrel->toprel_oid, in_outer_xact, vac_strategy);
 
 				if (use_own_xacts)
 				{
@@ -801,7 +801,8 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 			oldcontext = MemoryContextSwitchTo(vac_context);
 			vacrels = lappend(vacrels, makeVacuumRelation(vrel->relation,
 														  relid,
-														  vrel->va_cols));
+														  vrel->va_cols,
+														  relid));
 			MemoryContextSwitchTo(oldcontext);
 		}
 
@@ -838,7 +839,8 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				oldcontext = MemoryContextSwitchTo(vac_context);
 				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 															  part_oid,
-															  vrel->va_cols));
+															  vrel->va_cols,
+															  relid));
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
@@ -904,7 +906,8 @@ get_all_vacuum_rels(int options)
 		oldcontext = MemoryContextSwitchTo(vac_context);
 		vacrels = lappend(vacrels, makeVacuumRelation(NULL,
 													  relid,
-													  NIL));
+													  NIL,
+													  InvalidOid));
 		MemoryContextSwitchTo(oldcontext);
 	}
 
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 01c110c..e431385 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -806,12 +806,13 @@ makeGroupingSet(GroupingSetKind kind, List *content, int location)
  *	  create a VacuumRelation node
  */
 VacuumRelation *
-makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols)
+makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols, Oid toprel_oid)
 {
 	VacuumRelation *v = makeNode(VacuumRelation);
 
 	v->relation = relation;
 	v->oid = oid;
 	v->va_cols = va_cols;
+	v->toprel_oid = toprel_oid;
 	return v;
 }
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 8b1bad0..7b915cd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10692,7 +10692,7 @@ opt_name_list:
 vacuum_relation:
 			qualified_name opt_name_list
 				{
-					$$ = (Node *) makeVacuumRelation($1, InvalidOid, $2);
+					$$ = (Node *) makeVacuumRelation($1, InvalidOid, $2, InvalidOid);
 				}
 		;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c..096a979 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -74,7 +74,9 @@
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -1970,6 +1972,7 @@ do_autovacuum(void)
 	bool		did_vacuum = false;
 	bool		found_concurrent_worker = false;
 	int			i;
+	bool		updated = false;
 
 	/*
 	 * StartTransactionCommand and CommitTransactionCommand will automatically
@@ -2055,11 +2058,11 @@ do_autovacuum(void)
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
 	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * relations, materialized views and partitioned tables, and on the second
+	 * one we collect TOAST tables. The reason for doing the second pass is
+	 * that during it we want to use the main relation's pg_class.reloptions
+	 * entry if the TOAST table does not have any, and we cannot obtain it
+	 * unless we know beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2068,6 +2071,42 @@ do_autovacuum(void)
 	relScan = table_beginscan_catalog(classRel, 0, NULL);
 
 	/*
+	 * Before collecting the list of tables to vacuum, we propagate
+	 * changes_since_analyze count from leaf partitions to ancestors.
+	 * This counter enables auto-analyze on partitioned tables.
+	 */
+	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+		PgStat_StatTabEntry *tabentry;
+		Oid			relid;
+
+		if (!classForm->relispartition ||
+			classForm->relkind == RELKIND_PARTITIONED_TABLE ||
+			classForm->relpersistence == RELPERSISTENCE_TEMP)
+			continue;
+
+		relid = classForm->oid;
+
+		/* Fetch the pgstat entry for this table */
+		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+											 shared, dbentry);
+
+		/* Propagate counter to all of ancestors. */
+		if (tabentry)
+			pgstat_propagate_changes(classForm, tabentry, InvalidOid, &updated);	
+	}
+
+	/* Use fresh stats */
+	if (updated)
+	{
+		autovac_refresh_stats();
+
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);	
+		shared = pgstat_fetch_stat_dbentry(InvalidOid);
+	}
+
+	/*
 	 * On the first pass, we collect main tables to vacuum, and also the main
 	 * table relid to TOAST relid mapping.
 	 */
@@ -2082,7 +2121,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2745,6 +2785,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3161,7 +3202,44 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+		{
+			reltuples = classForm->reltuples;
+		}
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up
+			 * children's reltuples.
+			 */
+			List	   *children;
+			ListCell   *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid			childOID = lfirst_oid(lc);
+				HeapTuple	childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (RELKIND_HAS_STORAGE(childclass->relkind))
+				{
+					/* Sum up the child's reltuples for its parent table */
+					reltuples += childclass->reltuples;
+				}
+				ReleaseSysCache(childtuple);
+			}
+
+			list_free(children);
+		}
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
@@ -3225,7 +3303,7 @@ autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy)
 
 	/* Set up one VacuumRelation target, identified by OID, for vacuum() */
 	rangevar = makeRangeVar(tab->at_nspname, tab->at_relname, -1);
-	rel = makeVacuumRelation(rangevar, tab->at_relid, NIL);
+	rel = makeVacuumRelation(rangevar, tab->at_relid, NIL, InvalidOid);
 	rel_list = list_make1(rel);
 
 	vacuum(rel_list, &tab->at_params, bstrategy, true);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4b9bcd2..d5b9d0a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -373,6 +374,8 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len);
+static void pgstat_recv_reportedchanges(PgStat_MsgReportedChanges *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1622,12 +1625,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * Exceptional support only changes_since_analyze for partitioned tables,
+ * though they don't have any data.  This counter will tell us whether
+ * partitioned tables need autoanalyze or not.
  * --------
  */
 void
 pgstat_report_analyze(Relation rel,
 					  PgStat_Counter livetuples, PgStat_Counter deadtuples,
-					  bool resetcounter)
+					  bool resetcounter, Oid toprel_oid)
 {
 	PgStat_MsgAnalyze msg;
 
@@ -1643,23 +1649,48 @@ pgstat_report_analyze(Relation rel,
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
 	 * committing.)
+	 *
+	 * For partitioned tables, we don't report live and dead tuples, because
+	 * such tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
-	}
+		/*
+		 * If the relation is a leaf partition and this is not autovacuum process,
+		 * propagate changes_since_analyze counters to the ancestors.
+		*/
+		if (!IsAutoVacuumWorkerProcess() &&
+			rel->rd_rel->relispartition &&
+			rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE &&
+			rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
+		{
+			PgStat_StatDBEntry *dbentry;
+			PgStat_StatTabEntry *tabentry;
 
+			/* Fetch the pgstat entry for this table */
+			dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+			tabentry = pgstat_get_tab_entry(dbentry, RelationGetRelid(rel), true);
+			pgstat_propagate_changes(rel->rd_rel, tabentry, toprel_oid, NULL);
+		}
+	}
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
 	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
 	msg.m_tableoid = RelationGetRelid(rel);
@@ -1672,6 +1703,118 @@ pgstat_report_analyze(Relation rel,
 }
 
 /* --------
+ * pgstat_report_partchanges() -
+ *
+ *  Propagate changes_since_analyze counter from a leaf partition to its parent.
+ * --------
+ */
+void
+pgstat_report_partchanges(Relation rel, PgStat_Counter changed_tuples)
+{
+	PgStat_MsgPartChanges msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_PARTCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples = changed_tuples;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/* --------
+ * pgstat_report_reportedchanges() -
+ *
+ *  Tell the collector changes_since_analyze counter we have already
+ *  propagated to its ancestors.
+ * --------
+ */
+void
+pgstat_report_reportedchanges(Relation rel, PgStat_Counter changed_tuples_reported)
+{
+	PgStat_MsgReportedChanges msg;
+
+	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+		return;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPORTEDCHANGES);
+	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+	msg.m_tableoid = RelationGetRelid(rel);
+	msg.m_changed_tuples_reported = changed_tuples_reported;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+/*
+ * pgstat_propagate_changes
+ *
+ *		Propagate changes_since_analyze counter to all of ancestors
+ *		to analyze partitioned tables automatically
+ *
+ * We can decide whether a partitioned table needs auto analyze according to
+ * changes_since_analyze which is propagated from all of the leaf partitions.
+ * To know the correct difference of partitioned table from the last analyze,
+ * we should track changes_since_analyze_reported counter for leaf partitions
+ * as well as changes_since_analyze counter.  While changes_since_analyze
+ * counter tracks the number of changed tuples from the last analyze per
+ * partitions, changes_since_analyze_reported counter tracks changes_since_analyze
+ * we already propagated to ancestors.  Then, we propagate only the difference
+ * between these counters to the partitioned table.
+ */
+void
+pgstat_propagate_changes(Form_pg_class classForm, PgStat_StatTabEntry *tabentry,
+						 Oid toprel_oid, bool *updated)
+{
+	float4		anltuples,
+				anltuples_reported,
+				change_count;
+	List	   *ancestors;
+	ListCell   *lc;
+	Relation	parentrel,
+				childrel;
+
+	/*
+	 * Get its all ancestors to propagate changes_since_analyze count.
+	 * When doing manual ANALYZE on inheritance tree, toprel_oid that indicates
+	 * top level table's OID is a valid.  In this case, we should propagate
+	 * the counter to only ancestors which are not analyzed in this round.
+	 * So we get ancestors of toprel_oid.
+	 */
+	if (!OidIsValid(toprel_oid))
+		ancestors = get_partition_ancestors(classForm->oid);
+	else
+		ancestors = get_partition_ancestors(toprel_oid);
+
+	anltuples = tabentry->changes_since_analyze;
+	anltuples_reported = tabentry->changes_since_analyze_reported;
+	change_count = anltuples - anltuples_reported;
+
+	/* update changes_since_analyze of ancestors */
+	if (anltuples > 0 && change_count > 0)
+	{
+		foreach(lc, ancestors)
+		{
+			Oid			relid = lfirst_oid(lc);
+
+			parentrel = table_open(relid, AccessShareLock);
+			pgstat_report_partchanges(parentrel, change_count);
+			table_close(parentrel, AccessShareLock);
+		}
+
+		/* update own changes_since_analyze_reported */
+		childrel = table_open(classForm->oid, AccessShareLock);
+		pgstat_report_reportedchanges(childrel, change_count);
+		table_close(childrel, AccessShareLock);
+	}
+
+	/* If we updated the stats, *updated is set true to refresh that */
+	if (updated)
+		*updated = (anltuples > 0 && change_count > 0);
+
+	list_free(ancestors);
+}
+
+/* --------
  * pgstat_report_recovery_conflict() -
  *
  *	Tell the collector about a Hot Standby recovery conflict.
@@ -1986,7 +2129,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -5001,6 +5145,14 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_PARTCHANGES:
+					pgstat_recv_partchanges(&msg.msg_partchanges, len);
+					break;
+
+				case PGSTAT_MTYPE_REPORTEDCHANGES:
+					pgstat_recv_reportedchanges(&msg.msg_reportedchanges, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -5215,6 +5367,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 		result->n_live_tuples = 0;
 		result->n_dead_tuples = 0;
 		result->changes_since_analyze = 0;
+		result->changes_since_analyze_reported = 0;
 		result->inserts_since_vacuum = 0;
 		result->blocks_fetched = 0;
 		result->blocks_hit = 0;
@@ -6477,6 +6630,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported
+				= tabmsg->t_counts.t_changed_tuples_reported;
 			tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
@@ -6512,6 +6667,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported
+				+= tabmsg->t_counts.t_changed_tuples_reported;
 			tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
@@ -6868,7 +7025,10 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	 * have no good way to estimate how many of those there were.
 	 */
 	if (msg->m_resetcounter)
+	{
 		tabentry->changes_since_analyze = 0;
+		tabentry->changes_since_analyze_reported = 0;
+	}
 
 	if (msg->m_autovacuum)
 	{
@@ -6882,6 +7042,34 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_partchanges(PgStat_MsgPartChanges *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze += msg->m_changed_tuples;
+}
+
+
+static void
+pgstat_recv_reportedchanges(PgStat_MsgReportedChanges *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	tabentry->changes_since_analyze_reported += msg->m_changed_tuples_reported;
+}
+
+
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5..a4d848b 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -280,8 +280,8 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 
 /* in commands/analyze.c */
 extern void analyze_rel(Oid relid, RangeVar *relation,
-						VacuumParams *params, List *va_cols, bool in_outer_xact,
-						BufferAccessStrategy bstrategy);
+						VacuumParams *params, List *va_cols, Oid toprel_oid,
+						bool in_outer_xact,	BufferAccessStrategy bstrategy);
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 48a7ebf..8709a92 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -104,6 +104,6 @@ extern DefElem *makeDefElemExtended(char *nameSpace, char *name, Node *arg,
 
 extern GroupingSet *makeGroupingSet(GroupingSetKind kind, List *content, int location);
 
-extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols);
+extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols, Oid toprel_oid);
 
 #endif							/* MAKEFUNC_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7960cfe..7106457 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3317,6 +3317,9 @@ typedef struct VacuumRelation
 	RangeVar   *relation;		/* table name to process, or NULL */
 	Oid			oid;			/* table's OID; InvalidOid if not looked up */
 	List	   *va_cols;		/* list of column names, or NIL for all */
+
+	/* top level table's OID for manual ANALYZE inheritance tree */
+	Oid	   	    toprel_oid;
 } VacuumRelation;
 
 /* ----------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d699502..d67e82d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "catalog/pg_class.h"
 #include "datatype/timestamp.h"
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
@@ -70,6 +71,8 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_PARTCHANGES,
+	PGSTAT_MTYPE_REPORTEDCHANGES,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -127,6 +130,7 @@ typedef struct PgStat_TableCounts
 	PgStat_Counter t_delta_live_tuples;
 	PgStat_Counter t_delta_dead_tuples;
 	PgStat_Counter t_changed_tuples;
+	PgStat_Counter t_changed_tuples_reported;
 
 	PgStat_Counter t_blocks_fetched;
 	PgStat_Counter t_blocks_hit;
@@ -430,6 +434,32 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgPartChanges			Sent by the autovacuum daemon to propagate
+ *                                  the changed_tuples counter.
+ * ----------
+ */
+typedef struct PgStat_MsgPartChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples;
+} PgStat_MsgPartChanges;
+
+/* ----------
+ * PgStat_MsgReportedChanges			Sent by the autovacuum daemon to update
+ *                                      changed_tuples_reported.
+ * ----------
+ */
+typedef struct PgStat_MsgReportedChanges
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	PgStat_Counter m_changed_tuples_reported;
+} PgStat_MsgReportedChanges;
+
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -675,6 +705,8 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgPartChanges msg_partchanges;
+	PgStat_MsgReportedChanges msg_reportedchanges;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -770,6 +802,7 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter n_live_tuples;
 	PgStat_Counter n_dead_tuples;
 	PgStat_Counter changes_since_analyze;
+	PgStat_Counter changes_since_analyze_reported;
 	PgStat_Counter inserts_since_vacuum;
 
 	PgStat_Counter blocks_fetched;
@@ -1444,7 +1477,12 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 								 PgStat_Counter livetuples, PgStat_Counter deadtuples);
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
-								  bool resetcounter);
+								  bool resetcounter, Oid toprel_oid);
+
+extern void pgstat_report_partchanges(Relation rel, PgStat_Counter changes_tuples);
+extern void pgstat_report_reportedchanges(Relation rel, PgStat_Counter changes_tuples_reported);
+extern void pgstat_propagate_changes(Form_pg_class classForm, PgStat_StatTabEntry *tabentry,
+									 Oid toprel_oid, bool *updated);
 
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9b59a7b..954afb9 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1806,7 +1806,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2209,7 +2209,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#71Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: yuzuko (#70)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

OK, I bit the bullet and redid the logic the way I had proposed earlier
in the thread: the propagation now happens on the collector's side. We
send only the list of ancestors, and the collector reads the tuple
change count itself and adds it to each ancestor. This seems less
wasteful. Attached is v16, which does it that way and seems to work
nicely in my testing.
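
As a rough illustration of that idea (a sketch under assumptions, not the
code in the attached patch), the message can carry nothing but OIDs and
the receiving side can do the arithmetic. StatEntry, AncestorsMsg,
lookup_entry(), recv_ancestors() and MAX_ANCESTORS are all hypothetical
names, and the sketch deliberately omits any bookkeeping that would
prevent the same changes from being added twice.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ANCESTORS 8		/* arbitrary bound for this sketch */

typedef unsigned int Oid;

/* Hypothetical collector-side entry for one table. */
typedef struct StatEntry
{
	Oid			oid;
	long		changes_since_analyze;
} StatEntry;

/* Hypothetical message: only OIDs travel, no tuple counts. */
typedef struct AncestorsMsg
{
	Oid			leaf;
	int			nancestors;
	Oid			ancestors[MAX_ANCESTORS];
} AncestorsMsg;

/* Linear lookup; returns NULL when the OID has no entry. */
static StatEntry *
lookup_entry(StatEntry *tab, size_t ntab, Oid oid)
{
	for (size_t i = 0; i < ntab; i++)
		if (tab[i].oid == oid)
			return &tab[i];
	return NULL;
}

/*
 * Receiver side: read the leaf's own counter and add it to each listed
 * ancestor.  Returns the amount added per ancestor (0 if nothing to do).
 */
static long
recv_ancestors(StatEntry *tab, size_t ntab, const AncestorsMsg *msg)
{
	StatEntry  *leaf;

	assert(msg->nancestors >= 0 && msg->nancestors <= MAX_ANCESTORS);

	leaf = lookup_entry(tab, ntab, msg->leaf);
	if (leaf == NULL || leaf->changes_since_analyze <= 0)
		return 0;

	for (int i = 0; i < msg->nancestors; i++)
	{
		StatEntry  *anc = lookup_entry(tab, ntab, msg->ancestors[i]);

		if (anc != NULL)
			anc->changes_since_analyze += leaf->changes_since_analyze;
	}
	return leaf->changes_since_analyze;
}
```

The sender never has to fetch the leaf's stats entry, which is where the
saving over the backend-side propagation would come from.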

However, I just noticed there is a huge problem, which is that the new
code in relation_needs_vacanalyze() is doing find_all_inheritors(), and
we don't necessarily have a snapshot that lets us do that. While adding
a snapshot acquisition at that spot is a very easy fix, I hesitate to
fix it that way, because the whole idea there seems quite wasteful: we
have to look up, open and lock every single partition, on every single
autovacuum iteration through the database. That seems bad. I'm
inclined to think that a better idea may be to store reltuples for the
partitioned table in pg_class.reltuples, instead of having to add up the
reltuples of each partition. I haven't checked if this is likely to
break anything.
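
For reference, the analyze decision compares changes_since_analyze
against autovacuum_analyze_threshold + autovacuum_analyze_scale_factor *
reltuples. The sketch below (hypothetical helper names, plain arrays
instead of catalog lookups) contrasts the two ways of obtaining
reltuples: summing over the children on every check, which costs one
lookup per partition, versus reading a single stored value.

```c
#include <assert.h>
#include <stddef.h>

/* anlthresh = base threshold + scale factor * reltuples */
static double
analyze_threshold(int base_thresh, double scale_factor, double reltuples)
{
	assert(base_thresh >= 0 && scale_factor >= 0.0);
	return (double) base_thresh + scale_factor * reltuples;
}

/*
 * Summing approach: walk every child on each check.  The cost grows
 * with the number of partitions.
 */
static double
sum_child_reltuples(const double *child_reltuples, size_t nchildren)
{
	double		total = 0.0;

	for (size_t i = 0; i < nchildren; i++)
		total += child_reltuples[i];
	return total;
}

/* The parent needs analyze once its change counter exceeds the threshold. */
static int
needs_analyze(double changes_since_analyze, double anlthresh)
{
	return changes_since_analyze > anlthresh;
}
```

If the partitioned table's own pg_class.reltuples held the total,
analyze_threshold() could be called with that one value and the loop in
sum_child_reltuples() would disappear from the autovacuum path.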

(Also, a minor buglet: if we run ANALYZE (col1) and then ANALYZE (col2)
on a partition, we propagate the counts to the parent table each time,
so the parent would be analyzed more often than it should be. Sounds
like we should not send the ancestor list when a column list is given
to manual analyze. I haven't verified this, however.)
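
A toy model of that suspected buglet, under the assumption that the
leaf's counter is added to the parent on every manual ANALYZE and is
reset only when all columns were analyzed. Counters and
manual_analyze_leaf() are hypothetical names; the guard flag stands for
the proposed rule of not sending the ancestor list when a column list is
given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair of counters: one leaf partition and its parent. */
typedef struct Counters
{
	long		leaf_changes;	/* leaf's changes_since_analyze */
	long		parent_changes; /* parent's changes_since_analyze */
} Counters;

/*
 * Model one manual ANALYZE of the leaf.
 * ncols == 0 means no column list was given (all columns analyzed).
 * guard != 0 skips the propagation when a column list is present.
 */
static void
manual_analyze_leaf(Counters *c, size_t ncols, int guard)
{
	int			all_columns = (ncols == 0);

	assert(c != NULL);

	if (all_columns || !guard)
		c->parent_changes += c->leaf_changes;

	/* The leaf counter is reset only when all columns were analyzed. */
	if (all_columns)
		c->leaf_changes = 0;
}
```

Without the guard, two column-list runs on a leaf holding 100 changes
leave the parent at 200; with the guard the parent stays at 0 until a
full ANALYZE propagates the 100 once and resets the leaf.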

--
Álvaro Herrera Valdivia, Chile
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.

Attachments:

v16_autovacuum_on_partitioned_table.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d897bbec2b..5554275e64 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -246,7 +246,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -420,7 +420,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP,
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1962,12 +1962,11 @@ bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
 	/*
-	 * There are no options for partitioned tables yet, but this is able to do
-	 * some validation.
+	 * autovacuum_enabled, autovacuum_analyze_threshold and
+	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
 	 */
-	return (bytea *) build_reloptions(reloptions, validate,
-									  RELOPT_KIND_PARTITIONED,
-									  0, NULL, 0);
+
+	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4d6b232787..a47e102f36 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -660,7 +660,7 @@ CREATE VIEW pg_stat_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_xact_all_tables AS
@@ -680,7 +680,7 @@ CREATE VIEW pg_stat_xact_all_tables AS
     FROM pg_class C LEFT JOIN
          pg_index I ON C.oid = I.indrelid
          LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-    WHERE C.relkind IN ('r', 't', 'm')
+    WHERE C.relkind IN ('r', 't', 'm', 'p')
     GROUP BY C.oid, N.nspname, C.relname;
 
 CREATE VIEW pg_stat_sys_tables AS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f84616d3d2..0789117bb8 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -612,8 +612,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								 PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE);
 
 	/*
-	 * Update pages/tuples stats in pg_class, and report ANALYZE to the stats
-	 * collector ... but not if we're doing inherited stats.
+	 * Update pages/tuples stats in pg_class ... but not if we're doing
+	 * inherited stats.
 	 *
 	 * We assume that VACUUM hasn't set pg_class.reltuples already, even
 	 * during a VACUUM ANALYZE.  Although VACUUM often updates pg_class,
@@ -655,19 +655,33 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								InvalidMultiXactId,
 								in_outer_xact);
 		}
+	}
 
-		/*
-		 * Now report ANALYZE to the stats collector.
-		 *
-		 * We deliberately don't report to the stats collector when doing
-		 * inherited stats, because the stats collector only tracks per-table
-		 * stats.
-		 *
-		 * Reset the changes_since_analyze counter only if we analyzed all
-		 * columns; otherwise, there is still work for auto-analyze to do.
-		 */
+	/*
+	 * Now report ANALYZE to the stats collector.  For regular tables, we do
+	 * it only if not doing inherited stats.  For partitioned tables, we only
+	 * do it for inherited stats. (We're never called for not-inherited stats
+	 * on partitioned tables anyway.)
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
+
+	/*
+	 * If this is a manual analyze of a permanent leaf partition and not doing
+	 * inherited stats, also let the collector know about the ancestor tables
+	 * of this partition.  Autovacuum does the equivalent of this at the start
+	 * of its run, so there's no reason to do it there.
+	 */
+	if (!inh && !IsAutoVacuumWorkerProcess() &&
+		onerel->rd_rel->relispartition &&
+		onerel->rd_rel->relkind == RELKIND_RELATION &&
+		onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+	{
+		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
 	}
 
 	/*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13e..5517836be6 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,6 +75,7 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
+#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -1969,6 +1970,7 @@ do_autovacuum(void)
 	int			effective_multixact_freeze_max_age;
 	bool		did_vacuum = false;
 	bool		found_concurrent_worker = false;
+	bool		updated = false;
 	int			i;
 
 	/*
@@ -2054,12 +2056,18 @@ do_autovacuum(void)
 	/*
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
-	 * We do this in two passes: on the first one we collect the list of plain
-	 * relations and materialized views, and on the second one we collect
-	 * TOAST tables. The reason for doing the second pass is that during it we
-	 * want to use the main relation's pg_class.reloptions entry if the TOAST
-	 * table does not have any, and we cannot obtain it unless we know
-	 * beforehand what's the main table OID.
+	 * We do this in three passes: First we let pgstat collector know about
+	 * the partitioned table ancestors of all partitions that have recently
+	 * acquired rows for analyze.  This informs the second pass about the
+	 * total number of tuple count in partitioning hierarchies.
+	 *
+	 * On the second pass, we collect the list of plain relations, materialized
+	 * views and partitioned tables.  On the third one we collect TOAST tables.
+	 *
+	 * The reason for doing the third pass is that during it we want to use the
+	 * main relation's pg_class.reloptions entry if the TOAST table does not
+	 * have any, and we cannot obtain it unless we know beforehand what's the
+	 * main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2068,7 +2076,44 @@ do_autovacuum(void)
 	relScan = table_beginscan_catalog(classRel, 0, NULL);
 
 	/*
-	 * On the first pass, we collect main tables to vacuum, and also the main
+	 * First pass: before collecting the list of tables to vacuum, let stat
+	 * collector know about partitioned-table ancestors of each partition.
+	 */
+	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+		Oid			relid = classForm->oid;
+		PgStat_StatTabEntry	*tabentry;
+
+		/* Only consider permanent leaf partitions */
+		if (!classForm->relispartition ||
+			classForm->relkind == RELKIND_PARTITIONED_TABLE ||
+			classForm->relpersistence == RELPERSISTENCE_TEMP)
+			continue;
+
+		/*
+		 * No need to do this for partitions that haven't acquired any rows.
+		 */
+		tabentry = pgstat_fetch_stat_tabentry(relid);
+		if (tabentry &&
+			tabentry->changes_since_analyze -
+			tabentry->changes_since_analyze_reported > 0)
+		{
+			pgstat_report_anl_ancestors(relid);
+			updated = true;
+		}
+	}
+
+	/* Acquire fresh stats for the next passes, if needed */
+	if (updated)
+	{
+		autovac_refresh_stats();
+		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+		shared = pgstat_fetch_stat_dbentry(InvalidOid);
+	}
+
+	/*
+	 * On the second pass, we collect main tables to vacuum, and also the main
 	 * table relid to TOAST relid mapping.
 	 */
 	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
@@ -2082,7 +2127,8 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW)
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
 		relid = classForm->oid;
@@ -2157,7 +2203,7 @@ do_autovacuum(void)
 
 	table_endscan(relScan);
 
-	/* second pass: check TOAST tables */
+	/* third pass: check TOAST tables */
 	ScanKeyInit(&key,
 				Anum_pg_class_relkind,
 				BTEqualStrategyNumber, F_CHAREQ,
@@ -2745,6 +2791,7 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
+		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
@@ -3161,7 +3208,44 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
+		{
+			reltuples = classForm->reltuples;
+		}
+		else
+		{
+			/*
+			 * If the relation is a partitioned table, we must add up
+			 * children's reltuples.
+			 */
+			List	   *children;
+			ListCell   *lc;
+
+			reltuples = 0;
+
+			/* Find all members of inheritance set taking AccessShareLock */
+			children = find_all_inheritors(relid, AccessShareLock, NULL);
+
+			foreach(lc, children)
+			{
+				Oid			childOID = lfirst_oid(lc);
+				HeapTuple	childtuple;
+				Form_pg_class childclass;
+
+				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
+				childclass = (Form_pg_class) GETSTRUCT(childtuple);
+
+				/* Skip a partitioned table and foreign partitions */
+				if (RELKIND_HAS_STORAGE(childclass->relkind))
+				{
+					/* Sum up the child's reltuples for its parent table */
+					reltuples += childclass->reltuples;
+				}
+				ReleaseSysCache(childtuple);
+			}
+
+			list_free(children);
+		}
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5ba776e789..7777f8a18c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -343,6 +344,7 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
+static void pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1592,6 +1594,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
+ * Exceptional support only changes_since_analyze for partitioned tables,
+ * though they don't have any data.  This counter will tell us whether
+ * partitioned tables need autoanalyze or not.
  * --------
  */
 void
@@ -1613,21 +1618,31 @@ pgstat_report_analyze(Relation rel,
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
 	 * committing.)
+	 *
+	 * For partitioned tables, we don't report live and dead tuples, because
+	 * such tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
+
 	}
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
@@ -1639,6 +1654,48 @@ pgstat_report_analyze(Relation rel,
 	msg.m_live_tuples = livetuples;
 	msg.m_dead_tuples = deadtuples;
 	pgstat_send(&msg, sizeof(msg));
+
+}
+
+/*
+ * pgstat_report_anl_ancestors
+ *
+ *	Send list of partitioned table ancestors of the given partition to the
+ *	collector.  The collector is in charge of propagating the analyze tuple
+ *	counts from the partition to its ancestors.  This is necessary so that
+ *	other processes can decide whether to analyze the partitioned tables.
+ */
+void
+pgstat_report_anl_ancestors(Oid relid)
+{
+	PgStat_MsgAnlAncestors msg;
+	List	   *ancestors;
+	ListCell   *lc;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANL_ANCESTORS);
+	msg.m_databaseid = MyDatabaseId;
+	msg.m_tableoid = relid;
+	msg.m_nancestors = 0;
+
+	ancestors = get_partition_ancestors(relid);
+	foreach(lc, ancestors)
+	{
+		Oid		ancestor = lfirst_oid(lc);
+
+		msg.m_ancestors[msg.m_nancestors] = ancestor;
+		if (++msg.m_nancestors >= PGSTAT_NUM_ANCESTORENTRIES)
+		{
+			pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
+						msg.m_nancestors * sizeof(Oid));
+			msg.m_nancestors = 0;
+		}
+	}
+
+	if (msg.m_nancestors > 0)
+		pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
+					msg.m_nancestors * sizeof(Oid));
+
+	list_free(ancestors);
 }
 
 /* --------
@@ -1958,7 +2015,8 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind))
+	if (!RELKIND_HAS_STORAGE(relkind) &&
+		relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -3287,6 +3345,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
+				case PGSTAT_MTYPE_ANL_ANCESTORS:
+					pgstat_recv_anl_ancestors(&msg.msg_anl_ancestors, len);
+					break;
+
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -3501,6 +3563,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 		result->n_live_tuples = 0;
 		result->n_dead_tuples = 0;
 		result->changes_since_analyze = 0;
+		result->changes_since_analyze_reported = 0;
 		result->inserts_since_vacuum = 0;
 		result->blocks_fetched = 0;
 		result->blocks_hit = 0;
@@ -4768,6 +4831,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported
+				= tabmsg->t_counts.t_changed_tuples_reported;
 			tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
@@ -4803,6 +4868,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
+			tabentry->changes_since_analyze_reported
+				+= tabmsg->t_counts.t_changed_tuples_reported;
 			tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
@@ -5159,7 +5226,10 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	 * have no good way to estimate how many of those there were.
 	 */
 	if (msg->m_resetcounter)
+	{
 		tabentry->changes_since_analyze = 0;
+		tabentry->changes_since_analyze_reported = 0;
+	}
 
 	if (msg->m_autovacuum)
 	{
@@ -5173,6 +5243,29 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+static void
+pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len)
+{
+	PgStat_StatDBEntry *dbentry;
+	PgStat_StatTabEntry *tabentry;
+
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+
+	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+
+	for (int i = 0; i < msg->m_nancestors; i++)
+	{
+		Oid		ancestor_relid = msg->m_ancestors[i];
+		PgStat_StatTabEntry *ancestor;
+
+		ancestor = pgstat_get_tab_entry(dbentry, ancestor_relid, true);
+		ancestor->changes_since_analyze +=
+			tabentry->changes_since_analyze - tabentry->changes_since_analyze_reported;
+	}
+
+	tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze;
+
+}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 7cd137506e..88b3084640 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -69,6 +69,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
+	PGSTAT_MTYPE_ANL_ANCESTORS,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -126,6 +127,7 @@ typedef struct PgStat_TableCounts
 	PgStat_Counter t_delta_live_tuples;
 	PgStat_Counter t_delta_dead_tuples;
 	PgStat_Counter t_changed_tuples;
+	PgStat_Counter t_changed_tuples_reported;
 
 	PgStat_Counter t_blocks_fetched;
 	PgStat_Counter t_blocks_hit;
@@ -429,6 +431,25 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
+/* ----------
+ * PgStat_MsgAnlAncestors		Sent by the backend or autovacuum daemon
+ *								to inform partitioned tables that are
+ *								ancestors of a partition, to propagate
+ *								analyze counters
+ * ----------
+ */
+#define PGSTAT_NUM_ANCESTORENTRIES    \
+	((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(Oid) - sizeof(int))	\
+	 / sizeof(Oid))
+
+typedef struct PgStat_MsgAnlAncestors
+{
+	PgStat_MsgHdr m_hdr;
+	Oid			m_databaseid;
+	Oid			m_tableoid;
+	int			m_nancestors;
+	Oid			m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
+} PgStat_MsgAnlAncestors;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -674,6 +695,7 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
+	PgStat_MsgAnlAncestors msg_anl_ancestors;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -769,6 +791,7 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter n_live_tuples;
 	PgStat_Counter n_dead_tuples;
 	PgStat_Counter changes_since_analyze;
+	PgStat_Counter changes_since_analyze_reported;
 	PgStat_Counter inserts_since_vacuum;
 
 	PgStat_Counter blocks_fetched;
@@ -975,6 +998,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
+extern void pgstat_report_anl_ancestors(Oid relid);
 
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 264deda7af..a8a1cc72d0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1807,7 +1807,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_archiver| SELECT s.archived_count,
     s.last_archived_wal,
@@ -2210,7 +2210,7 @@ pg_stat_xact_all_tables| SELECT c.oid AS relid,
    FROM ((pg_class c
      LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
+  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"]))
   GROUP BY c.oid, n.nspname, c.relname;
 pg_stat_xact_sys_tables| SELECT pg_stat_xact_all_tables.relid,
     pg_stat_xact_all_tables.schemaname,
#72Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alvaro Herrera (#71)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-07, Alvaro Herrera wrote:

OK, I bit the bullet and re-did the logic in the way I had proposed
earlier in the thread: do the propagation on the collector's side, by
sending only the list of ancestors: the collector can read the tuple
change count by itself, to add it to each ancestor. This seems less
wasteful. Attached is v16 which does it that way and seems to work
nicely under my testing.

Pushed with this approach. Thanks for persisting with this.

--
Álvaro Herrera Valdivia, Chile

#73Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alvaro Herrera (#71)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-07, Alvaro Herrera wrote:

However, I just noticed there is a huge problem, which is that the new
code in relation_needs_vacanalyze() is doing find_all_inheritors(), and
we don't necessarily have a snapshot that lets us do that. While adding
a snapshot acquisition at that spot is a very easy fix, I hesitate to
fix it that way, because the whole idea there seems quite wasteful: we
have to look up, open and lock every single partition, on every single
autovacuum iteration through the database. That seems bad. I'm
inclined to think that a better idea may be to store reltuples for the
partitioned table in pg_class.reltuples, instead of having to add up the
reltuples of each partition. I haven't checked if this is likely to
break anything.

I forgot to comment on this aspect. First, I was obviously mistaken
about there not being an active snapshot. I mean, it's correct that
there isn't. The issue is that it's really a bug to require that there
is one; it just hasn't failed before because partially detached
partitions aren't very common. So I patched that as a bug in a
preliminary patch.

Next, the idea of storing the number of tuples in pg_class.reltuples is
a nice one, and I think we should consider it in the long run. However,
while it can be done as a quick job (shown in the attached, which AFAICT
works fine) there are side-effects -- for example, TRUNCATE doesn't
clear the value, which is surely wrong. I suspect that if I try to
handle it in this way, it would blow up in some corner case I forgot to
consider. So, I decided not to go that way, at least for now.

--
Álvaro Herrera Valdivia, Chile

Attachments:

partitioned-reltuples.patch (text/x-diff; charset=us-ascii)
commit 5ddb7c00e5f1d63eb251d334afce738919a772c0
Author:     Alvaro Herrera <alvherre@alvh.no-ip.org>
AuthorDate: Wed Apr 7 23:52:33 2021 -0400
CommitDate: Wed Apr 7 23:52:33 2021 -0400

    set pg_class.reltuples for partitioned tables

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0789117bb8..36e35722bc 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -656,6 +656,17 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								in_outer_xact);
 		}
 	}
+	else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		/*
+		 * Partitioned tables don't have storage, so we don't set any of these
+		 * value in their pg_class entries.  However, reltuples is necessary
+		 * in order for auto-analyze to work properly, so update that.
+		 */
+		vac_update_relstats(onerel, 0, totalrows,
+							0, false, InvalidTransactionId, InvalidMultiXactId,
+							in_outer_xact);
+	}
 
 	/*
 	 * Now report ANALYZE to the stats collector.  For regular tables, we do
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 5517836be6..48c1bf048f 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3208,44 +3208,7 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		if (classForm->relkind != RELKIND_PARTITIONED_TABLE)
-		{
-			reltuples = classForm->reltuples;
-		}
-		else
-		{
-			/*
-			 * If the relation is a partitioned table, we must add up
-			 * children's reltuples.
-			 */
-			List	   *children;
-			ListCell   *lc;
-
-			reltuples = 0;
-
-			/* Find all members of inheritance set taking AccessShareLock */
-			children = find_all_inheritors(relid, AccessShareLock, NULL);
-
-			foreach(lc, children)
-			{
-				Oid			childOID = lfirst_oid(lc);
-				HeapTuple	childtuple;
-				Form_pg_class childclass;
-
-				childtuple = SearchSysCache1(RELOID, ObjectIdGetDatum(childOID));
-				childclass = (Form_pg_class) GETSTRUCT(childtuple);
-
-				/* Skip a partitioned table and foreign partitions */
-				if (RELKIND_HAS_STORAGE(childclass->relkind))
-				{
-					/* Sum up the child's reltuples for its parent table */
-					reltuples += childclass->reltuples;
-				}
-				ReleaseSysCache(childtuple);
-			}
-
-			list_free(children);
-		}
+		reltuples = classForm->reltuples;
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
 		anltuples = tabentry->changes_since_analyze;
#74Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Alvaro Herrera (#71)
Re: Autovacuum on partitioned table (autoanalyze)

On 4/8/21 5:22 AM, Alvaro Herrera wrote:

OK, I bit the bullet and re-did the logic in the way I had proposed
earlier in the thread: do the propagation on the collector's side, by
sending only the list of ancestors: the collector can read the tuple
change count by itself, to add it to each ancestor. This seems less
wasteful. Attached is v16 which does it that way and seems to work
nicely under my testing.

However, I just noticed there is a huge problem, which is that the new
code in relation_needs_vacanalyze() is doing find_all_inheritors(), and
we don't necessarily have a snapshot that lets us do that. While adding
a snapshot acquisition at that spot is a very easy fix, I hesitate to
fix it that way, because the whole idea there seems quite wasteful: we
have to look up, open and lock every single partition, on every single
autovacuum iteration through the database. That seems bad. I'm
inclined to think that a better idea may be to store reltuples for the
partitioned table in pg_class.reltuples, instead of having to add up the
reltuples of each partition. I haven't checked if this is likely to
break anything.

How would that value get updated, for the parent?

(Also, a minor buglet: if we do ANALYZE (col1), then ANALYZE (col2) a
partition, then we repeatedly propagate the counts to the parent table,
so we would cause the parent to be analyzed more times than it should.
Sounds like we should not send the ancestor list when a column list is
given to manual analyze. I haven't verified this, however.)

Are you sure? I haven't tried, but shouldn't this be prevented by only
sending the delta between the current and last reported value?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#75Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#74)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-08, Tomas Vondra wrote:

On 4/8/21 5:22 AM, Alvaro Herrera wrote:

However, I just noticed there is a huge problem, which is that the new
code in relation_needs_vacanalyze() is doing find_all_inheritors(), and
we don't necessarily have a snapshot that lets us do that. While adding
a snapshot acquisition at that spot is a very easy fix, I hesitate to
fix it that way, because the whole idea there seems quite wasteful: we
have to look up, open and lock every single partition, on every single
autovacuum iteration through the database. That seems bad. I'm
inclined to think that a better idea may be to store reltuples for the
partitioned table in pg_class.reltuples, instead of having to add up the
reltuples of each partition. I haven't checked if this is likely to
break anything.

How would that value get updated, for the parent?

Same as for any other relation: ANALYZE would set it, after it's done
scanning the table. We would need to make sure that nothing resets it to
empty, though, and that it doesn't cause issues elsewhere. (The patch I
sent contains the minimal change to make it work, but of course that's
missing having other pieces of code maintain it.)

(Also, a minor buglet: if we do ANALYZE (col1), then ANALYZE (col2) a
partition, then we repeatedly propagate the counts to the parent table,
so we would cause the parent to be analyzed more times than it should.
Sounds like we should not send the ancestor list when a column list is
given to manual analyze. I haven't verified this, however.)

Are you sure? I haven't tried, but shouldn't this be prevented by only
sending the delta between the current and last reported value?

I did try, and yes it behaves as you say.

--
Álvaro Herrera Valdivia, Chile
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree. (Don Knuth)

#76Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Alvaro Herrera (#75)
Re: Autovacuum on partitioned table (autoanalyze)

On 4/8/21 5:27 PM, Alvaro Herrera wrote:

On 2021-Apr-08, Tomas Vondra wrote:

On 4/8/21 5:22 AM, Alvaro Herrera wrote:

However, I just noticed there is a huge problem, which is that the new
code in relation_needs_vacanalyze() is doing find_all_inheritors(), and
we don't necessarily have a snapshot that lets us do that. While adding
a snapshot acquisition at that spot is a very easy fix, I hesitate to
fix it that way, because the whole idea there seems quite wasteful: we
have to look up, open and lock every single partition, on every single
autovacuum iteration through the database. That seems bad. I'm
inclined to think that a better idea may be to store reltuples for the
partitioned table in pg_class.reltuples, instead of having to add up the
reltuples of each partition. I haven't checked if this is likely to
break anything.

How would that value get updated, for the parent?

Same as for any other relation: ANALYZE would set it, after it's done
scanning the table. We would need to make sure that nothing resets it to
empty, though, and that it doesn't cause issues elsewhere. (The patch I
sent contains the minimal change to make it work, but of course that's
missing having other pieces of code maintain it.)

So ANALYZE would inspect the child relations, sum the reltuples and set
it for the parent? IMO that'd be problematic because it'd mean we're
comparing the current number of changes with reltuples value which may
be arbitrarily stale (if we haven't analyzed the parent for a while).

That's essentially the issue I described when explaining why I think the
code needs to propagate the changes, reread the stats and then evaluate
which relations need vacuuming. It's similar to the issue of comparing
old changes_since_analyze vs. current reltuples, which is why the code
is rereading the stats before checking the thresholds. This time it's
the opposite direction - the reltuples might be stale.

FWIW I think the current refresh logic is not quite correct, because
autovac_refresh_stats does some throttling (STATS_READ_DELAY). It
probably needs a "force" parameter to ensure it actually reads the
current stats in this one case.

(Also, a minor buglet: if we do ANALYZE (col1), then ANALYZE (col2) a
partition, then we repeatedly propagate the counts to the parent table,
so we would cause the parent to be analyzed more times than it should.
Sounds like we should not send the ancestor list when a column list is
given to manual analyze. I haven't verified this, however.)

Are you sure? I haven't tried, but shouldn't this be prevented by only
sending the delta between the current and last reported value?

I did try, and yes it behaves as you say.

OK, good.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#77Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tomas Vondra (#76)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-08, Tomas Vondra wrote:

On 4/8/21 5:27 PM, Alvaro Herrera wrote:

Same as for any other relation: ANALYZE would set it, after it's done
scanning the table. We would need to make sure that nothing resets it to
empty, though, and that it doesn't cause issues elsewhere. (The patch I
sent contains the minimal change to make it work, but of course that's
missing having other pieces of code maintain it.)

So ANALYZE would inspect the child relations, sum the reltuples and set
it for the parent? IMO that'd be problematic because it'd mean we're
comparing the current number of changes with reltuples value which may
be arbitrarily stale (if we haven't analyzed the parent for a while).

What? Not at all. reltuples would be set by ANALYZE on one run, and
then the value is available for the future autovacuum run. That's how
it works for regular tables too, so I'm not sure what problem you have
with that. The (possibly stale) reltuples value is multiplied by the
scale factor, and added to the analyze_threshold value, and that's
compared with the current changes_since_analyze to determine whether to
analyze or not.
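
As a rough illustration of that comparison, here is a minimal standalone C sketch. It is not the autovacuum.c code; the struct and function names are made up for this example, and it only models the arithmetic described in the paragraph above.

```c
#include <stdbool.h>

/*
 * Hypothetical container for the inputs of the auto-analyze decision.
 * The field names mirror the terms used in this thread, not real structs.
 */
typedef struct AnalyzeInputs
{
    double reltuples;             /* row estimate stored by the last ANALYZE; may be stale */
    double changes_since_analyze; /* current counter kept by the stats collector */
    double analyze_threshold;     /* autovacuum_analyze_threshold */
    double analyze_scale_factor;  /* autovacuum_analyze_scale_factor */
} AnalyzeInputs;

/*
 * Returns true when the number of changes exceeds
 * threshold + scale_factor * reltuples.
 */
static bool
needs_analyze(const AnalyzeInputs *in)
{
    double limit = in->analyze_threshold +
                   in->analyze_scale_factor * in->reltuples;

    return in->changes_since_analyze > limit;
}
```

With the default settings (threshold 50, scale factor 0.1) and a stored reltuples of 500000, the limit works out to roughly 50050 changes. A stale reltuples only moves that limit; it does not affect the change counter itself.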

That's essentially the issue I described when explaining why I think the
code needs to propagate the changes, reread the stats and then evaluate
which relations need vacuuming. It's similar to the issue of comparing
old changes_since_analyze vs. current reltuples, which is why the code
is rereading the stats before checking the thresholds. This time it's
the opposite direction - the reltuples might be stale.

Well, I don't think the issue is the same. reltuples is always stale,
even for regular tables, because that's just how it works.
changes_since_analyze is not stale for regular tables, and that's why it
makes sense to propagate it from partitions to ancestors prior to
checking the analyze condition.

FWIW I think the current refresh logic is not quite correct, because
autovac_refresh_stats does some throttling (STATS_READ_DELAY). It
probably needs a "force" parameter to ensure it actually reads the
current stats in this one case.

Hmm ... good catch, but actually that throttling only applies to the
launcher. do_autovacuum runs in the worker, so there's no throttling.

--
Álvaro Herrera 39°49'30"S 73°17'W

#78Justin Pryzby
pryzby@telsasoft.com
In reply to: Alvaro Herrera (#72)
Re: Autovacuum on partitioned table (autoanalyze)

On Thu, Apr 08, 2021 at 01:20:14AM -0400, Alvaro Herrera wrote:

On 2021-Apr-07, Alvaro Herrera wrote:

OK, I bit the bullet and re-did the logic in the way I had proposed
earlier in the thread: do the propagation on the collector's side, by
sending only the list of ancestors: the collector can read the tuple
change count by itself, to add it to each ancestor. This seems less
wasteful. Attached is v16 which does it that way and seems to work
nicely under my testing.

Pushed with this approach. Thanks for persisting with this.

commit 0827e8af70f4653ba17ed773f123a60eadd9f9c9
| This also introduces necessary reloptions support for partitioned tables
| (autovacuum_enabled, autovacuum_analyze_scale_factor,
| autovacuum_analyze_threshold). It's unclear how best to document this
| aspect.

At least this part needs to be updated - see also ed62d3737.

doc/src/sgml/ref/create_table.sgml- The storage parameters currently
doc/src/sgml/ref/create_table.sgml- available for tables are listed below.
...
doc/src/sgml/ref/create_table.sgml: Specifying these parameters for partitioned tables is not supported,
doc/src/sgml/ref/create_table.sgml- but you may specify them for individual leaf partitions.

--
Justin

#79Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Justin Pryzby (#78)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Apr-08, Justin Pryzby wrote:

commit 0827e8af70f4653ba17ed773f123a60eadd9f9c9
| This also introduces necessary reloptions support for partitioned tables
| (autovacuum_enabled, autovacuum_analyze_scale_factor,
| autovacuum_analyze_threshold). It's unclear how best to document this
| aspect.

At least this part needs to be updated - see also ed62d3737.

doc/src/sgml/ref/create_table.sgml- The storage parameters currently
doc/src/sgml/ref/create_table.sgml- available for tables are listed below.
...
doc/src/sgml/ref/create_table.sgml: Specifying these parameters for partitioned tables is not supported,
doc/src/sgml/ref/create_table.sgml- but you may specify them for individual leaf partitions.

Ah, thanks for pointing it out. How about the attached?

This new bit reads weird:

+    Most parameters are not supported on partitioned tables, with exceptions
+    noted below; you may specify them for individual leaf partitions.

Maybe "Most parameters are not supported on partitioned tables, with
exceptions noted below; you may specify others for individual leaf
partitions."

--
Álvaro Herrera 39°49'30"S 73°17'W

Attachments:

0001-document-reloptions-for-partitioned-tables.patch (text/x-diff; charset=us-ascii)
From 37a829ec7b9c46acbbdb02f231288e39d22fcd04 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 8 Apr 2021 17:53:22 -0400
Subject: [PATCH] document reloptions for partitioned tables

---
 doc/src/sgml/ref/create_table.sgml | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 44e50620fd..3cf355cc8d 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1369,8 +1369,8 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Specifying these parameters for partitioned tables is not supported,
-    but you may specify them for individual leaf partitions.
+    Most parameters are not supported on partitioned tables, with exceptions
+    noted below; you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
@@ -1452,6 +1452,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      If true, the autovacuum daemon will perform automatic <command>VACUUM</command>
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
+     This parameter is supported on partitioned tables.
      If false, this table will not be autovacuumed, except to prevent
      transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
      more about wraparound prevention.
@@ -1576,6 +1577,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      <para>
       Per-table value for <xref linkend="guc-autovacuum-analyze-threshold"/>
       parameter.
+     This parameter is supported on partitioned tables.
      </para>
     </listitem>
    </varlistentry>
@@ -1591,6 +1593,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      <para>
       Per-table value for <xref linkend="guc-autovacuum-analyze-scale-factor"/>
       parameter.
+     This parameter is supported on partitioned tables.
      </para>
     </listitem>
    </varlistentry>
-- 
2.20.1

#80Justin Pryzby
pryzby@telsasoft.com
In reply to: Alvaro Herrera (#79)
Re: Autovacuum on partitioned table (autoanalyze)

On Thu, Apr 08, 2021 at 05:56:25PM -0400, Alvaro Herrera wrote:

On 2021-Apr-08, Justin Pryzby wrote:

commit 0827e8af70f4653ba17ed773f123a60eadd9f9c9
| This also introduces necessary reloptions support for partitioned tables
| (autovacuum_enabled, autovacuum_analyze_scale_factor,
| autovacuum_analyze_threshold). It's unclear how best to document this
| aspect.

At least this part needs to be updated - see also ed62d3737.

doc/src/sgml/ref/create_table.sgml- The storage parameters currently
doc/src/sgml/ref/create_table.sgml- available for tables are listed below.
...
doc/src/sgml/ref/create_table.sgml: Specifying these parameters for partitioned tables is not supported,
doc/src/sgml/ref/create_table.sgml- but you may specify them for individual leaf partitions.

Ah, thanks for pointing it out. How about the attached?

This new bit reads weird:

+    Most parameters are not supported on partitioned tables, with exceptions
+    noted below; you may specify them for individual leaf partitions.

"Except where noted, these parameters are not supported on partitioned tables."

--
Justin

#81Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#80)
Re: Autovacuum on partitioned table (autoanalyze)

Justin Pryzby <pryzby@telsasoft.com> writes:

On Thu, Apr 08, 2021 at 05:56:25PM -0400, Alvaro Herrera wrote:

This new bit reads weird:

+    Most parameters are not supported on partitioned tables, with exceptions
+    noted below; you may specify them for individual leaf partitions.

"Except where noted, these parameters are not supported on partitioned
tables."

I think what it's trying to get at is

"Except where noted, these parameters are not supported on partitioned
tables. However, you can specify them on individual leaf partitions."

regards, tom lane

#82Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#72)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:

On 2021-Apr-07, Alvaro Herrera wrote:

OK, I bit the bullet and re-did the logic in the way I had proposed
earlier in the thread: do the propagation on the collector's side, by
sending only the list of ancestors: the collector can read the tuple
change count by itself, to add it to each ancestor. This seems less
wasteful. Attached is v16 which does it that way and seems to work
nicely under my testing.

Pushed with this approach. Thanks for persisting with this.

I'm looking at this in the context of rebasing & polishing the shared
memory stats patch.

I have a few questions / concerns:

1) Somehow it seems like a violation to do stuff like
get_partition_ancestors() in pgstat.c. It's nothing I can't live with, but
it feels a bit off. Would likely not be too hard to address, e.g. by just
putting some of pgstat_report_anl_ancestors in partition.c instead.

2) Why does it make sense that autovacuum sends a stats message for every
partition in the system that had any changes since the last autovacuum
cycle? On a database with a good number of objects / a short naptime we'll
often end up sending messages for the same set of tables from separate
workers, because they don't yet see the concurrent
tabentry->changes_since_analyze_reported.

3) What is the goal of the autovac_refresh_stats() after the loop doing
pgstat_report_anl_ancestors()? I think it'll be common that the stats
collector hasn't even processed the incoming messages by that point, not to
speak of actually having written out a new stats file. If it took less than
10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
backend_read_statsfile() will not wait for a new stats file to be written
out, and we'll just re-read the state we previously did.

It's pretty expensive to re-read the stats file in some workloads, so I'm a
bit concerned that we end up significantly increasing the amount of stats
updates/reads, without actually gaining anything reliable?

4) In the shared mem stats patch I went to a fair bit of trouble to try to get
rid of pgstat_vacuum_stat() (which scales extremely poorly to larger
systems). For that to work pending stats can only be "staged" while holding
a lock on a relation that prevents the relation from being concurrently
dropped (pending stats increment a refcount for the shared stats object,
which ensures that we don't lose track of the fact that a stats object has
been dropped, even when stats only get submitted later).

I'm not yet clear on how to make this work for
pgstat_report_anl_ancestors() - but I probably can find a way. But it does
feel a bit off to issue stats stuff for tables we're not sure still exist.

I'll go and read through the thread, but my first thought is that having a
hashtable in do_autovacuum() that contains stats for partitioned tables would
be a good bit more efficient than the current approach? We already have a
hashtable for each toast table, compared to that having a hashtable for each
partitioned table doesn't seem like it'd be a problem?

With a small bit of extra work that could even avoid the need for the
additional pass through pg_class. Do the partitioned table data-gathering as
part of the "collect main tables to vacuum" pass, and then do one of

a) do the perform-analyze decision purely off the contents of that
partitioned-table hash
b) fetch the RELOID syscache entry by oid and then decide based on that
c) handle partitioned tables as part of the "check TOAST tables" pass - it's
not like we gain a meaningful amount of efficiency by using a ScanKey to
filter for RELKIND_TOASTVALUE, given that there's no index, and that an
index wouldn't commonly be useful given the percentage of toast tables in
pg_class

Partitioning makes it a bigger issue that do_autovacuum() does multiple passes
through pg_class (as it makes scenarios in which pg_class is large more
common), so I don't think it's great that partitioning also increases the
number of passes through pg_class.

Greetings,

Andres Freund

#83Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#82)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

On 2021-07-22 13:54:58 -0700, Andres Freund wrote:

On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:

On 2021-Apr-07, Alvaro Herrera wrote:

OK, I bit the bullet and re-did the logic in the way I had proposed
earlier in the thread: do the propagation on the collector's side, by
sending only the list of ancestors: the collector can read the tuple
change count by itself, to add it to each ancestor. This seems less
wasteful. Attached is v16 which does it that way and seems to work
nicely under my testing.

Pushed with this approach. Thanks for persisting with this.

I'm looking at this in the context of rebasing & polishing the shared
memory stats patch.

I have a few questions / concerns:

Another one, and I think this might warrant thinking about for v14:

Isn't this going to create a *lot* of redundant sampling? Especially if you
have any sort of nested partition tree. In the most absurd case a partition
with n parents will get sampled n times, solely due to changes to itself.

Look at the following example:

BEGIN;
DROP TABLE if exists p;
CREATE TABLE p (i int) partition by range(i);
CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM ( 0) to (5000) partition by range(i);
CREATE TABLE p_0_0 PARTITION OF p_0 FOR VALUES FROM ( 0) to (1000);
CREATE TABLE p_0_1 PARTITION OF p_0 FOR VALUES FROM (1000) to (2000);
CREATE TABLE p_0_2 PARTITION OF p_0 FOR VALUES FROM (2000) to (3000);
CREATE TABLE p_0_3 PARTITION OF p_0 FOR VALUES FROM (3000) to (4000);
CREATE TABLE p_0_4 PARTITION OF p_0 FOR VALUES FROM (4000) to (5000);
-- create some initial data
INSERT INTO p select generate_series(0, 5000 - 1) data FROM generate_series(1, 100) reps;
COMMIT;

UPDATE p_0_4 SET i = i;

Whenever the update is executed, all partitions will be sampled at least twice
(once for p and once for p_0), with p_0_4 sampled three times.
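
To make the counting concrete, here is a small standalone C sketch of the example tree. It is not PostgreSQL code: the parent[] array and the function names are hypothetical, and it assumes the UPDATE pushes the change counter past the analyze threshold at every level, so that p_0_4, p_0 and p are each analyzed once. Analyzing a leaf samples that leaf; analyzing a partitioned table samples every leaf below it.

```c
#define NRELS 7

/* Index:                            p   p_0 p_0_0 p_0_1 p_0_2 p_0_3 p_0_4 */
static const int parent[NRELS]  = { -1,  0,  1,    1,    1,    1,    1 };
static const int is_leaf[NRELS] = {  0,  0,  1,    1,    1,    1,    1 };

/* Returns 1 if 'rel' is 'ancestor' itself or lies somewhere below it. */
static int
is_descendant_or_self(int rel, int ancestor)
{
    for (; rel != -1; rel = parent[rel])
        if (rel == ancestor)
            return 1;
    return 0;
}

/*
 * Fill samples[] with the number of times each leaf is sampled when
 * 'modified_leaf' and each of its ancestors is analyzed exactly once.
 */
static void
count_samples(int modified_leaf, int samples[NRELS])
{
    for (int i = 0; i < NRELS; i++)
        samples[i] = 0;

    /* Walk from the modified leaf up to the root, analyzing each level. */
    for (int analyzed = modified_leaf; analyzed != -1; analyzed = parent[analyzed])
        for (int leaf = 0; leaf < NRELS; leaf++)
            if (is_leaf[leaf] && is_descendant_or_self(leaf, analyzed))
                samples[leaf]++;
}
```

Calling count_samples(6, samples) for p_0_4 gives 3 for p_0_4 and 2 for each of its siblings, matching the description above.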

Of course, this is an extreme example, but it's not hard to imagine cases
where v14 will cause the number of auto-analyzes to increase sufficiently to bog
down autovacuum to a problematic degree.

Additionally, while analyzing, all child partitions of a partitioned table
are AccessShareLock'ed at once. If a partition hierarchy has more than one
level, it actually is likely that multiple autovacuum workers will end up
processing the ancestors separately. This seems like it might contribute to
lock exhaustion issues with larger partition hierarchies?

Greetings,

Andres Freund

#84Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#83)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

CCing RMT because I think we need to do something about this for v14.

On 2021-07-27 19:23:42 -0700, Andres Freund wrote:

On 2021-07-22 13:54:58 -0700, Andres Freund wrote:

On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:

On 2021-Apr-07, Alvaro Herrera wrote:

OK, I bit the bullet and re-did the logic in the way I had proposed
earlier in the thread: do the propagation on the collector's side, by
sending only the list of ancestors: the collector can read the tuple
change count by itself, to add it to each ancestor. This seems less
wasteful. Attached is v16 which does it that way and seems to work
nicely under my testing.

Pushed with this approach. Thanks for persisting with this.

I'm looking at this in the context of rebasing & polishing the shared
memory stats patch.

I have a few questions / concerns:

Another one, and I think this might warrant thinking about for v14:

Isn't this going to create a *lot* of redundant sampling? Especially if you
have any sort of nested partition tree. In the most absurd case a partition
with n parents will get sampled n times, solely due to changes to itself.

Look at the following example:

BEGIN;
DROP TABLE if exists p;
CREATE TABLE p (i int) partition by range(i);
CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM ( 0) to (5000) partition by range(i);
CREATE TABLE p_0_0 PARTITION OF p_0 FOR VALUES FROM ( 0) to (1000);
CREATE TABLE p_0_1 PARTITION OF p_0 FOR VALUES FROM (1000) to (2000);
CREATE TABLE p_0_2 PARTITION OF p_0 FOR VALUES FROM (2000) to (3000);
CREATE TABLE p_0_3 PARTITION OF p_0 FOR VALUES FROM (3000) to (4000);
CREATE TABLE p_0_4 PARTITION OF p_0 FOR VALUES FROM (4000) to (5000);
-- create some initial data
INSERT INTO p select generate_series(0, 5000 - 1) data FROM generate_series(1, 100) reps;
COMMIT;

UPDATE p_0_4 SET i = i;

Whenever the update is executed, all partitions will be sampled at least twice
(once for p and once for p_0), with p_0_4 sampled three times.

Of course, this is an extreme example, but it's not hard to imagine cases
where v14 will cause the number of auto-analyzes to increase sufficiently to bog
down autovacuum to a problematic degree.

Additionally, while analyzing, all child partitions of a partitioned table
are AccessShareLock'ed at once. If a partition hierarchy has more than one
level, it actually is likely that multiple autovacuum workers will end up
processing the ancestors separately. This seems like it might contribute to
lock exhaustion issues with larger partition hierarchies?

I started to write a patch rejiggering the autovacuum.c portion of this
change. While testing it I hit the case of manual ANALYZEs leaving
changes_since_analyze for partitioned tables in a bogus state - without a
minimally invasive way to fix that. After a bit of confused staring I realized
that the current code has a very similar problem:

Using the same setup as above:

INSERT INTO p VALUES (0); /* repeat as many times as desired */
ANALYZE p_0_0;

At this point the system will have lost track of the changes to p_0_0, unless
an autovacuum worker was launched between the INSERTs and the ANALYZE (which
would cause pgstat_report_anl_ancestors() to report the change count upwards).

There appears to be code trying to address that, but I don't see how it
ever does anything meaningful?

    /*
     * Now report ANALYZE to the stats collector. For regular tables, we do
     * it only if not doing inherited stats. For partitioned tables, we only
     * do it for inherited stats. (We're never called for not-inherited stats
     * on partitioned tables anyway.)
     *
     * Reset the changes_since_analyze counter only if we analyzed all
     * columns; otherwise, there is still work for auto-analyze to do.
     */
    if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        pgstat_report_analyze(onerel, totalrows, totaldeadrows,
                              (va_cols == NIL));

    /*
     * If this is a manual analyze of all columns of a permanent leaf
     * partition, and not doing inherited stats, also let the collector know
     * about the ancestor tables of this partition. Autovacuum does the
     * equivalent of this at the start of its run, so there's no reason to do
     * it there.
     */
    if (!inh && !IsAutoVacuumWorkerProcess() &&
        (va_cols == NIL) &&
        onerel->rd_rel->relispartition &&
        onerel->rd_rel->relkind == RELKIND_RELATION &&
        onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
    {
        pgstat_report_anl_ancestors(RelationGetRelid(onerel));
    }

The pgstat_report_analyze() triggers pgstat_recv_analyze() to reset the
counter that pgstat_recv_anl_ancestors() would use to report changes
upwards:

    /*
     * If commanded, reset changes_since_analyze to zero. This forgets any
     * changes that were committed while the ANALYZE was in progress, but we
     * have no good way to estimate how many of those there were.
     */
    if (msg->m_resetcounter)
    {
        tabentry->changes_since_analyze = 0;
        tabentry->changes_since_analyze_reported = 0;
    }

And if one instead inverts the order of pgstat_report_analyze() and
pgstat_report_anl_ancestors() one gets a slightly different problem: A manual
ANALYZE of the partition root results in the partition root having a non-zero
changes_since_analyze afterwards. expand_vacuum_rel() causes child partitions to be
added to the list of relations, which *first* causes the partition root to be
analyzed, and *then* partitions. The partitions then report their
changes_since_analyze upwards.
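
The effect of the message order on a leaf partition can be modelled with a few lines of standalone C. This is a simplified sketch, not the collector code: it assumes propagation adds the delta between changes_since_analyze and changes_since_analyze_reported to each ancestor (as discussed earlier in the thread), and the TabEntry struct and function names are made up for the example. It does not model the partition-root case.

```c
/* Hypothetical stand-in for a per-table stats entry. */
typedef struct TabEntry
{
    long changes_since_analyze;
    long changes_since_analyze_reported;
} TabEntry;

/* Models the ANALYZE report with the reset flag set: both counters go to zero. */
static void
recv_analyze(TabEntry *tab)
{
    tab->changes_since_analyze = 0;
    tab->changes_since_analyze_reported = 0;
}

/* Models ancestor propagation: add the not-yet-reported delta to each ancestor. */
static void
recv_anl_ancestors(TabEntry *tab, TabEntry *ancestors[], int nancestors)
{
    long delta = tab->changes_since_analyze - tab->changes_since_analyze_reported;

    for (int i = 0; i < nancestors; i++)
        ancestors[i]->changes_since_analyze += delta;

    tab->changes_since_analyze_reported = tab->changes_since_analyze;
}

/*
 * Returns the parent's counter after a manual ANALYZE of a leaf that had
 * 'changes' pending modifications, for either message order.
 */
static long
parent_counter_after_manual_analyze(long changes, int propagate_first)
{
    TabEntry  parent = {0, 0};
    TabEntry  leaf = {changes, 0};
    TabEntry *ancestors[] = {&parent};

    if (propagate_first)
    {
        recv_anl_ancestors(&leaf, ancestors, 1);
        recv_analyze(&leaf);
    }
    else
    {
        /* Current order: the reset happens first, so the delta is already zero. */
        recv_analyze(&leaf);
        recv_anl_ancestors(&leaf, ancestors, 1);
    }

    return parent.changes_since_analyze;
}
```

In this model, 100 pending changes on the leaf reach the parent only when propagation runs before the reset; with the current order the parent's counter stays at 0.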

I don't think the code as is is fit for v14. It looks like it was rewritten
with a new approach just before the freeze ([1]), and as far as I can tell the
concerns I quoted above weren't even discussed in the whole thread. Alvaro,
any comments?

Greetings,

Andres Freund

[1]: /messages/by-id/20210408032235.GA6842@alvherre.pgsql

#85Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#84)
Re: Autovacuum on partitioned table (autoanalyze)

On 7/29/21 9:03 PM, Andres Freund wrote:

Hi,

CCing RMT because I think we need to do something about this for v14.

Thanks. We are now aware of it.

[...]

I don't think the code as is is fit for v14. It looks like it was rewritten
with a new approach just before the freeze ([1]), and as far as I can tell the
concerns I quoted above weren't even discussed in the whole thread. Alvaro,
any comments?

I discussed this briefly with Alvaro late last night. He's now aware of
the issue, but I believe he's away for some days, and probably won't be
able to respond until his return.

Sorry I don't have more news, but I didn't want anyone thinking this was
being ignored.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#86Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#84)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

At Thu, 29 Jul 2021 18:03:55 -0700, Andres Freund <andres@anarazel.de> wrote in

And if one instead inverts the order of pgstat_report_analyze() and
pgstat_report_anl_ancestors() one gets a slightly different problem: A manual
ANALYZE of the partition root results in the partition root having a non-zero
changes_since_analyze afterwards. expand_vacuum_rel() causes child partitions to be
added to the list of relations, which *first* causes the partition root to be
analyzed, and *then* partitions. The partitions then report their
changes_since_analyze upwards.

For the last behavior, as Andres suggested, the scan order needs to be
reversed (or to be in the same order as autovacuum). Since
find_all_inheritors scans breadth-first, just reversing the result
works. Breadth-first ordering is currently not part of the contract of
the function's interface. I suppose we can add such a contract?

Finally, I ended up with the attached.

- reverse the relation order within a tree
- reverse the order of pgstat_report_analyze and pgstat_report_anl_ancestors.

Inheritance expansion is performed on a per-tree basis so it works fine
even if multiple relations are given to vacuum().
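
The ordering argument can be checked with a small standalone C sketch. It is not the pg_inherits.c or vacuum.c code: the tree is represented by a hypothetical parent[] array, the function names are made up, and the sketch only demonstrates that reversing a breadth-first listing puts every relation ahead of its parent.

```c
#define MAXRELS 16

/*
 * Breadth-first listing of the tree rooted at 'root'. parent[i] is the
 * index of relation i's parent, or -1 for the root. Returns the number of
 * entries written to out[], which doubles as the traversal queue.
 */
static int
breadth_first(const int parent[], int nrels, int root, int out[])
{
    int head = 0;
    int tail = 0;

    out[tail++] = root;
    while (head < tail)
    {
        int cur = out[head++];

        for (int i = 0; i < nrels; i++)
            if (parent[i] == cur)
                out[tail++] = i;
    }
    return tail;
}

/* Reverse a[] in place. */
static void
reverse(int a[], int n)
{
    for (int i = 0, j = n - 1; i < j; i++, j--)
    {
        int tmp = a[i];

        a[i] = a[j];
        a[j] = tmp;
    }
}

/* Returns 1 if every relation in order[] appears before its parent. */
static int
children_before_parents(const int parent[], const int order[], int n)
{
    int pos[MAXRELS];

    for (int i = 0; i < n; i++)
        pos[order[i]] = i;

    for (int i = 0; i < n; i++)
    {
        int rel = order[i];

        if (parent[rel] != -1 && pos[rel] > pos[parent[rel]])
            return 0;
    }
    return 1;
}
```

For a two-level tree such as parent[] = {-1, 0, 1, 1, 0, 4, 4}, the breadth-first listing starts with the root, and the reversed listing ends with it, so each leaf is processed before its intermediate parent and the root comes last.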

I don't think the code as is is fit for v14. It looks like it was rewritten
with a new approach just before the freeze ([1]), and as far as I can tell the
concerns I quoted above weren't even discussed in the whole thread. Alvaro,
any comments?

Greetings,

Andres Freund

[1] /messages/by-id/20210408032235.GA6842@alvherre.pgsql

FYI: this behaves as follows.

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (200) PARTITION BY RANGE(a);
CREATE TABLE c11 PARTITION OF c1 FOR VALUES FROM (0) TO (100);
CREATE TABLE c12 PARTITION OF c1 FOR VALUES FROM (100) TO (200);
CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (200) TO (400) PARTITION BY RANGE(a);
CREATE TABLE c21 PARTITION OF c2 FOR VALUES FROM (200) TO (300);
CREATE TABLE c22 PARTITION OF c2 FOR VALUES FROM (300) TO (400);
INSERT INTO p (SELECT a FROM generate_series(0, 400 - 1) a, generate_series(0, 10) b);

INSERT INTO p (SELECT 200 FROM generate_series(0, 99));

SELECT relid, relname, n_mod_since_analyze FROM pg_stat_user_tables ORDER BY relid;
 relid | relname | n_mod_since_analyze
-------+---------+---------------------
 16426 | p       |                   0
 16429 | c1      |                   0
 16432 | c11     |                   0
 16435 | c12     |                   0
 16438 | c2      |                   0
 16441 | c21     |                 100
 16444 | c22     |                   0
 16447 | sa      |                   0
(8 rows)

After "ANALYZE c21;"
 relid | relname | n_mod_since_analyze
-------+---------+---------------------
 16426 | p       |                 100
 16429 | c1      |                   0
 16432 | c11     |                   0
 16435 | c12     |                   0
 16438 | c2      |                 100
 16441 | c21     |                   0
 16444 | c22     |                   0
 16447 | sa      |                   0

After "ANALYZE c2;"
 relid | relname | n_mod_since_analyze
-------+---------+---------------------
 16426 | p       |                 100
 16429 | c1      |                   0
 16432 | c11     |                   0
 16435 | c12     |                   0
 16438 | c2      |                   0
 16441 | c21     |                   0
 16444 | c22     |                   0
 16447 | sa      |                   0

After "ANALYZE p;"
(all zero)

However, this gives a strange-looking side-effect, which affected
regression results.

=# VACUUM ANALYZE p(a, a);
ERROR: column "a" of relation "c22" appears more than once

(Previously it complained about p.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Fix-changes_since_analyze-s-motion-on-manual-analyze.patch (text/x-patch; charset=us-ascii)
From 16f7602f1b7755f288c508f1e57e0eae3c305813 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 4 Aug 2021 13:40:59 +0900
Subject: [PATCH] Fix changes_since_analyze's motion on manual analyze on
 partitioned tables

The analyze-stats machinery assumed bottom-to-top-ordered relation
scans, but manual ANALYZE actually processed relations in the opposite
order. In addition, the current code tries to propagate
changes_since_analyze to parents after stats reporting, which resets
the number to propagate.

As a result, when doing manual ANALYZE on a partition,
changes_since_analyze vanishes instead of being propagated to
parents. On the other hand, when doing that on a partitioned table,
the leaf relations end up having bogus stats values.

To fix this, reverse the order of relations when running manual ANALYZE
and move stats reporting after stats propagation.
---
 src/backend/catalog/pg_inherits.c    |  3 ++-
 src/backend/commands/analyze.c       | 26 +++++++++++++-------------
 src/backend/commands/vacuum.c        | 25 +++++++++++++++----------
 src/test/regress/expected/vacuum.out | 18 +++++++++---------
 4 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index ae990d4877..3451578580 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -239,7 +239,8 @@ find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
 /*
  * find_all_inheritors -
  *		Returns a list of relation OIDs including the given rel plus
- *		all relations that inherit from it, directly or indirectly.
+ *		all relations that inherit from it, directly or indirectly, in
+ *		breadth-first ordering.
  *		Optionally, it also returns the number of parents found for
  *		each such relation within the inheritance tree rooted at the
  *		given rel.
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0c9591415e..414da6630b 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -682,19 +682,6 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 							in_outer_xact);
 	}
 
-	/*
-	 * Now report ANALYZE to the stats collector.  For regular tables, we do
-	 * it only if not doing inherited stats.  For partitioned tables, we only
-	 * do it for inherited stats. (We're never called for not-inherited stats
-	 * on partitioned tables anyway.)
-	 *
-	 * Reset the changes_since_analyze counter only if we analyzed all
-	 * columns; otherwise, there is still work for auto-analyze to do.
-	 */
-	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-							  (va_cols == NIL));
-
 	/*
 	 * If this is a manual analyze of all columns of a permanent leaf
 	 * partition, and not doing inherited stats, also let the collector know
@@ -711,6 +698,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
 	}
 
+	/*
+	 * Now report ANALYZE to the stats collector.  For regular tables, we do
+	 * it only if not doing inherited stats.  For partitioned tables, we only
+	 * do it for inherited stats. (We're never called for not-inherited stats
+	 * on partitioned tables anyway.)
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+							  (va_cols == NIL));
+
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
 	 *
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5c4bc15b44..0a378487fb 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -826,13 +826,15 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 		ReleaseSysCache(tuple);
 
 		/*
-		 * If it is, make relation list entries for its partitions.  Note that
-		 * the list returned by find_all_inheritors() includes the passed-in
-		 * OID, so we have to skip that.  There's no point in taking locks on
-		 * the individual partitions yet, and doing so would just add
-		 * unnecessary deadlock risk.  For this last reason we do not check
-		 * yet the ownership of the partitions, which get added to the list to
-		 * process.  Ownership will be checked later on anyway.
+		 * If it is, make relation list entries for its partitions in a
+		 * bottom-to-top manner. Note that the list returned by
+		 * find_all_inheritors() is in top-to-bottom ordering and includes the
+		 * passed-in OID, so we have to reverse the order and skip the
+		 * passed-in OID.  There's no point in taking locks on the individual
+		 * partitions yet, and doing so would just add unnecessary deadlock
+		 * risk.  For this last reason we do not check yet the ownership of the
+		 * partitions, which get added to the list to process.  Ownership will
+		 * be checked later on anyway.
 		 */
 		if (include_parts)
 		{
@@ -852,9 +854,12 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 				 * later.
 				 */
 				oldcontext = MemoryContextSwitchTo(vac_context);
-				vacrels = lappend(vacrels, makeVacuumRelation(NULL,
-															  part_oid,
-															  vrel->va_cols));
+
+				/* Make the order reversed */
+				vacrels = lcons(makeVacuumRelation(NULL,
+												   part_oid,
+												   vrel->va_cols),
+								vacrels);
 				MemoryContextSwitchTo(oldcontext);
 			}
 		}
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 3e70e4c788..5b9726225e 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -196,9 +196,9 @@ VACUUM (FULL) vacparted;
 VACUUM (FREEZE) vacparted;
 -- check behavior with duplicate column mentions
 VACUUM ANALYZE vacparted(a,b,a);
-ERROR:  column "a" of relation "vacparted" appears more than once
+ERROR:  column "a" of relation "vacparted1" appears more than once
 ANALYZE vacparted(a,b,b);
-ERROR:  column "b" of relation "vacparted" appears more than once
+ERROR:  column "b" of relation "vacparted1" appears more than once
 -- partitioned table with index
 CREATE TABLE vacparted_i (a int primary key, b varchar(100))
   PARTITION BY HASH (a);
@@ -239,7 +239,7 @@ ANALYZE vacparted (b), vactst;
 ANALYZE vactst, does_not_exist, vacparted;
 ERROR:  relation "does_not_exist" does not exist
 ANALYZE vactst (i), vacparted (does_not_exist);
-ERROR:  column "does_not_exist" of relation "vacparted" does not exist
+ERROR:  column "does_not_exist" of relation "vacparted1" does not exist
 ANALYZE vactst, vactst;
 BEGIN;  -- ANALYZE behaves differently inside a transaction block
 ANALYZE vactst, vactst;
@@ -319,24 +319,24 @@ WARNING:  skipping "pg_authid" --- only superuser can vacuum it
 -- independently.
 VACUUM vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can vacuum it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
 ANALYZE vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can analyze it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
 VACUUM (ANALYZE) vacowned_parted;
 WARNING:  skipping "vacowned_parted" --- only table or database owner can vacuum it
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part2;
@@ -389,22 +389,22 @@ ALTER TABLE vacowned_parted OWNER TO regress_vacuum;
 ALTER TABLE vacowned_part1 OWNER TO CURRENT_USER;
 SET ROLE regress_vacuum;
 VACUUM vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
 ANALYZE vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can analyze it
 ANALYZE vacowned_part2;
 WARNING:  skipping "vacowned_part2" --- only table or database owner can analyze it
 VACUUM (ANALYZE) vacowned_parted;
-WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 WARNING:  skipping "vacowned_part2" --- only table or database owner can vacuum it
+WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part1;
 WARNING:  skipping "vacowned_part1" --- only table or database owner can vacuum it
 VACUUM (ANALYZE) vacowned_part2;
-- 
2.27.0

#87Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andres Freund (#83)
Re: Autovacuum on partitioned table (autoanalyze)

Hi

On 2021-Jul-27, Andres Freund wrote:

Isn't this going to create a *lot* of redundant sampling? Especially if you
have any sort of nested partition tree. In the most absurd case a partition
with n parents will get sampled n times, solely due to changes to itself.

It seems to me that you're barking up the wrong tree on this point.
This problem you describe is not something that was caused by this
patch; ANALYZE has always worked like this. We have discussed the idea
of avoiding redundant sampling, but it's clear that it is not a simple
problem, and solving it was not in scope for this patch.

Additionally, while analyzing, all child partitions of a partitioned table
are AccessShareLock'ed at once. If a partition hierarchy has more than one
level, it actually is likely that multiple autovacuum workers will end up
processing the ancestors separately. This seems like it might contribute to
lock exhaustion issues with larger partition hierarchies?

I agree this seems a legitimate problem.

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/

#88Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#87)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

On 2021-08-09 16:02:33 -0400, Alvaro Herrera wrote:

On 2021-Jul-27, Andres Freund wrote:

Isn't this going to create a *lot* of redundant sampling? Especially if you
have any sort of nested partition tree. In the most absurd case a partition
with n parents will get sampled n times, solely due to changes to itself.

It seems to me that you're barking up the wrong tree on this point.
This problem you describe is not something that was caused by this
patch; ANALYZE has always worked like this. We have discussed the idea
of avoiding redundant sampling, but it's clear that it is not a simple
problem, and solving it was not in scope for this patch.

I don't agree. There's a difference between this happening after a manual
ANALYZE on partition roots, and this continuously happening in production
workloads due to auto-analyzes...

Greetings,

Andres Freund

#89Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andres Freund (#82)
Re: Autovacuum on partitioned table (autoanalyze)

Hello,

On 2021-Jul-22, Andres Freund wrote:

1) Somehow it seems like a violation to do stuff like
get_partition_ancestors() in pgstat.c. It's nothing I can't live with, but
it feels a bit off. Would likely not be too hard to address, e.g. by just
putting some of pgstat_report_anl_ancestors in partition.c instead.

I understand the complaint about this being a modularity violation -- the
point being that pgstat.c has no business accessing system catalogs at all.
Before this function, all pgstat_report_* functions were just assembling
a message from counters accumulated somewhere and sending the bytes to
the collector, and this new function is a deviation from that.

It seems that we could improve this by having a function (maybe in
partition.c as you propose), something like

static void
report_partition_ancestors(Oid relid)
{
	List	   *ancestors;
	ListCell   *lc;
	Oid		   *array;
	int			i = 0;

	ancestors = get_partition_ancestors( ... );
	array = palloc(sizeof(Oid) * list_length(ancestors));
	foreach(lc, ancestors)
	{
		array[i++] = lfirst_oid(lc);
	}
	pgstat_report_partition_ancestors(relid, array);
}

and then pgstat.c works with the given array without having to consult
system catalogs.

2) Why does it make sense that autovacuum sends a stats message for every
partition in the system that had any [changes] since the last autovacuum
cycle? On a database with a good number of objects / a short naptime we'll
often end up sending messages for the same set of tables from separate
workers, because they don't yet see the concurrent
tabentry->changes_since_analyze_reported.

The traffic could be large, yeah, and I agree it seems undesirable. If
collector kept a record of the list of ancestors of each table, then we
wouldn't need to do this (we would have to know if collector knows a
particular partition or not, though ... I have no ideas on that.)

3) What is the goal of the autovac_refresh_stats() after the loop doing
pgstat_report_anl_ancestors()? I think it'll be common that the stats
collector hasn't even processed the incoming messages by that point, not to
speak of actually having written out a new stats file. If it took less than
10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
backend_read_statsfile() will not wait for a new stats file to be written
out, and we'll just re-read the state we previously did.

It's pretty expensive to re-read the stats file in some workloads, so I'm a
bit concerned that we end up significantly increasing the amount of stats
updates/reads, without actually gaining anything reliable?

This is done once per autovacuum run and the point is precisely to let
the next block absorb the updates that were sent. In manual ANALYZE we
do it to inform future autovacuum runs.

Note that the PGSTAT_RETRY_DELAY limit is used by the autovac launcher
only, and this code is running in the worker; we do flush out the old
data. Yes, it's expensive, but we're not doing it once per table, just
once per worker run.

4) In the shared mem stats patch I went to a fair bit of trouble to try to get
rid of pgstat_vacuum_stat() (which scales extremely poorly to larger
systems). For that to work pending stats can only be "staged" while holding
a lock on a relation that prevents the relation from being concurrently
dropped (pending stats increment a refcount for the shared stats object,
which ensures that we don't lose track of the fact that a stats object has
been dropped, even when stats only get submitted later).

I'm not yet clear on how to make this work for
pgstat_report_anl_ancestors() - but I probably can find a way. But it does
feel a bit off to issue stats stuff for tables we're not sure still exist.

I assume you refer to locking the *partition*, right? You're not
talking about locking the ancestor mentioned in the message. I don't
know how the shmem-collector works, but it shouldn't be a problem
that an ancestor goes away (ALTER TABLE parent DETACH; DROP TABLE
parent); as long as you've kept a lock on the partition, it should be
fine. Or am I misinterpreting what you mean?

I'll go and read through the thread, but my first thought is that having a
hashtable in do_autovacuum() that contains stats for partitioned tables would
be a good bit more efficient than the current approach? We already have a
hashtable for each toast table, compared to that having a hashtable for each
partitioned table doesn't seem like it'd be a problem?

With a small bit of extra work that could even avoid the need for the
additional pass through pg_class. Do the partitioned table data-gathering as
part of the "collect main tables to vacuum" pass, and then do one of

I'll have to re-read the thread to remember why I made it a separate
pass. I think I did it that way because otherwise there was a
requirement on the pg_class scan order. (Some earlier version of the
patch did not have a separate pass and there was some problem or other.
Maybe you're right that a hash table is sufficient.)

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"We're here to devour each other alive" (Hobbes)

#90Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andres Freund (#88)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Aug-09, Andres Freund wrote:

I don't agree. There's a difference between this happening after a manual
ANALYZE on partition roots, and this continuously happening in production
workloads due to auto-analyzes...

Hmm. That's not completely untrue.

I bring a radical proposal that may be sufficient to close this
particular hole. What if we made partitions cause only their
top-level parents to become auto-analyzed, and not any intermediate
ancestors? Any intermediate partitioned partitions could be analyzed
manually if the user wished, and perhaps some reloption could enable
autovacuum to do it (with the caveat that it'd cause multiple sampling
of partitions). I don't yet have a clear picture on how to implement
this, but I'll explore it while waiting for opinions on the idea.
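
To make the idea concrete, here is a minimal sketch, not code from any patch in this thread. It models the partition tree as a hypothetical parent[] array (index = table, value = its parent, or NO_PARENT), standing in for pg_inherits lookups. count_ancestors() gives the number of ancestors that receive counts when every level is reported to; topmost_ancestor() gives the single table that would receive them under this proposal.

```c
#include <assert.h>

#define NO_PARENT (-1)

/*
 * Hypothetical stand-in for pg_inherits: parent[i] is the parent of table i,
 * or NO_PARENT if table i is not a partition.
 */

/* Number of ancestors above relid; each one is a separate analyze target. */
static int
count_ancestors(const int *parent, int relid)
{
	int			n = 0;

	assert(relid >= 0);

	while (parent[relid] != NO_PARENT)
	{
		relid = parent[relid];
		n++;
	}
	return n;
}

/* Topmost ancestor of relid, or NO_PARENT if relid is not a partition. */
static int
topmost_ancestor(const int *parent, int relid)
{
	int			cur = relid;

	assert(relid >= 0);

	while (parent[cur] != NO_PARENT)
		cur = parent[cur];

	return (cur == relid) ? NO_PARENT : cur;
}
```

With a tree p -> p_0 -> p_0_0, a change in p_0_0 has two ancestors to report to under the per-level scheme, but only p under the topmost-only scheme.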

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)

#91Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alvaro Herrera (#90)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Aug-10, Alvaro Herrera wrote:

I bring a radical proposal that may be sufficient to close this
particular hole. What if we made partitions cause only their
top-level parents to become auto-analyzed, and not any intermediate
ancestors? Any intermediate partitioned partitions could be analyzed
manually if the user wished, and perhaps some reloption could enable
autovacuum to do it (with the caveat that it'd cause multiple sampling
of partitions). I don't yet have a clear picture on how to implement
this, but I'll explore it while waiting for opinions on the idea.

So, with this patch (a quick and dirty job) we no longer sample all
partitions twice; we no longer propagate the tuple counts to p_0.
We don't have stats on p_0 anymore, only on p and on the individual
partitions.
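
The receiving side can be pictured with the following simplified sketch. The struct and function names here are illustrative stand-ins, not the real PgStat_* layout, and returning the updated "reported" mark is my assumption about code outside the hunk shown in the attachment. Only ancestors flagged for propagation receive the partition's not-yet-reported change count.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for one ancestor entry in the message. */
typedef struct AncestorEntry
{
	int			ancestor_id;	/* index into the counters array */
	bool		propagate_up;	/* false for intermediate ancestors */
} AncestorEntry;

/*
 * Add the not-yet-reported part of a partition's change count to every
 * flagged ancestor, and return the partition's new "reported" mark.
 */
static long
propagate_changes(long *ancestor_changes,
				  const AncestorEntry *entries, int nentries,
				  long changes, long changes_reported)
{
	long		delta = changes - changes_reported;

	assert(delta >= 0);

	for (int i = 0; i < nentries; i++)
	{
		if (!entries[i].propagate_up)
			continue;			/* intermediate ancestor: skipped */
		ancestor_changes[entries[i].ancestor_id] += delta;
	}

	return changes;
}
```

For a partition with 150 changes of which 50 were already reported, p gains 100 and p_0 gains nothing.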

I didn't move the new #include to a more decent place because
1. that stuff is going to move to partition.c as a new function,
including the new include;
2. that new function also needs to read the reloptions for p_0 to allow
the user to enable stat acquisition for p_0 with "alter table p_0 set
(autovacuum_enabled=1)";
3. need to avoid reporting ancestors of a partition repeatedly, which
forestalls the performance objection about reading reloptions too
frequently.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Attachments:

0001-Propagate-counts-up-only-to-topmost-ancestor.patch (text/x-diff; charset=utf-8)
From 064bc88bf94b6b4e1bfc16f0639e1500b17b9bf5 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 10 Aug 2021 13:05:59 -0400
Subject: [PATCH] Propagate counts up only to topmost ancestor

Ignore intermediate partitions, to avoid redundant sampling of
partitions.  If needed, those intermediate partitions can be analyzed
manually.
---
 src/backend/postmaster/pgstat.c | 21 ++++++++++++++++-----
 src/include/pgstat.h            |  9 +++++++--
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1b54ef74eb..a003966cc8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1684,6 +1684,7 @@ pgstat_report_analyze(Relation rel,
  *	counts from the partition to its ancestors.  This is necessary so that
  *	other processes can decide whether to analyze the partitioned tables.
  */
+#include "utils/lsyscache.h"
 void
 pgstat_report_anl_ancestors(Oid relid)
 {
@@ -1700,19 +1701,25 @@ pgstat_report_anl_ancestors(Oid relid)
 	foreach(lc, ancestors)
 	{
 		Oid			ancestor = lfirst_oid(lc);
+		bool		ispartition;
 
-		msg.m_ancestors[msg.m_nancestors] = ancestor;
-		if (++msg.m_nancestors >= PGSTAT_NUM_ANCESTORENTRIES)
+		ispartition = get_rel_relispartition(ancestor);
+
+		msg.m_ancestors[msg.m_nancestors].m_ancestor_id = ancestor;
+		msg.m_ancestors[msg.m_nancestors].m_propagate_up = !ispartition;
+		msg.m_nancestors++;
+
+		if (msg.m_nancestors >= PGSTAT_NUM_ANCESTORENTRIES)
 		{
 			pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
-						msg.m_nancestors * sizeof(Oid));
+						msg.m_nancestors * sizeof(PgStat_AnlAncestor));
 			msg.m_nancestors = 0;
 		}
 	}
 
 	if (msg.m_nancestors > 0)
 		pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
-					msg.m_nancestors * sizeof(Oid));
+					msg.m_nancestors * sizeof(PgStat_AnlAncestor));
 
 	list_free(ancestors);
 }
@@ -5415,9 +5422,13 @@ pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len)
 
 	for (int i = 0; i < msg->m_nancestors; i++)
 	{
-		Oid			ancestor_relid = msg->m_ancestors[i];
+		Oid			ancestor_relid;
 		PgStat_StatTabEntry *ancestor;
 
+		if (!msg->m_ancestors[i].m_propagate_up)
+			continue;
+
+		ancestor_relid = msg->m_ancestors[i].m_ancestor_id;
 		ancestor = pgstat_get_tab_entry(dbentry, ancestor_relid, true);
 		ancestor->changes_since_analyze +=
 			tabentry->changes_since_analyze - tabentry->changes_since_analyze_reported;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2068a68a5f..46ef88e73b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -438,9 +438,14 @@ typedef struct PgStat_MsgAnalyze
  *								analyze counters
  * ----------
  */
+typedef struct PgStat_AnlAncestor
+{
+	Oid			m_ancestor_id;
+	bool		m_propagate_up;
+} PgStat_AnlAncestor;
 #define PGSTAT_NUM_ANCESTORENTRIES    \
 	((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(Oid) - sizeof(int))	\
-	 / sizeof(Oid))
+	 / sizeof(PgStat_AnlAncestor))
 
 typedef struct PgStat_MsgAnlAncestors
 {
@@ -448,7 +453,7 @@ typedef struct PgStat_MsgAnlAncestors
 	Oid			m_databaseid;
 	Oid			m_tableoid;
 	int			m_nancestors;
-	Oid			m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
+	PgStat_AnlAncestor m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
 } PgStat_MsgAnlAncestors;
 
 /* ----------
-- 
2.20.1

#92Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alvaro Herrera (#89)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Aug-09, Alvaro Herrera wrote:

3) What is the goal of the autovac_refresh_stats() after the loop doing
pgstat_report_anl_ancestors()? I think it'll be common that the stats
collector hasn't even processed the incoming messages by that point, not to
speak of actually having written out a new stats file. If it took less than
10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
backend_read_statsfile() will not wait for a new stats file to be written
out, and we'll just re-read the state we previously did.

It's pretty expensive to re-read the stats file in some workloads, so I'm a
bit concerned that we end up significantly increasing the amount of stats
updates/reads, without actually gaining anything reliable?

This is done once per autovacuum run and the point is precisely to let
the next block absorb the updates that were sent. In manual ANALYZE we
do it to inform future autovacuum runs.

Note that the PGSTAT_RETRY_DELAY limit is used by the autovac launcher
only, and this code is running in the worker; we do flush out the old
data. Yes, it's expensive, but we're not doing it once per table, just
once per worker run.

I misunderstood what you were talking about here -- I thought it was
about the delay in autovac_refresh_stats (STATS_READ_DELAY, 1s). Now
that I look at this again I realize what your point is, and you're
right, there isn't sufficient time for the collector to absorb the
messages we sent before the next pg_class scan starts.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)

#93Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andres Freund (#82)
Re: Autovacuum on partitioned table (autoanalyze)

After thinking about the described issues for a while, my proposal is to
completely revamp the way this feature works. See below.

Now, the proposal seems awfully invasive, but it's *the* way I see to
avoid the pgstat traffic. For pg14, maybe we can live with it, and just
use the smaller patches that Horiguchi-san and I have posted, which
solve the other issues; also, Euler Taveira suggested that we could add
a reloption to turn the feature off completely for some tables (maybe
make it off by default and have a reloption to turn it on for specific
partition hierarchies), so that it doesn't cause undue pain for people
with large partitioning hierarchies.

* PgStat_StatTabEntry gets a new "Oid reportAncestorOid" member. This is
the OID of a single partitioned ancestor, to which the changed-tuple
counts are propagated up.
Normally this is the topmost ancestor; but if the user wishes some
intermediate ancestor to receive the counts they can use
ALTER TABLE the_intermediate_ancestor SET (autovacuum_enabled=on).

* Corollary 1: for the normal case of single-level partitioning, the
parent partitioned table behaves as currently.

* Corollary 2: for multi-level partitioning with no specially
configured intermediate ancestors, only the leaf partitions and the
top-level partitioned table will be analyzed. Intermediate ancestors
are ignored by autovacuum.

* Corollary 3: for multi-level partitioning with some intermediate
ancestor(s) marked as autovacuum_enabled=on, that ancestor will
receive all the counts from all of its partitions, so it will get
analyzed itself; and it'll also forward those counts up to its
report-ancestor.

* On ALTER TABLE .. ATTACH PARTITION or CREATE TABLE .. PARTITION OF,
we send a message to collector with the analyze-ancestor OID.

* Backends running manual ANALYZE as well as autovacuum will examine
each table's "relispartition" flag and its pgstat table entry; if it
is a partition and doesn't have reportAncestorOid set, determine which
ancestor the analyze counts should be reported to; include this OID in the
regular PgStat_MsgAnalyze. This fixes the situation after a crash or
other stats reset. Also, it's not unduly expensive to do, because
it's only in the rare case that the value sent by ATTACH was lost.

* Possible race condition in the previous step may cause multiple
backends to send the same info. Not a serious problem so we don't try
to handle it.

* When tuple change counts for a partition are received by
pgstat_recv_tabstat, they are propagated up to the indicated parent
table in addition to being saved in the table itself.
(Bonus points: when a table is attached or detached as a partition,
the live tuples count is propagated to the newly acquired parent.)
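
As a rough illustration of how the report-ancestor could be chosen under the rules above, here is a sketch. The RelInfo struct and choose_report_ancestor() are hypothetical, and "nearest explicitly marked ancestor, else the topmost one" is my reading of the first bullet and Corollary 3, not settled behavior.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_REL (-1)

/* Hypothetical, simplified catalog information for one table. */
typedef struct RelInfo
{
	int			parent;			/* NO_REL if not a partition */
	bool		av_enabled_set; /* autovacuum_enabled=on set explicitly */
} RelInfo;

/*
 * Pick the single ancestor that receives relid's change counts: the nearest
 * ancestor explicitly marked, else the topmost one.  Returns NO_REL if relid
 * is not a partition.
 */
static int
choose_report_ancestor(const RelInfo *rels, int relid)
{
	int			cur;

	assert(relid >= 0);

	cur = rels[relid].parent;
	if (cur == NO_REL)
		return NO_REL;

	while (rels[cur].parent != NO_REL && !rels[cur].av_enabled_set)
		cur = rels[cur].parent;

	return cur;
}
```

Applying the same function to a marked intermediate ancestor yields the table it forwards counts to, which is how Corollary 3 chains upward.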

What do people think of this?

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/

#94Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#93)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

On 2021-08-11 18:33:07 -0400, Alvaro Herrera wrote:

After thinking about the described issues for a while, my proposal is to
completely revamp the way this feature works. See below.

Now, the proposal seems awfully invasive, but it's *the* way I see to
avoid the pgstat traffic. For pg14, maybe we can live with it, and just
use the smaller patches that Horiguchi-san and I have posted, which
solve the other issues; also, Euler Taveira suggested that we could add
a reloption to turn the feature off completely for some tables (maybe
make it off by default and have a reloption to turn it on for specific
partition hierarchies), so that it doesn't cause undue pain for people
with large partitioning hierarchies.

I think we should revert the changes for 14 - to me the feature clearly isn't
mature enough to be released.

* PgStat_StatTabEntry gets a new "Oid reportAncestorOid" member. This is
the OID of a single partitioned ancestor, to which the changed-tuple
counts are propagated up.
Normally this is the topmost ancestor; but if the user wishes some
intermediate ancestor to receive the counts they can use
ALTER TABLE the_intermediate_ancestor SET (autovacuum_enabled=on).

* Corollary 1: for the normal case of single-level partitioning, the
parent partitioned table behaves as currently.

* Corollary 2: for multi-level partitioning with no specially
configured intermediate ancestors, only the leaf partitions and the
top-level partitioned table will be analyzed. Intermediate ancestors
are ignored by autovacuum.

* Corollary 3: for multi-level partitioning with some intermediate
ancestor(s) marked as autovacuum_enabled=on, that ancestor will
receive all the counts from all of its partitions, so it will get
analyzed itself; and it'll also forward those counts up to its
report-ancestor.

This seems awfully confusing to me.

One fundamental issue here is that we separately build stats for partitioned
tables and partitions. Can we instead tackle this by reusing the stats for
partitions for the inheritance stats? I think it's a bit easier to do that
for partitioned tables than for old school inheritance roots, because there's
no other rows in partitioned tables.

Greetings,

Andres Freund

#95Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andres Freund (#94)
2 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Here is a proposal for 14. This patch has four main changes:

* The mod counts are only propagated to the topmost parent, not to each ancestor. This means that we'll only analyze the topmost partitioned table and not each intermediate partitioned table; seems a good compromise to avoid sampling all partitions multiple times per round.

* One pgstat message is sent containing many partition/parent pairs, not just one. This reduces the number of messages sent. 123 partitions fit in one message (messages are 1000 bytes). This is done once per autovacuum worker run, so it shouldn't be too bad.

* There's a sleep between sending the message and re-reading stats. It would be great to have a mechanism by which pgstat collector says "I've received and processed up to this point", but we don't have that; what we can do is sleep PGSTAT_STAT_INTERVAL and then reread the file, so we're certain that the file we read is at least as new as that time. This is far longer than it takes to process the messages. Note that if the messages do take longer than that to be processed by the collector, it's not a big loss anyway; those tables will be processed by the next autovacuum run.

* I changed expand_vacuum_rel to put the main-rel OID at the end. (This is a variation of Horiguchi-san's proposed patch; instead of making the complete list be in the opposite order, it's just that one OID that appears at the other end). This has the same effect as his patch: any error reports thrown by vacuum/analyze mention the first partition rather than the main table. This part is in 0002 and I'm not totally convinced it's a sane idea.
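
The "123 partitions per message" figure in the second item can be reproduced with the arithmetic below. The field layout is an assumption made for this sketch (an 8-byte message header, one database OID, one int count, then pairs of 4-byte OIDs); the macro and function names are illustrative and do not come from pgstat.h.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes, chosen to match the figures quoted above. */
#define MAX_MSG_SIZE	1000	/* total message size in bytes */
#define MSG_HDR_SIZE	8		/* assumed: message type + size, 4 bytes each */
#define OID_SIZE		4
#define COUNT_SIZE		4		/* int holding the number of pairs */

/* Fixed fields assumed ahead of the pair array: header, database OID, count. */
#define FIXED_SIZE		(MSG_HDR_SIZE + OID_SIZE + COUNT_SIZE)
#define PAIR_SIZE		(2 * OID_SIZE)	/* partition OID + ancestor OID */
#define PAIRS_PER_MSG	((MAX_MSG_SIZE - FIXED_SIZE) / PAIR_SIZE)

/* Number of messages needed to report npairs partition/ancestor pairs. */
static size_t
messages_needed(size_t npairs)
{
	assert(PAIRS_PER_MSG > 0);

	return (npairs + PAIRS_PER_MSG - 1) / PAIRS_PER_MSG;
}
```

Under these assumptions (1000 - 16) / 8 = 123 pairs fit in one message, so a worker reporting 1000 changed partitions would send 9 messages.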

Minor changes:
* I reduced autovacuum from three passes over pg_class to two passes, per your observation that we can acquire toast association together with processing partitions, and then use that in the second pass to collect everything.

* I moved the catalog-accessing code to partition.c, so we don't need to have pgstat.c doing it.

Some doc changes are pending, and some more commentary in parts of the code, but I think this is much more sensible. I do lament the lack of a syscache for pg_inherits.

Attachments:

0001-Propagate-counts-up-only-to-topmost-ancestor.patch (text/x-patch)
From 3e904de5f15cfc69692ad2aea64c0034445d957e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 10 Aug 2021 13:05:59 -0400
Subject: [PATCH 1/2] Propagate counts up only to topmost ancestor

Ignore intermediate partitions, to avoid redundant sampling of
partitions.  If needed, those intermediate partitions can be analyzed
manually.
---
 src/backend/catalog/partition.c     |  53 +++++++
 src/backend/commands/analyze.c      |   3 +-
 src/backend/postmaster/autovacuum.c | 222 ++++++++++++++--------------
 src/backend/postmaster/pgstat.c     |  60 ++++----
 src/include/catalog/partition.h     |   1 +
 src/include/pgstat.h                |  21 ++-
 6 files changed, 211 insertions(+), 149 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 790f4ccb92..017d5ba5a2 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -26,6 +26,7 @@
 #include "nodes/makefuncs.h"
 #include "optimizer/optimizer.h"
 #include "partitioning/partbounds.h"
+#include "pgstat.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/fmgroids.h"
 #include "utils/partcache.h"
@@ -166,6 +167,58 @@ get_partition_ancestors_worker(Relation inhRel, Oid relid, List **ancestors)
 	get_partition_ancestors_worker(inhRel, parentOid, ancestors);
 }
 
+/*
+ * Inform pgstats collector about the topmost ancestor of
+ * each of the given partitions.
+ */
+void
+partition_analyze_report_ancestors(List *partitions)
+{
+	List	   *report_parts = NIL;
+	List	   *ancestors = NIL;
+	Relation	inhRel;
+	ListCell   *lc;
+
+	inhRel = table_open(InheritsRelationId, AccessShareLock);
+
+	/*
+	 * Search pg_inherits for the topmost ancestor of each given partition,
+	 * and if found, store both their OIDs in lists.
+	 *
+	 * By the end of this loop, partitions and ancestors are lists to be
+	 * read in parallel, where the i'th element of ancestors is the topmost
+	 * ancestor of the i'th element of partitions.
+	 */
+	foreach(lc, partitions)
+	{
+		Oid		partition_id = lfirst_oid(lc);
+		Oid		cur_relid;
+
+		cur_relid = partition_id;
+		for (;;)
+		{
+			bool	detach_pending;
+			Oid		parent_relid;
+
+			parent_relid = get_partition_parent_worker(inhRel, cur_relid,
+													   &detach_pending);
+			if ((!OidIsValid(parent_relid) || detach_pending) &&
+				cur_relid != partition_id)
+			{
+				report_parts = lappend_oid(report_parts, partition_id);
+				ancestors = lappend_oid(ancestors, cur_relid);
+				break;
+			}
+
+			cur_relid = parent_relid;
+		}
+	}
+
+	table_close(inhRel, AccessShareLock);
+
+	pgstat_report_anl_ancestors(report_parts, ancestors);
+}
+
 /*
  * index_get_partition
  *		Return the OID of index of the given partition that is a child
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0099a04bbe..c930e3e3cd 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -30,6 +30,7 @@
 #include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "catalog/indexing.h"
+#include "catalog/partition.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_namespace.h"
@@ -708,7 +709,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		onerel->rd_rel->relkind == RELKIND_RELATION &&
 		onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
 	{
-		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
+		partition_analyze_report_ancestors(list_make1_oid(RelationGetRelid(onerel)));
 	}
 
 	/*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 59d348b062..f686f1d39f 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -74,6 +74,7 @@
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
@@ -180,7 +181,6 @@ typedef struct avw_dbase
 typedef struct av_relation
 {
 	Oid			ar_toastrelid;	/* hash key - must be first */
-	Oid			ar_relid;
 	bool		ar_hasrelopts;
 	AutoVacOpts ar_reloptions;	/* copy of AutoVacOpts from the main table's
 								 * reloptions, or NULL if none */
@@ -1959,18 +1959,17 @@ do_autovacuum(void)
 	Form_pg_database dbForm;
 	List	   *table_oids = NIL;
 	List	   *orphan_oids = NIL;
+	List	   *report_ancestors = NIL;
 	HASHCTL		ctl;
 	HTAB	   *table_toast_map;
 	ListCell   *volatile cell;
 	PgStat_StatDBEntry *shared;
 	PgStat_StatDBEntry *dbentry;
 	BufferAccessStrategy bstrategy;
-	ScanKeyData key;
 	TupleDesc	pg_class_desc;
 	int			effective_multixact_freeze_max_age;
 	bool		did_vacuum = false;
 	bool		found_concurrent_worker = false;
-	bool		updated = false;
 	int			i;
 
 	/*
@@ -2056,19 +2055,17 @@ do_autovacuum(void)
 	/*
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
-	 * We do this in three passes: First we let pgstat collector know about
+	 * We do this in two passes: First we let pgstat collector know about
 	 * the partitioned table ancestors of all partitions that have recently
 	 * acquired rows for analyze.  This informs the second pass about the
-	 * total number of tuple count in partitioning hierarchies.
+	 * total number of tuple count in partitioning hierarchies.  In this scan
+	 * we also collect the association of main tables to toast tables.
 	 *
 	 * On the second pass, we collect the list of plain relations,
-	 * materialized views and partitioned tables.  On the third one we collect
-	 * TOAST tables.
-	 *
-	 * The reason for doing the third pass is that during it we want to use
-	 * the main relation's pg_class.reloptions entry if the TOAST table does
-	 * not have any, and we cannot obtain it unless we know beforehand what's
-	 * the main table OID.
+	 * materialized views, partitioned tables.  Also do TOAST tables,
+	 * using the association collected during the first scan (we want to
+	 * apply the main table's reloptions entry in case the TOAST table
+	 * doesn't have any.)
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2079,43 +2076,90 @@ do_autovacuum(void)
 	/*
 	 * First pass: before collecting the list of tables to vacuum, let stat
 	 * collector know about partitioned-table ancestors of each partition.
+	 * Also capture the TOAST-to-main-table association.
 	 */
 	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
 	{
 		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
-		Oid			relid = classForm->oid;
-		PgStat_StatTabEntry *tabentry;
 
-		/* Only consider permanent leaf partitions */
-		if (!classForm->relispartition ||
-			classForm->relkind == RELKIND_PARTITIONED_TABLE ||
-			classForm->relpersistence == RELPERSISTENCE_TEMP)
+		/* Ignore all temp tables here */
+		if (classForm->relpersistence == RELPERSISTENCE_TEMP)
 			continue;
 
 		/*
-		 * No need to do this for partitions that haven't acquired any rows.
+		 * Remember TOAST associations for the second pass.  Note: we must do
+		 * this whether or not the table is going to be vacuumed, because we
+		 * don't automatically vacuum toast tables along the parent table.
 		 */
-		tabentry = pgstat_fetch_stat_tabentry(relid);
-		if (tabentry &&
-			tabentry->changes_since_analyze -
-			tabentry->changes_since_analyze_reported > 0)
+		if ((classForm->relkind == RELKIND_RELATION ||
+			 classForm->relkind == RELKIND_MATVIEW) &&
+			OidIsValid(classForm->reltoastrelid))
 		{
-			pgstat_report_anl_ancestors(relid);
-			updated = true;
+			av_relation *hentry;
+			bool		found;
+
+			hentry = hash_search(table_toast_map,
+								 &classForm->reltoastrelid,
+								 HASH_ENTER, &found);
+			if (!found)
+			{
+				AutoVacOpts *relopts;
+
+				/* hash_search already filled in the key */
+				relopts = extract_autovac_opts(tuple, pg_class_desc);
+				if (relopts)
+				{
+					memcpy(&hentry->ar_reloptions, relopts,
+						   sizeof(AutoVacOpts));
+					hentry->ar_hasrelopts = true;
+				}
+				else
+					hentry->ar_hasrelopts = false;
+			}
+		}
+
+		/* For the below, only consider leaf partitions. */
+		if (classForm->relispartition &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
+		{
+			PgStat_StatTabEntry *tabentry;
+
+			tabentry = pgstat_fetch_stat_tabentry(classForm->oid);
+			if (tabentry &&
+				tabentry->changes_since_analyze -
+				tabentry->changes_since_analyze_reported > 0)
+			{
+				report_ancestors = lappend_oid(report_ancestors, classForm->oid);
+			}
 		}
 	}
 
-	/* Acquire fresh stats for the next passes, if needed */
-	if (updated)
+	/*
+	 * Send the partition-ancestor report to the collector, then acquire fresh
+	 * stats for what comes next.
+	 */
+	if (report_ancestors != NIL)
 	{
+		partition_analyze_report_ancestors(report_ancestors);
+		list_free(report_ancestors);
+
+		/*
+		 * XXX some clever wait goes here, so that collector has time to digest
+		 * the above updates.  I have no better ideas than just sleeping.  We
+		 * hope this is correct because the next backend_read_statsfile will
+		 * only succeed if the file read is as new as the current timestamp.
+		 * This hopefully gives sufficient time for the messages we just sent
+		 * to be processed.
+		 */
+		pg_usleep(1000 * 500);	/* PGSTAT_STAT_INTERVAL */
+
 		autovac_refresh_stats();
 		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 		shared = pgstat_fetch_stat_dbentry(InvalidOid);
 	}
 
 	/*
-	 * On the second pass, we collect main tables to vacuum, and also the main
-	 * table relid to TOAST relid mapping.
+	 * On the second pass we collect tables to vacuum.
 	 */
 	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
 	{
@@ -2127,13 +2171,46 @@ do_autovacuum(void)
 		bool		doanalyze;
 		bool		wraparound;
 
+		relid = classForm->oid;
+
+		if (classForm->relkind == RELKIND_TOASTVALUE &&
+			classForm->relpersistence != RELPERSISTENCE_TEMP)
+		{
+			/*
+			 * fetch reloptions -- if this toast table does not have them, try the
+			 * main rel
+			 */
+			relopts = extract_autovac_opts(tuple, pg_class_desc);
+			if (relopts == NULL)
+			{
+				av_relation *hentry;
+				bool		found;
+
+				hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
+				if (found && hentry->ar_hasrelopts)
+					relopts = &hentry->ar_reloptions;
+			}
+
+			/* Fetch the pgstat entry for this table */
+			tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+												 shared, dbentry);
+
+			relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  &dovacuum, &doanalyze, &wraparound);
+
+			/* ignore analyze for toast tables */
+			if (dovacuum)
+				table_oids = lappend_oid(table_oids, relid);
+
+			continue;
+		}
+
 		if (classForm->relkind != RELKIND_RELATION &&
 			classForm->relkind != RELKIND_MATVIEW &&
 			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
-		relid = classForm->oid;
-
 		/*
 		 * Check if it is a temp table (presumably, of some other backend's).
 		 * We cannot safely process other backends' temp tables.
@@ -2172,89 +2249,6 @@ do_autovacuum(void)
 		/* Relations that need work are added to table_oids */
 		if (dovacuum || doanalyze)
 			table_oids = lappend_oid(table_oids, relid);
-
-		/*
-		 * Remember TOAST associations for the second pass.  Note: we must do
-		 * this whether or not the table is going to be vacuumed, because we
-		 * don't automatically vacuum toast tables along the parent table.
-		 */
-		if (OidIsValid(classForm->reltoastrelid))
-		{
-			av_relation *hentry;
-			bool		found;
-
-			hentry = hash_search(table_toast_map,
-								 &classForm->reltoastrelid,
-								 HASH_ENTER, &found);
-
-			if (!found)
-			{
-				/* hash_search already filled in the key */
-				hentry->ar_relid = relid;
-				hentry->ar_hasrelopts = false;
-				if (relopts != NULL)
-				{
-					hentry->ar_hasrelopts = true;
-					memcpy(&hentry->ar_reloptions, relopts,
-						   sizeof(AutoVacOpts));
-				}
-			}
-		}
-	}
-
-	table_endscan(relScan);
-
-	/* third pass: check TOAST tables */
-	ScanKeyInit(&key,
-				Anum_pg_class_relkind,
-				BTEqualStrategyNumber, F_CHAREQ,
-				CharGetDatum(RELKIND_TOASTVALUE));
-
-	relScan = table_beginscan_catalog(classRel, 1, &key);
-	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
-	{
-		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
-		PgStat_StatTabEntry *tabentry;
-		Oid			relid;
-		AutoVacOpts *relopts = NULL;
-		bool		dovacuum;
-		bool		doanalyze;
-		bool		wraparound;
-
-		/*
-		 * We cannot safely process other backends' temp tables, so skip 'em.
-		 */
-		if (classForm->relpersistence == RELPERSISTENCE_TEMP)
-			continue;
-
-		relid = classForm->oid;
-
-		/*
-		 * fetch reloptions -- if this toast table does not have them, try the
-		 * main rel
-		 */
-		relopts = extract_autovac_opts(tuple, pg_class_desc);
-		if (relopts == NULL)
-		{
-			av_relation *hentry;
-			bool		found;
-
-			hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
-			if (found && hentry->ar_hasrelopts)
-				relopts = &hentry->ar_reloptions;
-		}
-
-		/* Fetch the pgstat entry for this table */
-		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-											 shared, dbentry);
-
-		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
-								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
-
-		/* ignore analyze for toast tables */
-		if (dovacuum)
-			table_oids = lappend_oid(table_oids, relid);
 	}
 
 	table_endscan(relScan);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1b54ef74eb..6af6b7ea45 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1679,42 +1679,42 @@ pgstat_report_analyze(Relation rel,
 /*
  * pgstat_report_anl_ancestors
  *
- *	Send list of partitioned table ancestors of the given partition to the
- *	collector.  The collector is in charge of propagating the analyze tuple
- *	counts from the partition to its ancestors.  This is necessary so that
+ *	Send a message to report the ancestor of each given partition.
+ *	The collector is in charge of propagating the analyze tuple
+ *	counts from the partition to its ancestor.  This is necessary so that
  *	other processes can decide whether to analyze the partitioned tables.
  */
 void
-pgstat_report_anl_ancestors(Oid relid)
+pgstat_report_anl_ancestors(List *partitions, List *ancestors)
 {
 	PgStat_MsgAnlAncestors msg;
-	List	   *ancestors;
-	ListCell   *lc;
+	ListCell   *lc1,
+			   *lc2;
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANL_ANCESTORS);
 	msg.m_databaseid = MyDatabaseId;
-	msg.m_tableoid = relid;
 	msg.m_nancestors = 0;
 
-	ancestors = get_partition_ancestors(relid);
-	foreach(lc, ancestors)
+	forboth(lc1, partitions, lc2, ancestors)
 	{
-		Oid			ancestor = lfirst_oid(lc);
+		msg.m_ancestors[msg.m_nancestors].m_partition_id = lfirst_oid(lc1);
+		msg.m_ancestors[msg.m_nancestors].m_ancestor_id = lfirst_oid(lc2);
+		msg.m_nancestors++;
 
-		msg.m_ancestors[msg.m_nancestors] = ancestor;
-		if (++msg.m_nancestors >= PGSTAT_NUM_ANCESTORENTRIES)
+		if (msg.m_nancestors >= PGSTAT_NUM_ANCESTORENTRIES)
 		{
 			pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
-						msg.m_nancestors * sizeof(Oid));
+						msg.m_nancestors * sizeof(PgStat_AnlAncestor));
 			msg.m_nancestors = 0;
 		}
 	}
 
 	if (msg.m_nancestors > 0)
+	{
 		pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
-					msg.m_nancestors * sizeof(Oid));
-
-	list_free(ancestors);
+					msg.m_nancestors * sizeof(PgStat_AnlAncestor));
+		msg.m_nancestors = 0;
+	}
 }
 
 /* --------
@@ -5403,28 +5403,36 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
+/* ------------
+ * pgstat_recv_anl_ancestors
+ *
+ *	Process an ANALYZE ANCESTORS message
+ * ------------
+ */
 static void
 pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len)
 {
 	PgStat_StatDBEntry *dbentry;
-	PgStat_StatTabEntry *tabentry;
-
-	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
 
+	dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+	if (dbentry == NULL)
+		return;
 	for (int i = 0; i < msg->m_nancestors; i++)
 	{
-		Oid			ancestor_relid = msg->m_ancestors[i];
+		PgStat_StatTabEntry *tabentry;
 		PgStat_StatTabEntry *ancestor;
 
-		ancestor = pgstat_get_tab_entry(dbentry, ancestor_relid, true);
+		tabentry = pgstat_get_tab_entry(dbentry, msg->m_ancestors[i].m_partition_id, false);
+		if (tabentry == NULL)
+			continue;
+		ancestor = pgstat_get_tab_entry(dbentry, msg->m_ancestors[i].m_ancestor_id, true);
+		if (ancestor == NULL)
+			continue;
+
 		ancestor->changes_since_analyze +=
 			tabentry->changes_since_analyze - tabentry->changes_since_analyze_reported;
+		tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze;
 	}
-
-	tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze;
-
 }
 
 /* ----------
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index c8c7bc1d99..6b16f92f22 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -22,6 +22,7 @@
 extern Oid	get_partition_parent(Oid relid, bool even_if_detached);
 extern List *get_partition_ancestors(Oid relid);
 extern Oid	index_get_partition(Relation partition, Oid indexId);
+extern void partition_analyze_report_ancestors(List *partitions);
 extern List *map_partition_varattnos(List *expr, int fromrel_varno,
 									 Relation to_rel, Relation from_rel);
 extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2068a68a5f..bbca2d7755 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -433,22 +433,27 @@ typedef struct PgStat_MsgAnalyze
 
 /* ----------
  * PgStat_MsgAnlAncestors		Sent by the backend or autovacuum daemon
- *								to inform partitioned tables that are
- *								ancestors of a partition, to propagate
+ *								to inform partitioned table that's
+ *								top-most ancestor of a partition, to propagate
  *								analyze counters
  * ----------
  */
-#define PGSTAT_NUM_ANCESTORENTRIES    \
-	((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(Oid) - sizeof(int))	\
-	 / sizeof(Oid))
+#define PGSTAT_NUM_ANCESTORENTRIES	\
+	((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int)) \
+	 / sizeof(PgStat_AnlAncestor))
+
+typedef struct PgStat_AnlAncestor
+{
+	Oid			m_partition_id;
+	Oid			m_ancestor_id;
+} PgStat_AnlAncestor;
 
 typedef struct PgStat_MsgAnlAncestors
 {
 	PgStat_MsgHdr m_hdr;
 	Oid			m_databaseid;
-	Oid			m_tableoid;
 	int			m_nancestors;
-	Oid			m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
+	PgStat_AnlAncestor m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
 } PgStat_MsgAnlAncestors;
 
 /* ----------
@@ -1038,7 +1043,7 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-extern void pgstat_report_anl_ancestors(Oid relid);
+extern void pgstat_report_anl_ancestors(List *partitions, List *ancestors);
 
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
-- 
2.20.1

0002-Have-expand_vacuum_rel-put-the-parent-table-last.patch (text/x-patch)
From 705e795b5754295280edb26a3caf3627119c0e0e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 13 Aug 2021 14:41:47 -0400
Subject: [PATCH 2/2] Have expand_vacuum_rel put the parent table last

---
 src/backend/commands/vacuum.c        | 29 ++++++++++++++++++----------
 src/test/regress/expected/vacuum.out |  6 +++---
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5c4bc15b44..d94d32af2c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -760,6 +760,7 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 		Oid			relid;
 		HeapTuple	tuple;
 		Form_pg_class classForm;
+		bool		include_toprel = false;
 		bool		include_parts;
 		int			rvr_opts;
 
@@ -809,20 +810,15 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 		classForm = (Form_pg_class) GETSTRUCT(tuple);
 
 		/*
-		 * Make a returnable VacuumRelation for this rel if user is a proper
-		 * owner.
+		 * Decide whether to include the relation itself.  We'll put it at
+		 * the end of the list, if partitions are involved.
 		 */
 		if (vacuum_is_relation_owner(relid, classForm, options))
-		{
-			oldcontext = MemoryContextSwitchTo(vac_context);
-			vacrels = lappend(vacrels, makeVacuumRelation(vrel->relation,
-														  relid,
-														  vrel->va_cols));
-			MemoryContextSwitchTo(oldcontext);
-		}
-
+			include_toprel = true;
 
+		/* Decide if processing partitions is necessary */
 		include_parts = (classForm->relkind == RELKIND_PARTITIONED_TABLE);
+
 		ReleaseSysCache(tuple);
 
 		/*
@@ -859,6 +855,19 @@ expand_vacuum_rel(VacuumRelation *vrel, int options)
 			}
 		}
 
+		/*
+		 * Make a returnable VacuumRelation for this rel, if deemed possible
+		 * above.
+		 */
+		if (include_toprel)
+		{
+			oldcontext = MemoryContextSwitchTo(vac_context);
+			vacrels = lappend(vacrels, makeVacuumRelation(vrel->relation,
+														  relid,
+														  vrel->va_cols));
+			MemoryContextSwitchTo(oldcontext);
+		}
+
 		/*
 		 * Release lock again.  This means that by the time we actually try to
 		 * process the table, it might be gone or renamed.  In the former case
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 3e70e4c788..ee4e3fbf0a 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -196,9 +196,9 @@ VACUUM (FULL) vacparted;
 VACUUM (FREEZE) vacparted;
 -- check behavior with duplicate column mentions
 VACUUM ANALYZE vacparted(a,b,a);
-ERROR:  column "a" of relation "vacparted" appears more than once
+ERROR:  column "a" of relation "vacparted1" appears more than once
 ANALYZE vacparted(a,b,b);
-ERROR:  column "b" of relation "vacparted" appears more than once
+ERROR:  column "b" of relation "vacparted1" appears more than once
 -- partitioned table with index
 CREATE TABLE vacparted_i (a int primary key, b varchar(100))
   PARTITION BY HASH (a);
@@ -239,7 +239,7 @@ ANALYZE vacparted (b), vactst;
 ANALYZE vactst, does_not_exist, vacparted;
 ERROR:  relation "does_not_exist" does not exist
 ANALYZE vactst (i), vacparted (does_not_exist);
-ERROR:  column "does_not_exist" of relation "vacparted" does not exist
+ERROR:  column "does_not_exist" of relation "vacparted1" does not exist
 ANALYZE vactst, vactst;
 BEGIN;  -- ANALYZE behaves differently inside a transaction block
 ANALYZE vactst, vactst;
-- 
2.20.1

#96Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Álvaro Herrera (#95)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Aug-13, Álvaro Herrera wrote:

> Some doc changes are pending, and some more commentary in parts of the
> code, but I think this is much more sensible. I do lament the lack of
> a syscache for pg_inherits.

Thinking about this again, this one here is the killer problem, I think;
this behaves pretty horribly if you have more than one partition level,
because it'll have to do one indexscan *per level per partition*. (For
example, five partitions two levels down mean ten index scans.) There's
no cache for this, and no way to disable it. So for situations with a
lot of partitions, it could be troublesome. Granted, it only needs to
be done for partitions with DML changes since the previous autovacuum
worker run in the affected database, but still it could be significant.

Now we could perhaps have a hash table in partition_analyze_report_ancestors()
to avoid the need for repeated indexscans for partitions of the same
hierarchy (an open-coded cache to take the place of the missing
pg_inherits syscache); and perhaps even use a single seqscan of
pg_inherits to capture the whole story first and then filter down to the
partitions that we were asked to process ... (so are we building a
mini-optimizer to determine which strategy to use in each case?).
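
To make the open-coded cache idea concrete, here is a minimal sketch in C. It
is not code from the patch: the parent_of array is a hypothetical stand-in for
pg_inherits, lookup_parent() models one index scan, and all names below are
made up for illustration. It compares walking up the hierarchy for every
partition with remembering the topmost ancestor of each relation already
visited.

```c
#include <assert.h>
#include <stdbool.h>

#define NRELS		8
#define INVALID_REL	(-1)

/*
 * Hypothetical stand-in for pg_inherits: parent_of[i] is the parent of
 * relation i, or INVALID_REL if it has none.
 */
static const int parent_of[NRELS] = {
	INVALID_REL,		/* 0: topmost partitioned table */
	0,					/* 1: intermediate partitioned table */
	1, 1, 1, 1, 1,		/* 2..6: leaf partitions, two levels down */
	INVALID_REL			/* 7: unrelated plain table */
};

static int	nlookups;			/* number of simulated index scans */
static bool cached[NRELS];		/* is topmost[i] known yet? */
static int	topmost[NRELS];		/* remembered topmost ancestor of i */

/* Each call models one index scan of pg_inherits. */
static int
lookup_parent(int relid)
{
	assert(relid >= 0 && relid < NRELS);
	nlookups++;
	return parent_of[relid];
}

/* Walk up one level at a time, remembering nothing between calls. */
int
topmost_naive(int relid)
{
	for (;;)
	{
		int			parent = lookup_parent(relid);

		if (parent == INVALID_REL)
			return relid;		/* no parent: relid is its own topmost */
		relid = parent;
	}
}

/* Same walk, but reuse the answer for any relation seen before. */
int
topmost_cached(int relid)
{
	int			parent;

	if (cached[relid])
		return topmost[relid];

	parent = lookup_parent(relid);
	topmost[relid] = (parent == INVALID_REL) ? relid : topmost_cached(parent);
	cached[relid] = true;
	return topmost[relid];
}

/* Resolve the five leaf partitions without a cache; return the scan count. */
int
count_naive(void)
{
	nlookups = 0;
	for (int relid = 2; relid <= 6; relid++)
		(void) topmost_naive(relid);
	return nlookups;
}

/* Same, starting from an empty cache. */
int
count_cached(void)
{
	for (int i = 0; i < NRELS; i++)
		cached[i] = false;
	nlookups = 0;
	for (int relid = 2; relid <= 6; relid++)
		(void) topmost_cached(relid);
	return nlookups;
}
```

With this layout, count_naive() performs 15 lookups and count_cached()
performs 7. The sketch counts every lookup, including the final one that
finds no parent. A real implementation would presumably key a hash table by
OID rather than index a fixed array, and would still have to deal with
detach-pending partitions, which this sketch ignores.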

That all sounds like too much to be doing during beta.

So I'm leaning towards the idea that we need to revert the patch and
start over for pg15.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"La libertad es como el dinero; el que no la sabe emplear la pierde" (Alvarez)

#97Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Álvaro Herrera (#96)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Here's the reversal patch for the 14 branch. (It applies cleanly to
master, but there the unused member of PgStat_StatTabEntry needs to be
removed and catversion bumped.)

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)

Attachments:

0001-Revert-analyze-support-for-partitioned-tables.patch (text/x-diff; charset=utf-8)
From cad5b710a531ec6eefc8856177c68d594c60ac8c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 16 Aug 2021 10:56:07 -0400
Subject: [PATCH] Revert analyze support for partitioned tables

This reverts the following commits:
1b5617eb844cd2470a334c1d2eec66cf9b39c41a Describe (auto-)analyze behavior for partitioned tables
0e69f705cc1a3df273b38c9883fb5765991e04fe Set pg_class.reltuples for partitioned tables
41badeaba8beee7648ebe7923a41c04f1f3cb302 Document ANALYZE storage parameters for partitioned tables
0827e8af70f4653ba17ed773f123a60eadd9f9c9 autovacuum: handle analyze for partitioned tables

There are efficiency issues in this code when handling databases with
large numbers of partitions, and it doesn't look like there is any
trivial way to handle those.  There are some other issues as well.  It's
now too late in the cycle for nontrivial fixes, so we'll have to let
Postgres 14 users continue to manually ANALYZE their partitioned
tables, and hopefully we can fix the issues for Postgres 15.

I chose to keep [most of] be280cdad298 ("Don't reset relhasindex for
partitioned tables on ANALYZE") because, while we added it due to
0827e8af70f4, it is a reasonable change in its own right (since it
affects manual analyze as well as autovacuum-induced analyze) and
there's no reason to revert it.

I retained relkind 'p' in the definition of view pg_stat_user_tables,
because removing it would require a catversion bump.
Also, in pg14 only, I keep a struct member that was added in
PgStat_StatTabEntry to avoid breaking compatibility with existing stat
files, because changing that would require a catversion bump.

Backpatch to 14.

Discussion: https://postgr.es/m/20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
---
 doc/src/sgml/maintenance.sgml          |   6 --
 doc/src/sgml/perform.sgml              |   3 +-
 doc/src/sgml/ref/analyze.sgml          |  40 +++------
 doc/src/sgml/ref/create_table.sgml     |   8 +-
 doc/src/sgml/ref/pg_restore.sgml       |   6 +-
 src/backend/access/common/reloptions.c |  15 ++--
 src/backend/commands/analyze.c         |  52 +++---------
 src/backend/commands/tablecmds.c       |  47 +----------
 src/backend/postmaster/autovacuum.c    |  66 +++------------
 src/backend/postmaster/pgstat.c        | 108 +++----------------------
 src/include/pgstat.h                   |  26 +-----
 11 files changed, 57 insertions(+), 320 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 998a48fc25..36f975b1e5 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -817,12 +817,6 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
 </programlisting>
     is compared to the total number of tuples inserted, updated, or deleted
     since the last <command>ANALYZE</command>.
-    For partitioned tables, inserts, updates and deletes on partitions
-    are counted towards this threshold; however, DDL
-    operations such as <literal>ATTACH</literal>, <literal>DETACH</literal>
-    and <literal>DROP</literal> are not, so running a manual
-    <command>ANALYZE</command> is recommended if the partition added or
-    removed contains a statistically significant volume of data.
    </para>
 
    <para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ddd6c3ff3e..89ff58338e 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1767,8 +1767,7 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    <para>
     Whenever you have significantly altered the distribution of data
     within a table, running <link linkend="sql-analyze"><command>ANALYZE</command></link> is strongly recommended. This
-    includes bulk loading large amounts of data into the table as well as
-    attaching, detaching or dropping partitions.  Running
+    includes bulk loading large amounts of data into the table.  Running
     <command>ANALYZE</command> (or <command>VACUUM ANALYZE</command>)
     ensures that the planner has up-to-date statistics about the
     table.  With no statistics or obsolete statistics, the planner might
diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index 176c7cb225..c8fcebc161 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -250,38 +250,20 @@ ANALYZE [ VERBOSE ] [ <replaceable class="parameter">table_and_columns</replacea
   </para>
 
   <para>
-   If the table being analyzed is partitioned, <command>ANALYZE</command>
-   will gather statistics by sampling blocks randomly from its partitions;
-   in addition, it will recurse into each partition and update its statistics.
-   (However, in multi-level partitioning scenarios, each leaf partition
-   will only be analyzed once.)
-   By contrast, if the table being analyzed has inheritance children,
-   <command>ANALYZE</command> will gather statistics for it twice:
-   once on the rows of the parent table only, and a second time on the
-   rows of the parent table with all of its children.  This second set of
-   statistics is needed when planning queries that traverse the entire
-   inheritance tree.  The child tables themselves are not individually
-   analyzed in this case.
+    If the table being analyzed has one or more children,
+    <command>ANALYZE</command> will gather statistics twice: once on the
+    rows of the parent table only, and a second time on the rows of the
+    parent table with all of its children.  This second set of statistics
+    is needed when planning queries that traverse the entire inheritance
+    tree.  The autovacuum daemon, however, will only consider inserts or
+    updates on the parent table itself when deciding whether to trigger an
+    automatic analyze for that table.  If that table is rarely inserted into
+    or updated, the inheritance statistics will not be up to date unless you
+    run <command>ANALYZE</command> manually.
   </para>
 
   <para>
-   The autovacuum daemon counts inserts, updates and deletes in the
-   partitions to determine if auto-analyze is needed.  However, adding
-   or removing partitions does not affect autovacuum daemon decisions,
-   so triggering a manual <command>ANALYZE</command> is recommended
-   when this occurs.
-  </para>
-
-  <para>
-   Tuples changed in inheritance children do not count towards analyze
-   on the parent table.  If the parent table is empty or rarely modified,
-   it may never be processed by autovacuum.  It's necessary to
-   periodically run a manual <command>ANALYZE</command> to keep the
-   statistics of the table hierarchy up to date.
-  </para>
-
-  <para>
-    If any of the child tables or partitions are foreign tables whose foreign data wrappers
+    If any of the child tables are foreign tables whose foreign data wrappers
     do not support <command>ANALYZE</command>, those child tables are ignored while
     gathering inheritance statistics.
   </para>
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 15aed2f251..473a0a4aeb 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1374,8 +1374,8 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     If a table parameter value is set and the
     equivalent <literal>toast.</literal> parameter is not, the TOAST table
     will use the table's parameter value.
-    Except where noted, these parameters are not supported on partitioned
-    tables; however, you can specify them on individual leaf partitions.
+    Specifying these parameters for partitioned tables is not supported,
+    but you may specify them for individual leaf partitions.
    </para>
 
    <variablelist>
@@ -1457,8 +1457,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      If true, the autovacuum daemon will perform automatic <command>VACUUM</command>
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
-     This parameter can be set for partitioned tables to prevent autovacuum
-     from running <command>ANALYZE</command> on them.
      If false, this table will not be autovacuumed, except to prevent
      transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
      more about wraparound prevention.
@@ -1590,7 +1588,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      <para>
       Per-table value for <xref linkend="guc-autovacuum-analyze-threshold"/>
       parameter.
-      This parameter can be set for partitioned tables.
      </para>
     </listitem>
    </varlistentry>
@@ -1606,7 +1603,6 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      <para>
       Per-table value for <xref linkend="guc-autovacuum-analyze-scale-factor"/>
       parameter.
-      This parameter can be set for partitioned tables.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 35cd56297c..93ea937ac8 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -922,10 +922,8 @@ CREATE DATABASE foo WITH TEMPLATE template0;
 
   <para>
    Once restored, it is wise to run <command>ANALYZE</command> on each
-   restored table so the optimizer has useful statistics.
-   If the table is a partition or an inheritance child, it may also be useful
-   to analyze the parent to update statistics for the table hierarchy.
-   See <xref linkend="vacuum-for-statistics"/> and
+   restored table so the optimizer has useful statistics; see
+   <xref linkend="vacuum-for-statistics"/> and
    <xref linkend="autovacuum"/> for more information.
   </para>
 
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 7566265bcb..b5602f5323 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -108,7 +108,7 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST | RELOPT_KIND_PARTITIONED,
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
 			ShareUpdateExclusiveLock
 		},
 		true
@@ -237,7 +237,7 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_analyze_threshold",
 			"Minimum number of tuple inserts, updates or deletes prior to analyze",
-			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
+			RELOPT_KIND_HEAP,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0, INT_MAX
@@ -411,7 +411,7 @@ static relopt_real realRelOpts[] =
 		{
 			"autovacuum_analyze_scale_factor",
 			"Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples",
-			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
+			RELOPT_KIND_HEAP,
 			ShareUpdateExclusiveLock
 		},
 		-1, 0.0, 100.0
@@ -1979,11 +1979,12 @@ bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
 	/*
-	 * autovacuum_enabled, autovacuum_analyze_threshold and
-	 * autovacuum_analyze_scale_factor are supported for partitioned tables.
+	 * There are no options for partitioned tables yet, but this is able to do
+	 * some validation.
 	 */
-
-	return default_reloptions(reloptions, validate, RELOPT_KIND_PARTITIONED);
+	return (bytea *) build_reloptions(reloptions, validate,
+									  RELOPT_KIND_PARTITIONED,
+									  0, NULL, 0);
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0099a04bbe..8d7b38d170 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -626,8 +626,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								 PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE);
 
 	/*
-	 * Update pages/tuples stats in pg_class ... but not if we're doing
-	 * inherited stats.
+	 * Update pages/tuples stats in pg_class, and report ANALYZE to the stats
+	 * collector ... but not if we're doing inherited stats.
 	 *
 	 * We assume that VACUUM hasn't set pg_class.reltuples already, even
 	 * during a VACUUM ANALYZE.  Although VACUUM often updates pg_class,
@@ -668,47 +668,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								InvalidMultiXactId,
 								in_outer_xact);
 		}
-	}
-	else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		/*
-		 * Partitioned tables don't have storage, so we don't set any fields
-		 * in their pg_class entries except for reltuples, which is necessary
-		 * for auto-analyze to work properly, and relhasindex.
-		 */
-		vac_update_relstats(onerel, -1, totalrows,
-							0, hasindex, InvalidTransactionId,
-							InvalidMultiXactId,
-							in_outer_xact);
-	}
 
-	/*
-	 * Now report ANALYZE to the stats collector.  For regular tables, we do
-	 * it only if not doing inherited stats.  For partitioned tables, we only
-	 * do it for inherited stats. (We're never called for not-inherited stats
-	 * on partitioned tables anyway.)
-	 *
-	 * Reset the changes_since_analyze counter only if we analyzed all
-	 * columns; otherwise, there is still work for auto-analyze to do.
-	 */
-	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		/*
+		 * Now report ANALYZE to the stats collector.
+		 *
+		 * We deliberately don't report to the stats collector when doing
+		 * inherited stats, because the stats collector only tracks per-table
+		 * stats.
+		 *
+		 * Reset the changes_since_analyze counter only if we analyzed all
+		 * columns; otherwise, there is still work for auto-analyze to do.
+		 */
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
-
-	/*
-	 * If this is a manual analyze of all columns of a permanent leaf
-	 * partition, and not doing inherited stats, also let the collector know
-	 * about the ancestor tables of this partition.  Autovacuum does the
-	 * equivalent of this at the start of its run, so there's no reason to do
-	 * it there.
-	 */
-	if (!inh && !IsAutoVacuumWorkerProcess() &&
-		(va_cols == NIL) &&
-		onerel->rd_rel->relispartition &&
-		onerel->rd_rel->relkind == RELKIND_RELATION &&
-		onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
-	{
-		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
 	}
 
 	/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6dae7e99ac..bd3e701ca3 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -335,7 +335,6 @@ typedef struct ForeignTruncateInfo
 static void truncate_check_rel(Oid relid, Form_pg_class reltuple);
 static void truncate_check_perms(Oid relid, Form_pg_class reltuple);
 static void truncate_check_activity(Relation rel);
-static void truncate_update_partedrel_stats(List *parted_rels);
 static void RangeVarCallbackForTruncate(const RangeVar *relation,
 										Oid relId, Oid oldRelId, void *arg);
 static List *MergeAttributes(List *schema, List *supers, char relpersistence,
@@ -1739,7 +1738,6 @@ ExecuteTruncateGuts(List *explicit_rels,
 {
 	List	   *rels;
 	List	   *seq_relids = NIL;
-	List	   *parted_rels = NIL;
 	HTAB	   *ft_htab = NULL;
 	EState	   *estate;
 	ResultRelInfo *resultRelInfos;
@@ -1888,15 +1886,9 @@ ExecuteTruncateGuts(List *explicit_rels,
 	{
 		Relation	rel = (Relation) lfirst(cell);
 
-		/*
-		 * Save OID of partitioned tables for later; nothing else to do for
-		 * them here.
-		 */
+		/* Skip partitioned tables as there is nothing to do */
 		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-		{
-			parted_rels = lappend_oid(parted_rels, RelationGetRelid(rel));
 			continue;
-		}
 
 		/*
 		 * Build the lists of foreign tables belonging to each foreign server
@@ -2044,9 +2036,6 @@ ExecuteTruncateGuts(List *explicit_rels,
 		ResetSequence(seq_relid);
 	}
 
-	/* Reset partitioned tables' pg_class.reltuples */
-	truncate_update_partedrel_stats(parted_rels);
-
 	/*
 	 * Write a WAL record to allow this set of actions to be logically
 	 * decoded.
@@ -2193,40 +2182,6 @@ truncate_check_activity(Relation rel)
 	CheckTableNotInUse(rel, "TRUNCATE");
 }
 
-/*
- * Update pg_class.reltuples for all the given partitioned tables to 0.
- */
-static void
-truncate_update_partedrel_stats(List *parted_rels)
-{
-	Relation	pg_class;
-	ListCell   *lc;
-
-	pg_class = table_open(RelationRelationId, RowExclusiveLock);
-
-	foreach(lc, parted_rels)
-	{
-		Oid			relid = lfirst_oid(lc);
-		HeapTuple	tuple;
-		Form_pg_class rd_rel;
-
-		tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
-		if (!HeapTupleIsValid(tuple))
-			elog(ERROR, "could not find tuple for relation %u", relid);
-		rd_rel = (Form_pg_class) GETSTRUCT(tuple);
-		if (rd_rel->reltuples != (float4) 0)
-		{
-			rd_rel->reltuples = (float4) 0;
-
-			heap_inplace_update(pg_class, tuple);
-		}
-
-		heap_freetuple(tuple);
-	}
-
-	table_close(pg_class, RowExclusiveLock);
-}
-
 /*
  * storage_name
  *	  returns the name corresponding to a typstorage/attstorage enum value
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 912ef9cb54..fefc07e108 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -75,7 +75,6 @@
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
-#include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
 #include "commands/vacuum.h"
 #include "lib/ilist.h"
@@ -1970,7 +1969,6 @@ do_autovacuum(void)
 	int			effective_multixact_freeze_max_age;
 	bool		did_vacuum = false;
 	bool		found_concurrent_worker = false;
-	bool		updated = false;
 	int			i;
 
 	/*
@@ -2056,19 +2054,12 @@ do_autovacuum(void)
 	/*
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
-	 * We do this in three passes: First we let pgstat collector know about
-	 * the partitioned table ancestors of all partitions that have recently
-	 * acquired rows for analyze.  This informs the second pass about the
-	 * total number of tuple count in partitioning hierarchies.
-	 *
-	 * On the second pass, we collect the list of plain relations,
-	 * materialized views and partitioned tables.  On the third one we collect
-	 * TOAST tables.
-	 *
-	 * The reason for doing the third pass is that during it we want to use
-	 * the main relation's pg_class.reloptions entry if the TOAST table does
-	 * not have any, and we cannot obtain it unless we know beforehand what's
-	 * the main table OID.
+	 * We do this in two passes: on the first one we collect the list of plain
+	 * relations and materialized views, and on the second one we collect
+	 * TOAST tables. The reason for doing the second pass is that during it we
+	 * want to use the main relation's pg_class.reloptions entry if the TOAST
+	 * table does not have any, and we cannot obtain it unless we know
+	 * beforehand what's the main table OID.
 	 *
 	 * We need to check TOAST tables separately because in cases with short,
 	 * wide tables there might be proportionally much more activity in the
@@ -2077,44 +2068,7 @@ do_autovacuum(void)
 	relScan = table_beginscan_catalog(classRel, 0, NULL);
 
 	/*
-	 * First pass: before collecting the list of tables to vacuum, let stat
-	 * collector know about partitioned-table ancestors of each partition.
-	 */
-	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
-	{
-		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
-		Oid			relid = classForm->oid;
-		PgStat_StatTabEntry *tabentry;
-
-		/* Only consider permanent leaf partitions */
-		if (!classForm->relispartition ||
-			classForm->relkind == RELKIND_PARTITIONED_TABLE ||
-			classForm->relpersistence == RELPERSISTENCE_TEMP)
-			continue;
-
-		/*
-		 * No need to do this for partitions that haven't acquired any rows.
-		 */
-		tabentry = pgstat_fetch_stat_tabentry(relid);
-		if (tabentry &&
-			tabentry->changes_since_analyze -
-			tabentry->changes_since_analyze_reported > 0)
-		{
-			pgstat_report_anl_ancestors(relid);
-			updated = true;
-		}
-	}
-
-	/* Acquire fresh stats for the next passes, if needed */
-	if (updated)
-	{
-		autovac_refresh_stats();
-		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-		shared = pgstat_fetch_stat_dbentry(InvalidOid);
-	}
-
-	/*
-	 * On the second pass, we collect main tables to vacuum, and also the main
+	 * On the first pass, we collect main tables to vacuum, and also the main
 	 * table relid to TOAST relid mapping.
 	 */
 	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
@@ -2128,8 +2082,7 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW &&
-			classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			classForm->relkind != RELKIND_MATVIEW)
 			continue;
 
 		relid = classForm->oid;
@@ -2204,7 +2157,7 @@ do_autovacuum(void)
 
 	table_endscan(relScan);
 
-	/* third pass: check TOAST tables */
+	/* second pass: check TOAST tables */
 	ScanKeyInit(&key,
 				Anum_pg_class_relkind,
 				BTEqualStrategyNumber, F_CHAREQ,
@@ -2797,7 +2750,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
-		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_PARTITIONED_TABLE ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
 	relopts = extractRelOptions(tup, pg_class_desc, NULL);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ce8888cc30..7fcc3f6ded 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,7 +38,6 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
-#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -345,7 +344,6 @@ static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
@@ -1599,9 +1597,6 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
- * Exceptional support only changes_since_analyze for partitioned tables,
- * though they don't have any data.  This counter will tell us whether
- * partitioned tables need autoanalyze or not.
  * --------
  */
 void
@@ -1623,31 +1618,21 @@ pgstat_report_analyze(Relation rel,
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
 	 * committing.)
-	 *
-	 * For partitioned tables, we don't report live and dead tuples, because
-	 * such tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			/* If this rel is partitioned, skip modifying */
-			livetuples = deadtuples = 0;
-		else
+		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
 		{
-			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
-			{
-				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
-			}
-			/* count stuff inserted by already-aborted subxacts, too */
-			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-			/* Since ANALYZE's counts are estimates, we could have underflowed */
-			livetuples = Max(livetuples, 0);
-			deadtuples = Max(deadtuples, 0);
+			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
 		}
-
+		/* count stuff inserted by already-aborted subxacts, too */
+		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+		/* Since ANALYZE's counts are estimates, we could have underflowed */
+		livetuples = Max(livetuples, 0);
+		deadtuples = Max(deadtuples, 0);
 	}
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
@@ -1659,48 +1644,6 @@ pgstat_report_analyze(Relation rel,
 	msg.m_live_tuples = livetuples;
 	msg.m_dead_tuples = deadtuples;
 	pgstat_send(&msg, sizeof(msg));
-
-}
-
-/*
- * pgstat_report_anl_ancestors
- *
- *	Send list of partitioned table ancestors of the given partition to the
- *	collector.  The collector is in charge of propagating the analyze tuple
- *	counts from the partition to its ancestors.  This is necessary so that
- *	other processes can decide whether to analyze the partitioned tables.
- */
-void
-pgstat_report_anl_ancestors(Oid relid)
-{
-	PgStat_MsgAnlAncestors msg;
-	List	   *ancestors;
-	ListCell   *lc;
-
-	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANL_ANCESTORS);
-	msg.m_databaseid = MyDatabaseId;
-	msg.m_tableoid = relid;
-	msg.m_nancestors = 0;
-
-	ancestors = get_partition_ancestors(relid);
-	foreach(lc, ancestors)
-	{
-		Oid			ancestor = lfirst_oid(lc);
-
-		msg.m_ancestors[msg.m_nancestors] = ancestor;
-		if (++msg.m_nancestors >= PGSTAT_NUM_ANCESTORENTRIES)
-		{
-			pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
-						msg.m_nancestors * sizeof(Oid));
-			msg.m_nancestors = 0;
-		}
-	}
-
-	if (msg.m_nancestors > 0)
-		pgstat_send(&msg, offsetof(PgStat_MsgAnlAncestors, m_ancestors[0]) +
-					msg.m_nancestors * sizeof(Oid));
-
-	list_free(ancestors);
 }
 
 /* --------
@@ -2039,8 +1982,7 @@ pgstat_initstats(Relation rel)
 	char		relkind = rel->rd_rel->relkind;
 
 	/* We only count stats for things that have storage */
-	if (!RELKIND_HAS_STORAGE(relkind) &&
-		relkind != RELKIND_PARTITIONED_TABLE)
+	if (!RELKIND_HAS_STORAGE(relkind))
 	{
 		rel->pgstat_info = NULL;
 		return;
@@ -3370,10 +3312,6 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_analyze(&msg.msg_analyze, len);
 					break;
 
-				case PGSTAT_MTYPE_ANL_ANCESTORS:
-					pgstat_recv_anl_ancestors(&msg.msg_anl_ancestors, len);
-					break;
-
 				case PGSTAT_MTYPE_ARCHIVER:
 					pgstat_recv_archiver(&msg.msg_archiver, len);
 					break;
@@ -3588,7 +3526,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 		result->n_live_tuples = 0;
 		result->n_dead_tuples = 0;
 		result->changes_since_analyze = 0;
-		result->changes_since_analyze_reported = 0;
 		result->inserts_since_vacuum = 0;
 		result->blocks_fetched = 0;
 		result->blocks_hit = 0;
@@ -4870,7 +4807,6 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-			tabentry->changes_since_analyze_reported = 0;
 			tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
 			tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
@@ -5268,10 +5204,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	 * have no good way to estimate how many of those there were.
 	 */
 	if (msg->m_resetcounter)
-	{
 		tabentry->changes_since_analyze = 0;
-		tabentry->changes_since_analyze_reported = 0;
-	}
 
 	if (msg->m_autovacuum)
 	{
@@ -5285,29 +5218,6 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	}
 }
 
-static void
-pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len)
-{
-	PgStat_StatDBEntry *dbentry;
-	PgStat_StatTabEntry *tabentry;
-
-	dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-	tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-	for (int i = 0; i < msg->m_nancestors; i++)
-	{
-		Oid			ancestor_relid = msg->m_ancestors[i];
-		PgStat_StatTabEntry *ancestor;
-
-		ancestor = pgstat_get_tab_entry(dbentry, ancestor_relid, true);
-		ancestor->changes_since_analyze +=
-			tabentry->changes_since_analyze - tabentry->changes_since_analyze_reported;
-	}
-
-	tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze;
-
-}
 
 /* ----------
  * pgstat_recv_archiver() -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9612c0a6c2..f779b48b8c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -69,7 +69,6 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_AUTOVAC_START,
 	PGSTAT_MTYPE_VACUUM,
 	PGSTAT_MTYPE_ANALYZE,
-	PGSTAT_MTYPE_ANL_ANCESTORS,
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
@@ -107,7 +106,7 @@ typedef int64 PgStat_Counter;
  *
  * tuples_inserted/updated/deleted/hot_updated count attempted actions,
  * regardless of whether the transaction committed.  delta_live_tuples,
- * delta_dead_tuples, changed_tuples are set depending on commit or abort.
+ * delta_dead_tuples, and changed_tuples are set depending on commit or abort.
  * Note that delta_live_tuples and delta_dead_tuples can be negative!
  * ----------
  */
@@ -430,25 +429,6 @@ typedef struct PgStat_MsgAnalyze
 	PgStat_Counter m_dead_tuples;
 } PgStat_MsgAnalyze;
 
-/* ----------
- * PgStat_MsgAnlAncestors		Sent by the backend or autovacuum daemon
- *								to inform partitioned tables that are
- *								ancestors of a partition, to propagate
- *								analyze counters
- * ----------
- */
-#define PGSTAT_NUM_ANCESTORENTRIES    \
-	((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(Oid) - sizeof(int))	\
-	 / sizeof(Oid))
-
-typedef struct PgStat_MsgAnlAncestors
-{
-	PgStat_MsgHdr m_hdr;
-	Oid			m_databaseid;
-	Oid			m_tableoid;
-	int			m_nancestors;
-	Oid			m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
-} PgStat_MsgAnlAncestors;
 
 /* ----------
  * PgStat_MsgArchiver			Sent by the archiver to update statistics.
@@ -697,7 +677,6 @@ typedef union PgStat_Msg
 	PgStat_MsgAutovacStart msg_autovacuum_start;
 	PgStat_MsgVacuum msg_vacuum;
 	PgStat_MsgAnalyze msg_analyze;
-	PgStat_MsgAnlAncestors msg_anl_ancestors;
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
@@ -793,7 +772,7 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter n_live_tuples;
 	PgStat_Counter n_dead_tuples;
 	PgStat_Counter changes_since_analyze;
-	PgStat_Counter changes_since_analyze_reported;
+	PgStat_Counter unused_counter;	/* kept for ABI compatibility */
 	PgStat_Counter inserts_since_vacuum;
 
 	PgStat_Counter blocks_fetched;
@@ -1002,7 +981,6 @@ extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
-extern void pgstat_report_anl_ancestors(Oid relid);
 
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
-- 
2.20.1

#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: Álvaro Herrera (#97)
Re: Autovacuum on partitioned table (autoanalyze)

Álvaro Herrera <alvherre@alvh.no-ip.org> writes:

Here's the reversal patch for the 14 branch. (It applies cleanly to
master, but the unused member of PgStat_StatTabEntry needs to be
removed and catversion bumped).

I don't follow the connection to catversion?

I agree that we probably don't want to change PgStat_StatTabEntry in
v14 at this point. But it'd be a good idea to attach a comment to
the entry saying it's unused but left there for ABI reasons.

regards, tom lane

#99Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#98)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Aug-16, Tom Lane wrote:

Álvaro Herrera <alvherre@alvh.no-ip.org> writes:

Here's the reversal patch for the 14 branch. (It applies cleanly to
master, but the unused member of PgStat_StatTabEntry needs to be
removed and catversion bumped).

I don't follow the connection to catversion?

Sorry, I misspoke -- I mean PGSTAT_FORMAT_FILE_ID. I shouldn't just
change it, since if I do then the file is reported as corrupted and all
counters are lost. So in the posted patch I did as you suggest:

I agree that we probably don't want to change PgStat_StatTabEntry in
v14 at this point. But it'd be a good idea to attach a comment to
the entry saying it's unused but left there for ABI reasons.

It's only in branch master that I'd change the pgstat format version and
remove the field. This is what I meant with the patch being for v14 and
a tweak needed for this in master.
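
As a rough illustration of the ABI point (a sketch with hypothetical, cut-down structs, not the real PgStat_StatTabEntry): swapping the removed counter for a same-sized placeholder leaves the struct size and the offsets of later members unchanged, whereas removing the field changes the layout, which is why master would also need a new PGSTAT_FORMAT_FILE_ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t Counter;        /* stand-in for PgStat_Counter */

/* Hypothetical cut-down entry as it looked before the revert. */
typedef struct TabEntryBefore
{
    Counter changes_since_analyze;
    Counter changes_since_analyze_reported;
    Counter inserts_since_vacuum;
} TabEntryBefore;

/* v14 approach: keep a same-sized placeholder where the field used to be. */
typedef struct TabEntryPlaceholder
{
    Counter changes_since_analyze;
    Counter unused_counter;     /* kept only so the layout does not move */
    Counter inserts_since_vacuum;
} TabEntryPlaceholder;

/* master approach: drop the field, which shifts the members after it. */
typedef struct TabEntryRemoved
{
    Counter changes_since_analyze;
    Counter inserts_since_vacuum;
} TabEntryRemoved;

/* Returns 1 if two layouts agree on total size and on one member offset. */
static int
same_layout(size_t size_a, size_t off_a, size_t size_b, size_t off_b)
{
    return size_a == size_b && off_a == off_b;
}

/* Aborts (via assert) if the placeholder layout ever drifts. */
static void
check_placeholder_layout(void)
{
    assert(same_layout(sizeof(TabEntryBefore),
                       offsetof(TabEntryBefore, inserts_since_vacuum),
                       sizeof(TabEntryPlaceholder),
                       offsetof(TabEntryPlaceholder, inserts_since_vacuum)));
}
```

The same comparison against TabEntryRemoved fails, since both the size and the offset of inserts_since_vacuum shrink by one counter.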

A catversion bump would be required to change the definition of
pg_stat_user_tables, which the patch being reverted originally changed
to include relkind 'p'. A straight revert would remove that, but in my
reversal patch I chose to keep it in place.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)

#100Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Álvaro Herrera (#99)
Re: Autovacuum on partitioned table (autoanalyze)

Another possible problem is that before the revert, we accept
ALTER TABLE some_partitioned_table SET (autovacuum_enabled=on/off);
(also autovacuum_analyze_scale_factor and autovacuum_analyze_threshold)
but after the revert this will throw a syntax error. What do people
think we should do about that?

1. Do nothing. If somebody finds themselves in that situation, they can use
ALTER TABLE .. RESET ...
to remove the settings.

2. Silently accept the option and do nothing.
3. Accept the option and throw a warning that it's a no-op.
4. Something else

Opinions?

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")

#101Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Álvaro Herrera (#97)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Aug-16, Álvaro Herrera wrote:

Here's the reversal patch for the 14 branch. (It applies cleanly to
master, but the unused member of PgStat_StatTabEntry needs to be
removed and catversion bumped).

I have pushed this to both branches. (I did not remove the item from
the release notes in the 14 branch.)

It upsets me to have reverted it, but after spending so much time trying
to correct the problems, I believe it just wasn't salvageable within the
beta-period code freeze constraints. I described the issues I ran into
in earlier messages; I think a good starting point to re-develop this is
to revert the reversal commit, then apply my patch at
/messages/by-id/0794d7ca-5183-486b-9c5e-6d434867cecd@www.fastmail.com
then do something about the remaining problems that were complained
about. (Maybe: add an "ancestor OID" member to PgStat_StatTabEntry so
that the collector knows to propagate counts from children to ancestors
when the upd/ins/del counts are received. However, consider developing
it as follow-up to Horiguchi-san's shmem pgstat rather than current
pgstat implementation.)
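
The "ancestor OID" idea in the parenthetical could look roughly like the following. This is only a sketch: the TabEntry struct, the flat array lookup and receive_changes() are hypothetical, and treating the new member as the direct parent (so that the collector walks the chain upwards) is an assumption, not something settled in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)

/* Hypothetical, cut-down per-table stats entry. */
typedef struct TabEntry
{
    Oid     tableoid;
    Oid     ancestor;           /* assumed: direct parent, or InvalidOid */
    int64_t changes_since_analyze;
} TabEntry;

/* Linear search stands in for the collector's per-database hash lookup. */
static TabEntry *
find_entry(TabEntry *entries, size_t n, Oid oid)
{
    for (size_t i = 0; i < n; i++)
        if (entries[i].tableoid == oid)
            return &entries[i];
    return NULL;
}

/*
 * Apply a batch of ins/upd/del counts to a table and to every ancestor
 * above it.  The walk is bounded by n so a bogus cycle cannot loop forever.
 */
static void
receive_changes(TabEntry *entries, size_t n, Oid tableoid, int64_t delta)
{
    assert(delta >= 0);
    for (size_t depth = 0; depth < n && tableoid != InvalidOid; depth++)
    {
        TabEntry *entry = find_entry(entries, n, tableoid);

        if (entry == NULL)
            break;
        entry->changes_since_analyze += delta;
        tableoid = entry->ancestor;
    }
}
```

In a three-level hierarchy, counts reported for two leaf partitions accumulate on their shared parent and on the root as soon as they arrive, with no separate pass over pg_class.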

Thanks

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/

#102Justin Pryzby
pryzby@telsasoft.com
In reply to: Álvaro Herrera (#101)
Re: Autovacuum on partitioned table (autoanalyze)

On Mon, Aug 16, 2021 at 05:42:48PM -0400, Álvaro Herrera wrote:

On 2021-Aug-16, Álvaro Herrera wrote:

Here's the reversal patch for the 14 branch. (It applies cleanly to
master, but the unused member of PgStat_StatTabEntry needs to be
removed and catversion bumped).

I have pushed this to both branches. (I did not remove the item from
the release notes in the 14 branch.)

| I retained the addition of relkind 'p' to tables included by
| pg_stat_user_tables, because reverting that would require a catversion
| bump.

Right now, on v15dev, it shows 0, which is misleading.
Shouldn't it be null ?

analyze_count | 0

Note that having analyze_count and last_analyze would be an independently
useful change. Since parent tables aren't analyzed automatically, I have a
script to periodically process them if they weren't processed recently. Right
now, for partitioned tables, the best I could find is to check its partitions:
| MIN(last_analyze) FROM pg_stat_all_tables psat JOIN pg_inherits i ON psat.relid=i.inhrelid

In 20200418050815.GE26953@telsasoft.com I wrote:
|This patch includes partitioned tables in pg_stat_*_tables, which is great; I
|complained awhile ago that they were missing [0]. It might be useful if that
|part was split out into a separate 0001 patch (?).
| [0] /messages/by-id/20180601221428.GU5164@telsasoft.com

--
Justin

#103Andres Freund
andres@anarazel.de
In reply to: Álvaro Herrera (#101)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

On 2021-08-16 17:42:48 -0400, Álvaro Herrera wrote:

On 2021-Aug-16, Álvaro Herrera wrote:

Here's the reversal patch for the 14 branch. (It applies cleanly to
master, but the unused member of PgStat_StatTabEntry needs to be
removed and catversion bumped).

I have pushed this to both branches. (I did not remove the item from
the release notes in the 14 branch.)

It upsets me to have reverted it, but after spending so much time trying
to correct the problems, I believe it just wasn't salvageable within the
beta-period code freeze constraints.

:(

I described the issues I ran into
in earlier messages; I think a good starting point to re-develop this is
to revert the reversal commit, then apply my patch at
/messages/by-id/0794d7ca-5183-486b-9c5e-6d434867cecd@www.fastmail.com
then do something about the remaining problems that were complained
about. (Maybe: add an "ancestor OID" member to PgStat_StatTabEntry so
that the collector knows to propagate counts from children to ancestors
when the upd/ins/del counts are received.

My suspicion is that it'd be a lot easier to implement this efficiently if
there were no propagation done outside of actually analyzing tables. I.e. have
do_autovacuum() build a hashtable of (parent_table_id, count) and use that to
make the analyze decisions. And then only propagate the counts up to the
parents of tables when a child is analyzed (and thus loses its
changes_since_analyze value). Then we can use hashtable_value +
changes_since_analyze for analyze decisions of partitioned tables.
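
A minimal sketch of that idea, under stated assumptions: the fixed-size open-addressing map, the ParentMap/ParentCount types and every function name here are hypothetical stand-ins (a real patch would presumably use PostgreSQL's own hash tables inside do_autovacuum()). The threshold test follows the usual autoanalyze rule, changes > autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)
#define PARENT_MAP_SLOTS 64     /* fixed capacity, power of two */

typedef struct ParentCount
{
    Oid     parent;             /* InvalidOid marks an empty slot */
    int64_t pending;            /* children's changes not yet analyzed */
} ParentCount;

typedef struct ParentMap
{
    ParentCount slots[PARENT_MAP_SLOTS];
} ParentMap;

static void
parent_map_init(ParentMap *map)
{
    memset(map, 0, sizeof(*map));
}

/* Linear probing; returns the slot for parent, optionally claiming one. */
static ParentCount *
parent_map_find(ParentMap *map, Oid parent, bool create)
{
    uint32_t pos = (parent * 2654435761u) & (PARENT_MAP_SLOTS - 1);

    assert(parent != InvalidOid);
    for (int probes = 0; probes < PARENT_MAP_SLOTS; probes++)
    {
        ParentCount *slot = &map->slots[pos];

        if (slot->parent == parent)
            return slot;
        if (slot->parent == InvalidOid)
        {
            if (!create)
                return NULL;
            slot->parent = parent;
            return slot;
        }
        pos = (pos + 1) & (PARENT_MAP_SLOTS - 1);
    }
    return NULL;                /* map is full */
}

/* Called once per leaf partition while scanning pg_class. */
static bool
parent_map_add(ParentMap *map, Oid parent, int64_t child_changes)
{
    ParentCount *slot = parent_map_find(map, parent, true);

    if (slot == NULL)
        return false;
    slot->pending += child_changes;
    return true;
}

static int64_t
parent_map_pending(ParentMap *map, Oid parent)
{
    ParentCount *slot = parent_map_find(map, parent, false);

    return slot ? slot->pending : 0;
}

/* hashtable_value + changes_since_analyze, compared with the threshold. */
static bool
parent_needs_analyze(ParentMap *map, Oid parent, double reltuples,
                     int64_t changes_since_analyze,
                     int64_t base_thresh, double scale_factor)
{
    double  threshold = (double) base_thresh + scale_factor * reltuples;
    int64_t total = parent_map_pending(map, parent) + changes_since_analyze;

    return (double) total > threshold;
}
```

With the default settings (threshold 50, scale factor 0.1) and a parent whose reltuples is 100, children holding 30 + 40 unanalyzed changes push the parent past the threshold of 60 even though the parent's own counter is still zero.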

I've prototyped this, and it does seem to make do_autovacuum() cheaper. I've
attached that prototype, but note it's in a rough state.

However, unless we change the way inheritance parents are stored, it still
requires repetitive get_partition_ancestors() (or get_partition_parent())
calls in do_autovacuum(), which I think is problematic due to the index scans
you pointed out as well. The obvious way to address that would be to store
parent oids in pg_class - I suspect duplicating parents in pg_class is the
best way out, but pretty it is not.

However, consider developing it as follow-up to Horiguchi-san's shmem
pgstat rather than current pgstat implementation.)

+1

It might be worth first tackling reusing samples from a relation's children
when building inheritance stats. Either by storing the samples somewhere (not
cheap) and reusing them, or by at least updating a partition's stats when
analyzing the parent.

Greetings,

Andres Freund

Attachments:

autovac-partitioned-via-hash.diff (text/x-diff; charset=us-ascii)
commit ec796bd8ee2970e2eae3b3839e1bb96696393dc7
Author: Andres Freund <andres@anarazel.de>
Date:   2021-07-30 17:20:21 -0700

    tmp
    
    Author:
    Reviewed-By:
    Discussion: https://postgr.es/m/
    Backpatch:

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0c9591415e4..df021215281 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -320,12 +320,12 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	PgStat_Counter startwritetime = 0;
 
 	if (inh)
-		ereport(elevel,
+		ereport(LOG,
 				(errmsg("analyzing \"%s.%s\" inheritance tree",
 						get_namespace_name(RelationGetNamespace(onerel)),
 						RelationGetRelationName(onerel))));
 	else
-		ereport(elevel,
+		ereport(LOG,
 				(errmsg("analyzing \"%s.%s\"",
 						get_namespace_name(RelationGetNamespace(onerel)),
 						RelationGetRelationName(onerel))));
@@ -682,6 +682,18 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 							in_outer_xact);
 	}
 
+	/*
+	 * Let the collector know about the ancestor tables of this partition.
+	 */
+	if (!inh &&
+		(va_cols == NIL) &&
+		onerel->rd_rel->relispartition &&
+		onerel->rd_rel->relkind == RELKIND_RELATION &&
+		onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+	{
+		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
+	}
+
 	/*
 	 * Now report ANALYZE to the stats collector.  For regular tables, we do
 	 * it only if not doing inherited stats.  For partitioned tables, we only
@@ -695,22 +707,6 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
 
-	/*
-	 * If this is a manual analyze of all columns of a permanent leaf
-	 * partition, and not doing inherited stats, also let the collector know
-	 * about the ancestor tables of this partition.  Autovacuum does the
-	 * equivalent of this at the start of its run, so there's no reason to do
-	 * it there.
-	 */
-	if (!inh && !IsAutoVacuumWorkerProcess() &&
-		(va_cols == NIL) &&
-		onerel->rd_rel->relispartition &&
-		onerel->rd_rel->relkind == RELKIND_RELATION &&
-		onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
-	{
-		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
-	}
-
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
 	 *
@@ -1183,6 +1179,8 @@ acquire_sample_rows(Relation onerel, int elevel,
 	BlockSamplerData prefetch_bs;
 #endif
 
+	elog(LOG, "acquiring %d sample rows for %s", targrows,
+		 NameStr(onerel->rd_rel->relname));
 	Assert(targrows > 0);
 
 	totalblocks = RelationGetNumberOfBlocks(onerel);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 912ef9cb54c..73a872371d1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -74,6 +74,7 @@
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
 #include "commands/dbcommands.h"
@@ -327,16 +328,21 @@ static void do_autovacuum(void);
 static void FreeWorkerInfo(int code, Datum arg);
 
 static autovac_table *table_recheck_autovac(Oid relid, HTAB *table_toast_map,
+											HTAB *table_partition_info_map,
 											TupleDesc pg_class_desc,
-											int effective_multixact_freeze_max_age);
-static void recheck_relation_needs_vacanalyze(Oid relid, AutoVacOpts *avopts,
+											int effective_multixact_freeze_max_age,
+											PgStat_Counter additional_analyze_rows);
+static void recheck_relation_needs_vacanalyze(Oid relid,
+											  AutoVacOpts *avopts,
 											  Form_pg_class classForm,
 											  int effective_multixact_freeze_max_age,
+											  PgStat_Counter additional_analyze_rows,
 											  bool *dovacuum, bool *doanalyze, bool *wraparound);
 static void relation_needs_vacanalyze(Oid relid, AutoVacOpts *relopts,
 									  Form_pg_class classForm,
 									  PgStat_StatTabEntry *tabentry,
 									  int effective_multixact_freeze_max_age,
+									  PgStat_Counter additional_analyze_rows,
 									  bool *dovacuum, bool *doanalyze, bool *wraparound);
 
 static void autovacuum_do_vac_analyze(autovac_table *tab,
@@ -1944,6 +1950,12 @@ get_database_list(void)
 	return dblist;
 }
 
+typedef struct AutovacPartitionInfo
+{
+	Oid partitioned_table_oid;
+	PgStat_Counter changes_since_analyze;
+} AutovacPartitionInfo;
+
 /*
  * Process a database table-by-table
  *
@@ -1961,16 +1973,15 @@ do_autovacuum(void)
 	List	   *orphan_oids = NIL;
 	HASHCTL		ctl;
 	HTAB	   *table_toast_map;
+	HTAB	   *partition_info_map;
 	ListCell   *volatile cell;
 	PgStat_StatDBEntry *shared;
 	PgStat_StatDBEntry *dbentry;
 	BufferAccessStrategy bstrategy;
-	ScanKeyData key;
 	TupleDesc	pg_class_desc;
 	int			effective_multixact_freeze_max_age;
 	bool		did_vacuum = false;
 	bool		found_concurrent_worker = false;
-	bool		updated = false;
 	int			i;
 
 	/*
@@ -2053,19 +2064,26 @@ do_autovacuum(void)
 								  &ctl,
 								  HASH_ELEM | HASH_BLOBS);
 
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(AutovacPartitionInfo);
+
+	partition_info_map = hash_create("Autovacuum Partitioned Table Rowcount Hash",
+									 100,
+									 &ctl,
+									 HASH_ELEM | HASH_BLOBS);
+
+
 	/*
 	 * Scan pg_class to determine which tables to vacuum.
 	 *
-	 * We do this in three passes: First we let pgstat collector know about
-	 * the partitioned table ancestors of all partitions that have recently
-	 * acquired rows for analyze.  This informs the second pass about the
-	 * total number of tuple count in partitioning hierarchies.
+	 * We do this in two passes:
 	 *
-	 * On the second pass, we collect the list of plain relations,
-	 * materialized views and partitioned tables.  On the third one we collect
-	 * TOAST tables.
+	 * On the first pass, we collect the list of plain relations and
+	 * materialized views.  For partitions, we add their pending changes to
+	 * each of their partitioned ancestors.  On the second pass, we collect
+	 * TOAST tables and partitioned tables.
 	 *
-	 * The reason for doing the third pass is that during it we want to use
+	 * The reason for doing the second pass is that during it we want to use
 	 * the main relation's pg_class.reloptions entry if the TOAST table does
 	 * not have any, and we cannot obtain it unless we know beforehand what's
 	 * the main table OID.
@@ -2077,44 +2095,7 @@ do_autovacuum(void)
 	relScan = table_beginscan_catalog(classRel, 0, NULL);
 
 	/*
-	 * First pass: before collecting the list of tables to vacuum, let stat
-	 * collector know about partitioned-table ancestors of each partition.
-	 */
-	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
-	{
-		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
-		Oid			relid = classForm->oid;
-		PgStat_StatTabEntry *tabentry;
-
-		/* Only consider permanent leaf partitions */
-		if (!classForm->relispartition ||
-			classForm->relkind == RELKIND_PARTITIONED_TABLE ||
-			classForm->relpersistence == RELPERSISTENCE_TEMP)
-			continue;
-
-		/*
-		 * No need to do this for partitions that haven't acquired any rows.
-		 */
-		tabentry = pgstat_fetch_stat_tabentry(relid);
-		if (tabentry &&
-			tabentry->changes_since_analyze -
-			tabentry->changes_since_analyze_reported > 0)
-		{
-			pgstat_report_anl_ancestors(relid);
-			updated = true;
-		}
-	}
-
-	/* Acquire fresh stats for the next passes, if needed */
-	if (updated)
-	{
-		autovac_refresh_stats();
-		dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-		shared = pgstat_fetch_stat_dbentry(InvalidOid);
-	}
-
-	/*
-	 * On the second pass, we collect main tables to vacuum, and also the main
+	 * On the first pass, we collect main tables to vacuum, and also the main
 	 * table relid to TOAST relid mapping.
 	 */
 	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
@@ -2128,8 +2109,7 @@ do_autovacuum(void)
 		bool		wraparound;
 
 		if (classForm->relkind != RELKIND_RELATION &&
-			classForm->relkind != RELKIND_MATVIEW &&
-			classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			classForm->relkind != RELKIND_MATVIEW)
 			continue;
 
 		relid = classForm->oid;
@@ -2167,6 +2147,7 @@ do_autovacuum(void)
 		/* Check if it needs vacuum or analyze */
 		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
 								  effective_multixact_freeze_max_age,
+								  0,
 								  &dovacuum, &doanalyze, &wraparound);
 
 		/* Relations that need work are added to table_oids */
@@ -2200,17 +2181,40 @@ do_autovacuum(void)
 				}
 			}
 		}
+
+		/* Sum up pending changes to leaf partitions in parent partition */
+		if (classForm->relispartition &&
+			classForm->relpersistence != RELPERSISTENCE_TEMP &&
+			(tabentry && tabentry->changes_since_analyze > 0))
+		{
+			List *ancestors = get_partition_ancestors(classForm->oid);
+			ListCell   *lc;
+
+			foreach(lc, ancestors)
+			{
+				Oid			ancestor_oid = lfirst_oid(lc);
+				AutovacPartitionInfo *ancestor;
+				bool found;
+
+				ancestor = hash_search(partition_info_map,
+									   &ancestor_oid,
+									   HASH_ENTER, &found);
+				if (!found)
+					ancestor->changes_since_analyze = 0;
+				ancestor->changes_since_analyze += tabentry->changes_since_analyze;
+
+				elog(LOG, "reporting up %lld changes from %u to ancestor %u",
+					 (long long) tabentry->changes_since_analyze, relid, ancestor_oid);
+			}
+
+			continue;
+		}
 	}
 
 	table_endscan(relScan);
 
-	/* third pass: check TOAST tables */
-	ScanKeyInit(&key,
-				Anum_pg_class_relkind,
-				BTEqualStrategyNumber, F_CHAREQ,
-				CharGetDatum(RELKIND_TOASTVALUE));
-
-	relScan = table_beginscan_catalog(classRel, 1, &key);
+	/* second pass: check TOAST and partitioned tables */
+	relScan = table_beginscan_catalog(classRel, 0, NULL);
 	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
 	{
 		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
@@ -2221,6 +2225,10 @@ do_autovacuum(void)
 		bool		doanalyze;
 		bool		wraparound;
 
+		if (classForm->relkind != RELKIND_TOASTVALUE &&
+			classForm->relkind != RELKIND_PARTITIONED_TABLE)
+			continue;
+
 		/*
 		 * We cannot safely process other backends' temp tables, so skip 'em.
 		 */
@@ -2228,33 +2236,63 @@ do_autovacuum(void)
 			continue;
 
 		relid = classForm->oid;
-
-		/*
-		 * fetch reloptions -- if this toast table does not have them, try the
-		 * main rel
-		 */
 		relopts = extract_autovac_opts(tuple, pg_class_desc);
-		if (relopts == NULL)
+
+		if (classForm->relkind == RELKIND_TOASTVALUE)
 		{
-			av_relation *hentry;
-			bool		found;
+			/*
+			 * If this toast table does not have a reloption, try the main rel
+			 */
+			if (relopts == NULL)
+			{
+				av_relation *hentry;
+				bool		found;
 
-			hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
-			if (found && hentry->ar_hasrelopts)
-				relopts = &hentry->ar_reloptions;
+				hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found);
+				if (found && hentry->ar_hasrelopts)
+					relopts = &hentry->ar_reloptions;
+			}
+
+			/* Fetch the pgstat entry for this table */
+			tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+												 shared, dbentry);
+
+			relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  0,
+									  &dovacuum, &doanalyze, &wraparound);
+
+			/* ignore analyze for toast tables */
+			if (dovacuum)
+				table_oids = lappend_oid(table_oids, relid);
 		}
+		else if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+		{
+			AutovacPartitionInfo *partition_info;
+			PgStat_Counter additional_analyze_rows = 0;
+			bool found;
 
-		/* Fetch the pgstat entry for this table */
-		tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-											 shared, dbentry);
+			tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
+												 shared, dbentry);
 
-		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
-								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+			partition_info = hash_search(partition_info_map, &relid, HASH_FIND, &found);
+			if (found)
+			{
+				elog(LOG, "additional changes to %u: %llu",
+					 relid, (long long unsigned) partition_info->changes_since_analyze);
+				additional_analyze_rows = partition_info->changes_since_analyze;
+			}
 
-		/* ignore analyze for toast tables */
-		if (dovacuum)
-			table_oids = lappend_oid(table_oids, relid);
+			relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
+									  effective_multixact_freeze_max_age,
+									  additional_analyze_rows,
+									  &dovacuum, &doanalyze, &wraparound);
+
+			Assert(!dovacuum && !wraparound);
+
+			if (doanalyze)
+				table_oids = lappend_oid(table_oids, relid);
+		}
 	}
 
 	table_endscan(relScan);
@@ -2369,12 +2407,14 @@ do_autovacuum(void)
 	{
 		Oid			relid = lfirst_oid(cell);
 		HeapTuple	classTup;
+		Form_pg_class classForm;
 		autovac_table *tab;
 		bool		isshared;
 		bool		skipit;
 		double		stdVacuumCostDelay;
 		int			stdVacuumCostLimit;
 		dlist_iter	iter;
+		PgStat_Counter additional_analyze_rows = 0;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -2405,7 +2445,8 @@ do_autovacuum(void)
 		classTup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
 		if (!HeapTupleIsValid(classTup))
 			continue;			/* somebody deleted the rel, forget it */
-		isshared = ((Form_pg_class) GETSTRUCT(classTup))->relisshared;
+		classForm = (Form_pg_class) GETSTRUCT(classTup);
+		isshared = classForm->relisshared;
 		ReleaseSysCache(classTup);
 
 		/*
@@ -2467,9 +2508,27 @@ do_autovacuum(void)
 		 * that somebody just finished vacuuming this table.  The window to
 		 * the race condition is not closed but it is very small.
 		 */
+
+		if (classForm->relkind == RELKIND_PARTITIONED_TABLE)
+		{
+			AutovacPartitionInfo *partition_info;
+			bool found;
+
+			partition_info = hash_search(partition_info_map, &relid, HASH_FIND, &found);
+			if (found)
+			{
+				elog(LOG, "additional changes to %u: %llu",
+					 relid, (long long unsigned) partition_info->changes_since_analyze);
+				additional_analyze_rows = partition_info->changes_since_analyze;
+			}
+		}
+
 		MemoryContextSwitchTo(AutovacMemCxt);
-		tab = table_recheck_autovac(relid, table_toast_map, pg_class_desc,
-									effective_multixact_freeze_max_age);
+		tab = table_recheck_autovac(relid,
+									table_toast_map, partition_info_map,
+									pg_class_desc,
+									effective_multixact_freeze_max_age,
+									additional_analyze_rows);
 		if (tab == NULL)
 		{
 			/* someone else vacuumed the table, or it went away */
@@ -2845,8 +2904,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
  */
 static autovac_table *
 table_recheck_autovac(Oid relid, HTAB *table_toast_map,
+					  HTAB *partition_info_map,
 					  TupleDesc pg_class_desc,
-					  int effective_multixact_freeze_max_age)
+					  int effective_multixact_freeze_max_age,
+					  PgStat_Counter additional_analyze_rows)
 {
 	Form_pg_class classForm;
 	HeapTuple	classTup;
@@ -2895,6 +2956,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 	{
 		recheck_relation_needs_vacanalyze(relid, avopts, classForm,
 										  effective_multixact_freeze_max_age,
+										  additional_analyze_rows,
 										  &dovacuum, &doanalyze, &wraparound);
 
 		/* Quick exit if a relation doesn't need to be vacuumed or analyzed */
@@ -2910,6 +2972,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 
 	recheck_relation_needs_vacanalyze(relid, avopts, classForm,
 									  effective_multixact_freeze_max_age,
+									  additional_analyze_rows,
 									  &dovacuum, &doanalyze, &wraparound);
 
 	/* OK, it needs something done */
@@ -3039,6 +3102,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
 								  AutoVacOpts *avopts,
 								  Form_pg_class classForm,
 								  int effective_multixact_freeze_max_age,
+								  PgStat_Counter additional_analyze_rows,
 								  bool *dovacuum,
 								  bool *doanalyze,
 								  bool *wraparound)
@@ -3058,6 +3122,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
 
 	relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
 							  effective_multixact_freeze_max_age,
+							  additional_analyze_rows,
 							  dovacuum, doanalyze, wraparound);
 
 	/* ignore ANALYZE for toast tables */
@@ -3108,6 +3173,7 @@ relation_needs_vacanalyze(Oid relid,
 						  Form_pg_class classForm,
 						  PgStat_StatTabEntry *tabentry,
 						  int effective_multixact_freeze_max_age,
+						  PgStat_Counter additional_analyze_rows,
  /* output params below */
 						  bool *dovacuum,
 						  bool *doanalyze,
@@ -3223,7 +3289,7 @@ relation_needs_vacanalyze(Oid relid,
 		reltuples = classForm->reltuples;
 		vactuples = tabentry->n_dead_tuples;
 		instuples = tabentry->inserts_since_vacuum;
-		anltuples = tabentry->changes_since_analyze;
+		anltuples = tabentry->changes_since_analyze + additional_analyze_rows;
 
 		/* If the table hasn't yet been vacuumed, take reltuples as zero */
 		if (reltuples < 0)
@@ -3252,6 +3318,14 @@ relation_needs_vacanalyze(Oid relid,
 			(vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
 		*doanalyze = (anltuples > anlthresh);
 	}
+	else if (additional_analyze_rows && AutoVacuumingActive())
+	{
+		anlthresh = (float4) anl_base_thresh;
+		anltuples = additional_analyze_rows;
+		*dovacuum = false;
+		*wraparound = false;
+		*doanalyze = (anltuples > anlthresh);
+	}
 	else
 	{
 		/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 11702f2a804..d1e9fd3f75c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4870,6 +4870,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 			tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
 			tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
 			tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
+			//tabentry->changes_since_analyze_reported = tabmsg->t_counts.t_changed_tuples;
 			tabentry->changes_since_analyze_reported = 0;
 			tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
 			tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
@@ -5269,8 +5270,11 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 	 */
 	if (msg->m_resetcounter)
 	{
+		elog(LOG, "resetting change_since_analyze of %u, was %lld/%lld",
+			 msg->m_tableoid, (long long) tabentry->changes_since_analyze,
+			 (long long) tabentry->changes_since_analyze_reported);
+		tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze;
 		tabentry->changes_since_analyze = 0;
-		tabentry->changes_since_analyze_reported = 0;
 	}
 
 	if (msg->m_autovacuum)
@@ -5301,12 +5305,10 @@ pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len)
 		PgStat_StatTabEntry *ancestor;
 
 		ancestor = pgstat_get_tab_entry(dbentry, ancestor_relid, true);
-		ancestor->changes_since_analyze +=
-			tabentry->changes_since_analyze - tabentry->changes_since_analyze_reported;
+		elog(LOG, "anl increase to %u of %lld",
+			 ancestor_relid, (long long) tabentry->changes_since_analyze);
+		ancestor->changes_since_analyze += tabentry->changes_since_analyze;
 	}
-
-	tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze;
-
 }
 
 /* ----------
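
The core idea of the WIP patch above can be sketched outside PostgreSQL. During the first pg_class pass, each leaf partition's changes_since_analyze is added to every ancestor in a side table; during the second pass, a partitioned table is queued for ANALYZE when its own counter plus the accumulated value exceeds the analyze threshold. What follows is only a minimal standalone sketch under stated assumptions: the fixed-size array, `lookup_or_create`, `accumulate_to_ancestors`, and `needs_analyze` are hypothetical stand-ins for the patch's hash table, `get_partition_ancestors()` loop, and `relation_needs_vacanalyze()`. The threshold shown (base threshold plus scale factor times reltuples) is the form autovacuum uses for ordinary tables; the patch's fallback branch for a partitioned table without a stats entry uses the base threshold alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RELS 16

typedef uint32_t Oid;

/* Hypothetical stand-in for AutovacPartitionInfo and its hash table. */
typedef struct
{
	Oid			partitioned_table_oid;
	int64_t		changes_since_analyze;
} PartitionInfo;

static PartitionInfo info_map[MAX_RELS];
static int	info_count = 0;

/* Find the entry for an ancestor, creating a zeroed one if it is missing. */
static PartitionInfo *
lookup_or_create(Oid oid)
{
	for (int i = 0; i < info_count; i++)
	{
		if (info_map[i].partitioned_table_oid == oid)
			return &info_map[i];
	}

	assert(info_count < MAX_RELS);	/* sketch only: fixed capacity */
	info_map[info_count].partitioned_table_oid = oid;
	info_map[info_count].changes_since_analyze = 0;
	return &info_map[info_count++];
}

/* First pass: credit a leaf's pending changes to each of its ancestors. */
static void
accumulate_to_ancestors(const Oid *ancestors, int nancestors, int64_t changes)
{
	for (int i = 0; i < nancestors; i++)
		lookup_or_create(ancestors[i])->changes_since_analyze += changes;
}

/* Second pass: compare own plus inherited changes against the threshold. */
static bool
needs_analyze(int64_t own_changes, int64_t additional_rows,
			  double reltuples, int base_thresh, double scale_factor)
{
	double		anlthresh = base_thresh + scale_factor * reltuples;
	double		anltuples = (double) (own_changes + additional_rows);

	return anltuples > anlthresh;
}
```

For instance, two leaves contributing 40 and 30 changes to the same parent accumulate to 70; with a base threshold of 50, a scale factor of 0.1, and reltuples of 100, the threshold is 60, so the parent would be queued for ANALYZE.
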
#104Andres Freund
andres@anarazel.de
In reply to: Álvaro Herrera (#100)
Re: Autovacuum on partitioned table (autoanalyze)

Hi,

On 2021-08-16 13:13:55 -0400, Álvaro Herrera wrote:

Another possible problem is that before the revert, we accept
ALTER TABLE some_partitioned_table SET (autovacuum_enabled=on/off);
(also autovacuum_analyze_scale_factor and autovacuum_analyze_threshold)
but after the revert this will throw a syntax error. What do people
think we should do about that?

1. Do nothing. If somebody finds themselves in that situation, they can use
ALTER TABLE .. RESET ...
to remove the settings.

2. Silently accept the option and do nothing.
3. Accept the option and throw a warning that it's a no-op.
4. Something else

1) seems OK to me.

Greetings,

Andres Freund

#105Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#102)
1 attachment(s)
Re: Autovacuum on partitioned table (autoanalyze)

On Mon, Aug 16, 2021 at 05:28:10PM -0500, Justin Pryzby wrote:

On Mon, Aug 16, 2021 at 05:42:48PM -0400, Álvaro Herrera wrote:

On 2021-Aug-16, Álvaro Herrera wrote:

Here's the reversal patch for the 14 branch. (It applies cleanly to
master, but the unused member of PgStat_StatTabEntry needs to be
removed and catversion bumped).

I have pushed this to both branches. (I did not remove the item from
the release notes in the 14 branch.)

| I retained the addition of relkind 'p' to tables included by
| pg_stat_user_tables, because reverting that would require a catversion
| bump.

Right now, on v15dev, it shows 0, which is misleading.
Shouldn't it be null ?

analyze_count | 0

Note that having analyze_count and last_analyze would be an independently
useful change. Since parent tables aren't analyzed automatically, I have a
script to periodically process them if they weren't processed recently. Right
now, for partitioned tables, the best I could find is to check its partitions:
| MIN(last_analyze) FROM pg_stat_all_tables psat JOIN pg_inherits i ON psat.relid=i.inhrelid

In 20200418050815.GE26953@telsasoft.com I wrote:
|This patch includes partitioned tables in pg_stat_*_tables, which is great; I
|complained awhile ago that they were missing [0]. It might be useful if that
|part was split out into a separate 0001 patch (?).
| [0] /messages/by-id/20180601221428.GU5164@telsasoft.com

I suggest the attached (which partially reverts the revert), to allow showing
correct data for analyze_count and last_analyzed.

Arguably these should be reported as null in v14 for partitioned tables, since
they're not "known to be zero", but rather "currently unpopulated".

n_mod_since_analyze | 0
n_ins_since_vacuum | 0

Justin

Attachments:

0001-Report-last_analyze-and-analyze_count-of-partitioned.patch (text/x-diff; charset=us-ascii)
From 0d0e149727d89115803b4528e15f5b3c04bd816b Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Mon, 16 Aug 2021 22:55:06 -0500
Subject: [PATCH] Report last_analyze and analyze_count of partitioned tables..

In v14, partitioned tables are included, but these fields are being reported as
zero, which is misleading.
---
 src/backend/commands/analyze.c  | 36 ++++++++++++++++++++++-----------
 src/backend/postmaster/pgstat.c | 27 +++++++++++++++++--------
 2 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8d7b38d170..0050df08f6 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -626,8 +626,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								 PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE);
 
 	/*
-	 * Update pages/tuples stats in pg_class, and report ANALYZE to the stats
-	 * collector ... but not if we're doing inherited stats.
+	 * Update pages/tuples stats in pg_class ... but not if we're doing
+	 * inherited stats.
 	 *
 	 * We assume that VACUUM hasn't set pg_class.reltuples already, even
 	 * during a VACUUM ANALYZE.  Although VACUUM often updates pg_class,
@@ -668,20 +668,32 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								InvalidMultiXactId,
 								in_outer_xact);
 		}
-
+	}
+	else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
 		/*
-		 * Now report ANALYZE to the stats collector.
-		 *
-		 * We deliberately don't report to the stats collector when doing
-		 * inherited stats, because the stats collector only tracks per-table
-		 * stats.
-		 *
-		 * Reset the changes_since_analyze counter only if we analyzed all
-		 * columns; otherwise, there is still work for auto-analyze to do.
+		 * Partitioned tables don't have storage, so we don't set any fields
+		 * in their pg_class entries except for reltuples, which is necessary
+		 * for auto-analyze to work properly, and relhasindex.
 		 */
+		vac_update_relstats(onerel, -1, totalrows,
+							0, hasindex, InvalidTransactionId,
+							InvalidMultiXactId,
+							in_outer_xact);
+	}
+
+	/*
+	 * Now report ANALYZE to the stats collector.  For regular tables, we do
+	 * it only if not doing inherited stats.  For partitioned tables, we only
+	 * do it for inherited stats. (We're never called for not-inherited stats
+	 * on partitioned tables anyway.)
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 							  (va_cols == NIL));
-	}
 
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a3c35bdf60..2a9673154b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1632,21 +1632,31 @@ pgstat_report_analyze(Relation rel,
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector ends up with the right numbers if we abort instead of
 	 * committing.)
+	 *
+	 * For partitioned tables, we don't report live and dead tuples, because
+	 * such tables don't have any data.
 	 */
 	if (rel->pgstat_info != NULL)
 	{
 		PgStat_TableXactStatus *trans;
 
-		for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			/* If this rel is partitioned, skip modifying */
+			livetuples = deadtuples = 0;
+		else
 		{
-			livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-			deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+			{
+				livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+				deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+			}
+			/* count stuff inserted by already-aborted subxacts, too */
+			deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+			/* Since ANALYZE's counts are estimates, we could have underflowed */
+			livetuples = Max(livetuples, 0);
+			deadtuples = Max(deadtuples, 0);
 		}
-		/* count stuff inserted by already-aborted subxacts, too */
-		deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-		/* Since ANALYZE's counts are estimates, we could have underflowed */
-		livetuples = Max(livetuples, 0);
-		deadtuples = Max(deadtuples, 0);
+
 	}
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
@@ -1999,6 +2009,7 @@ pgstat_initstats(Relation rel)
 
 	/* We only count stats for things that have storage */
 	if (!RELKIND_HAS_STORAGE(relkind))
+// relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		rel->pgstat_info = NULL;
 		return;
-- 
2.17.0

#106Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#105)
Re: Autovacuum on partitioned table (autoanalyze)

On Tue, Aug 17, 2021 at 06:30:18AM -0500, Justin Pryzby wrote:

I suggest the attached (which partially reverts the revert), to allow showing
correct data for analyze_count and last_analyzed.

Álvaro, would you comment on this?

To me this could be an open item, but someone else should make that
determination.

--
Justin

#107Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#106)
Re: Autovacuum on partitioned table (autoanalyze)

On Fri, Aug 20, 2021 at 07:55:13AM -0500, Justin Pryzby wrote:

I suggest the attached (which partially reverts the revert), to allow showing
correct data for analyze_count and last_analyzed.

Álvaro, would you comment on this?

To me this could be an open item, but someone else should make that
determination.

I added an open item until this is discussed.
| pg_stats includes partitioned tables, but always shows analyze_count=0
| Owner: Alvaro Herrera

Possible solutions, in decreasing order of my own preference:

- partially revert the revert, as proposed, to have "analyze_count" and
"last_analyzed" work properly for partitioned tables. This doesn't suffer
from any of the problems that led to the revert, does it ?

- Update the .c code to return analyze_count=NULL for partitioned tables.

- Update the catalog definition to exclude partitioned tables, again.
Requires a catversion bump.

- Document that analyze_count=0 for partitioned tables. That seems to just
document a misbehavior.

--
Justin

#108Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Justin Pryzby (#105)
Re: Autovacuum on partitioned table (autoanalyze)

On 2021-Aug-17, Justin Pryzby wrote:

I suggest the attached (which partially reverts the revert), to allow showing
correct data for analyze_count and last_analyzed.

Yeah, that makes sense and my keeping of the pg_stat_all_tables entries
seems pretty useless without this change. I have pushed a slightly
modified version of this to 14 and master.

Arguably these should be reported as null in v14 for partitioned tables, since
they're not "known to be zero", but rather "currently unpopulated".

n_mod_since_analyze | 0
n_ins_since_vacuum | 0

I don't disagree, but it's not easy to implement this at present. I
think almost all counters should be nulls for partitioned tables. For
some of them one could make a case that it'd be more convenient to
propagate numbers up from partitions.
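
One way to read the two options in the paragraph above (NULL counters versus numbers propagated up from partitions) is sketched below in standalone C. This is not code from any patch in this thread: `NullableCounter` and `counter_for_rel` are hypothetical names, and the only PostgreSQL facts assumed are the `relkind` letters `'r'` for an ordinary table and `'p'` for a partitioned table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value type: a counter that may be "unpopulated" (NULL). */
typedef struct
{
	bool		isnull;
	int64_t		value;
} NullableCounter;

/*
 * Decide what to report for one counter of one relation.
 *
 * Ordinary tables report their own value.  A partitioned table has no
 * storage of its own, so it reports either NULL or, when "propagate" is
 * set, the sum of its partitions' values.
 */
static NullableCounter
counter_for_rel(char relkind, int64_t own_value,
				const int64_t *partition_values, int npartitions,
				bool propagate)
{
	NullableCounter result = {false, own_value};

	assert(npartitions >= 0);

	if (relkind != 'p')
		return result;			/* has storage: own counter is meaningful */

	if (!propagate)
	{
		result.isnull = true;	/* "currently unpopulated", not "zero" */
		result.value = 0;
		return result;
	}

	/* Propagate numbers up from the partitions. */
	result.value = 0;
	for (int i = 0; i < npartitions; i++)
		result.value += partition_values[i];
	return result;
}
```

With this shape, a consumer can distinguish "known to be zero" (`isnull` false, `value` 0) from "not tracked for this relation" (`isnull` true), which is the distinction raised earlier in the thread.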

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/