Collecting statistics about contents of JSONB columns
Hi,
One of the complaints I sometimes hear from users and customers using
Postgres to store JSON documents (as JSONB type, of course) is that the
selectivity estimates are often pretty poor.
Currently we only really have MCVs and histograms of whole documents,
and we can deduce some stats from that. But that is somewhat bogus,
because there are only ~100 documents in either the MCV or histogram
(with the default statistics target). Moreover, we discard all
"oversized" values (over 1kB) before even calculating those stats, which
makes them even less representative.
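To make that concrete, here's a minimal sketch (the table and data are
made up for illustration) of what the planner has to work with today:

    -- The estimate for @> can only use whole-document MCV/histogram
    -- entries plus a default selectivity - there are no stats for the
    -- individual key "a".
    CREATE TABLE docs (doc jsonb);
    INSERT INTO docs
      SELECT jsonb_build_object('a', i % 10)
      FROM generate_series(1, 100000) i;
    ANALYZE docs;
    EXPLAIN SELECT * FROM docs WHERE doc @> '{"a": 1}';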
A couple weeks ago I started playing with this, and I experimented with
improving extended statistics in this direction. After a while I noticed
a forgotten development branch from 2016 which tried to do this by
adding a custom typanalyze function, which seemed like a more natural
idea (because it's really statistics for a single column).
But then I went to pgconf NYC in early December, and I spoke to Oleg
about various JSON-related things. He mentioned they'd been working on
this topic some time ago too, but did not have time to pursue it, and
he pointed me to a branch [1] developed by Nikita Glukhov.
I like Nikita's branch because it solves a couple of architectural
issues I'd been aware of but had only solved in a rather hackish way.
I had a discussion with Nikita about his approach and what we can do to
move it forward. He's focusing on other JSON stuff, but he's OK with me
taking over and moving it forward. So here we go ...
Nikita rebased his branch recently, and I've kept improving it in
various ways (mostly a lot of comments and docs, plus some minor fixes
and tweaks). I've pushed my version with a couple of extra commits in
[2], but you can ignore that unless you want to see what I
added/changed.
Attached are a couple of patches adding the main part of the feature.
There are a couple more commits in the github repositories adding more
advanced features - I'll briefly explain those later, but I'm not
including them here because they're optional and it'd be distracting.
There are 6 patches in the series, but the magic mostly happens in parts
0001 and 0006. The other parts are mostly just adding infrastructure,
which may be a sizeable amount of code, but the changes are fairly
simple and obvious. So let's focus on 0001 and 0006.
To add JSON statistics we need to do two basic things - we need to build
the statistics and then we need to allow using them while estimating
conditions.
1) building stats
Let's talk about building the stats first. The patch does one of the
things I experimented with - 0006 adds a jsonb_typanalyze function, and
it associates it with the data type. The function extracts paths and
values from the JSONB document, builds the statistics, and then stores
the result in pg_statistic as a new stakind.
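For illustration, assuming the patched jsonb_typanalyze has run on the
hypothetical docs table from above, the stored stats could be peeked at
roughly like this (pg_statistic is superuser-only, and which stakind
slot is used is up to the patch):

    SELECT s.stakind1, s.stakind2, s.stavalues1
    FROM pg_statistic s
    JOIN pg_class c ON c.oid = s.starelid
    JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = s.staattnum
    WHERE c.relname = 'docs' AND a.attname = 'doc';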
I've been planning to store the stats in pg_statistic too, but I've been
considering using a custom data type. The patch does something far more
elegant - it simply uses stavalues to store an array of JSONB documents,
each describing stats for one path extracted from the sampled documents.
One (very simple) element of the array might look like this:
{"freq": 1,
"json": {
"mcv": {
"values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
"numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
"width": 19,
"distinct": 10,
"nullfrac": 0,
"correlation": 0.10449},
"path": "$.\"a\"",
"freq_null": 0, "freq_array": 0, "freq_object": 0,
"freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}
In this case there's only an MCV list (represented by two arrays, just
like in pg_statistic), but there might be another part with a histogram.
There are also the other fields we'd expect to see in pg_statistic.
In principle, we need a pg_statistic entry for each path we extract
from the JSON documents and deem interesting enough for estimation.
There are probably other ways to serialize/represent this, but I find
using JSONB for it pretty convenient, because we need to deal with a mix
of data types (for the same path) and other JSON-specific stuff. Storing
that in Postgres arrays would be problematic.
I'm sure there are plenty of open questions - for example I think we'll need
some logic to decide which paths to keep, otherwise the statistics can
get quite big, if we're dealing with large / variable documents. We're
already doing something similar for MCV lists.
One of Nikita's patches not included in this thread allows "selective"
statistics, where you can define in advance a "filter" restricting which
parts are considered interesting by ANALYZE. That's interesting, but I
think we need some simple MCV-like heuristics first anyway.
Another open question is how deep the stats should be. Imagine documents
like this:
{"a" : {"b" : {"c" : {"d" : ...}}}}
The current patch builds stats for all possible paths:
"a"
"a.b"
"a.b.c"
"a.b.c.d"
and of course many of the paths will often have JSONB documents as
values, not simple scalar values. I wonder if we should limit the depth
somehow, and maybe build stats only for scalar values.
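Just to illustrate what "all possible paths" means, here's a sketch
enumerating the key paths of a single document in plain SQL (the
typanalyze code does the equivalent in C across the whole sample):

    WITH RECURSIVE paths(path, value) AS (
        SELECT key, value
        FROM jsonb_each('{"a": {"b": {"c": {"d": 1}}}}'::jsonb)
      UNION ALL
        SELECT p.path || '.' || e.key, e.value
        FROM paths p
        CROSS JOIN LATERAL jsonb_each(
            CASE WHEN jsonb_typeof(p.value) = 'object'
                 THEN p.value ELSE '{}'::jsonb END) e
    )
    SELECT path FROM paths;  -- a, a.b, a.b.c, a.b.c.d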
2) applying the statistics
One of the problems is how to actually use the statistics. For the @>
operator it's simple enough, because it's (jsonb @> jsonb), so we have
direct access to the stats. But often the conditions look like this:
jsonb_column ->> 'key' = 'value'
so the condition is actually on an expression, not on the JSONB column
directly. My solutions were pretty ugly hacks, but Nikita had a neat
idea - we can define a custom procedure for each operator, which is
responsible for "calculating" the statistics for the expression.
So in this case "->>" will have such an "oprstat" procedure, which
fetches stats for the JSONB column and extracts stats for the "key"
path. And then we can use that for estimation of the (text = text)
condition.
This is what 0001 does, pretty much. We simply look for expression stats
provided by an index or extended statistics, and then - if oprstat is
defined for the operator - we try to derive the stats.
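For contrast, the only way to get such expression stats today is to set
them up manually, e.g. with an expression index (or CREATE STATISTICS on
the expression in PG 14+), which makes ANALYZE collect stats for that
one expression - roughly what oprstat would derive automatically,
without any setup (again using the made-up docs table):

    CREATE INDEX docs_a_idx ON docs ((doc ->> 'a'));
    ANALYZE docs;
    EXPLAIN SELECT * FROM docs WHERE doc ->> 'a' = '1';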
This opens other interesting opportunities for the future - one of the
parts adds oprstat for basic arithmetic operators, which allows deducing
statistics for expressions like (a+10) from statistics on column (a).
Which seems like a neat feature on its own, but it also interacts with
the extended statistics in somewhat non-obvious ways (especially when
estimating GROUP BY cardinalities).
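A sketch of what that buys us - without derived stats, a clause on
(a + 10) falls back to a default inequality selectivity (about 33%, I
believe), even though the per-column stats on a determine it exactly:

    CREATE TABLE t (a int);
    INSERT INTO t SELECT i % 100 FROM generate_series(1, 100000) i;
    ANALYZE t;
    -- With oprstat on +, the stats for "a" could be shifted by 10 and
    -- used directly to estimate this.
    EXPLAIN SELECT * FROM t WHERE (a + 10) < 20;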
Of course, there's a limit of what we can reasonably estimate - for
example, there may be statistical dependencies between paths, and this
patch does not even attempt to deal with that. In a way, this is similar
to correlation between columns, except that here we have a dynamic set
of columns, which makes it much much harder. We'd need something like
extended stats on steroids, pretty much.
I'm sure I've forgotten various important bits - many of them are
mentioned or explained in comments, but I'm sure others are not. And I'd
bet there are things I forgot about entirely or got wrong. So feel free
to ask.
In any case, I think this seems like a good first step to improve our
estimates for JSONB columns.
regards
[1]: https://github.com/postgrespro/postgres/tree/jsonb_stats
[2]: https://github.com/tvondra/postgres/tree/jsonb_stats_rework
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Add-pg_operator.oprstat-for-derived-operato-20211230.patch
0002-Add-stats_form_tuple-20211230.patch
0003-Add-symbolic-names-for-some-jsonb-operators-20211230.patch
0004-Add-helper-jsonb-functions-and-macros-20211230.patch
0005-Export-scalarineqsel-20211230.patch
0006-Add-jsonb-statistics-20211230.patch
On Fri, Dec 31, 2021 at 2:07 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
[...]
Hi,
For patch 1:
+ List *statisticsName = NIL; /* optional stats estimat. procedure */
I think if the variable is named estimatorName (or something similar), it
would be easier for people to grasp its purpose.
+ /* XXX perhaps full "statistics" wording would be better */
+ else if (strcmp(defel->defname, "stats") == 0)
I would recommend (stats sounds too general):
+ else if (strcmp(defel->defname, "statsestimator") == 0)
+ statisticsOid = ValidateStatisticsEstimator(statisticsName);
statisticsOid -> statsEstimatorOid
For get_oprstat():
+ }
+ else
+ return (RegProcedure) InvalidOid;
The else keyword is not needed (considering the return statement in the if block).
For patch 06:
+ /* FIXME Could be before the memset, I guess? Checking
vardata->statsTuple. */
+ if (!data->statsTuple)
+ return false;
I would agree the check can be lifted above the memset call.
+ * XXX This does not really extract any stats, it merely allocates the
struct?
+ */
+static JsonPathStats
+jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)
As the comment says, I think allocJsonPathStats() would be a better name
for the func.
+ * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
+ * jsonPathStatsGetLengthStats does?
I think `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check
should be added for jsonPathStatsGetArrayLengthStats().
To be continued ...
On Sat, Jan 1, 2022 at 7:33 AM Zhihong Yu <zyu@yugabyte.com> wrote:
[...]
Hi,
+static JsonPathStats
+jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)
Stats appears twice in the method name. I think findJsonPathStats() should
suffice.
It should check `if (jsdata->nullfrac >= 1.0)` as jsonStatsGetPathStatsStr
does.
+JsonPathStats
+jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int
subpathlen)
This func can be static, right ?
I think findJsonPathStatsWithPrefix() would be a better name for the func.
+ * XXX Doesn't this need ecape_json too?
+ */
+static void
+jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
+{
+ char *tmpentry = pnstrdup(entry, len);
+ jsonPathAppendEntry(path, tmpentry);
escape_json() is called within jsonPathAppendEntry(). The XXX comment can
be dropped.
+jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)
It seems getJsonSelectivityWithArrayIndex() would be a better name.
+ sel = scalarineqsel(NULL, operator,
+ operator == JsonbGtOperator ||
+ operator == JsonbGeOperator,
+ operator == JsonbLeOperator ||
+ operator == JsonbGeOperator,
Looking at the comment for scalarineqsel():
* scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
*
* This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
* The isgt and iseq flags distinguish which of the four cases apply.
It seems JsonbLtOperator doesn't appear in the call, can I ask why ?
Cheers
On Sat, Jan 1, 2022 at 11:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
0006-Add-jsonb-statistics-20211230.patch
Hi Tomas,
-CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
+CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
https://cirrus-ci.com/task/6405547984420864
It looks like there is a Unicode BOM sequence in there, which is
upsetting our Windows testing but not the 3 Unixes (not sure why).
Probably added by an editor.
On Fri, 31 Dec 2021 at 22:07, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
The patch does something far more
elegant - it simply uses stavalues to store an array of JSONB documents,
each describing stats for one path extracted from the sampled documents.
Sounds good.
[...]
One of Nikita's patches not included in this thread allows "selective"
statistics, where you can define in advance a "filter" restricting which
parts are considered interesting by ANALYZE. That's interesting, but I
think we need some simple MCV-like heuristics first anyway.
[...]
The user interface for designing filters sounds hard, so I'd hope we
can ignore that for now.
--
Simon Riggs http://www.EnterpriseDB.com/
On 1/1/22 16:33, Zhihong Yu wrote:
Hi,
For patch 1:
+ List *statisticsName = NIL; /* optional stats estimat. procedure */
I think if the variable is named estimatorName (or something similar),
it would be easier for people to grasp its purpose.
I agree "statisticsName" might be too generic or confusing, but I'm not
sure "estimator" is an improvement. Because this is not an "estimator"
(in the sense of estimating selectivity). It "transforms" statistics to
match the expression.
+ /* XXX perhaps full "statistics" wording would be better */
+ else if (strcmp(defel->defname, "stats") == 0)
I would recommend (stats sounds too general):
+ else if (strcmp(defel->defname, "statsestimator") == 0)
+ statisticsOid = ValidateStatisticsEstimator(statisticsName);
statisticsOid -> statsEstimatorOid
Same issue with the "estimator" bit :-(
For get_oprstat():
+ }
+ else
+ return (RegProcedure) InvalidOid;
keyword else is not needed (considering the return statement in if block).
True, but this is how the other get_ functions in lsyscache.c do it.
Like get_oprjoin().
For patch 06:
+ /* FIXME Could be before the memset, I guess? Checking vardata->statsTuple. */
+ if (!data->statsTuple)
+ return false;
I would agree the check can be lifted above the memset call.
OK.
+ * XXX This does not really extract any stats, it merely allocates the struct?
+ */
+static JsonPathStats
+jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)
As the comment says, I think allocJsonPathStats() would be a better name
for the func.
+ * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
+ * jsonPathStatsGetLengthStats does?
I think `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check
should be added for jsonPathStatsGetArrayLengthStats().
To be continued ...
OK. I'll see if Nikita has some ideas about the naming changes.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/1/22 22:16, Zhihong Yu wrote:
Hi,
+static JsonPathStats
+jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)
Stats appears twice in the method name. I think findJsonPathStats()
should suffice.
It should check `if (jsdata->nullfrac >= 1.0)` as jsonStatsGetPathStatsStr
does.
+JsonPathStats
+jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int subpathlen)
This func can be static, right ?
I think findJsonPathStatsWithPrefix() would be a better name for the func.
+ * XXX Doesn't this need ecape_json too?
+ */
+static void
+jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
+{
+ char *tmpentry = pnstrdup(entry, len);
+ jsonPathAppendEntry(path, tmpentry);
escape_json() is called within jsonPathAppendEntry(). The XXX comment can
be dropped.
+jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)
It seems getJsonSelectivityWithArrayIndex() would be a better name.
Thanks. I'll think about the naming changes.
+ sel = scalarineqsel(NULL, operator,
+                     operator == JsonbGtOperator ||
+                     operator == JsonbGeOperator,
+                     operator == JsonbLeOperator ||
+                     operator == JsonbGeOperator,
Looking at the comment for scalarineqsel():
* scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
*
* This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
* The isgt and iseq flags distinguish which of the four cases apply.
It seems JsonbLtOperator doesn't appear in the call, can I ask why ?
Because the scalarineqsel signature is this
scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
Oid collation,
VariableStatData *vardata, Datum constval,
Oid consttype)
so
/* is it greater or greater-or-equal? */
isgt = operator == JsonbGtOperator ||
       operator == JsonbGeOperator;

/* does it include equality (i.e. is it <= or >=)? */
iseq = operator == JsonbLeOperator ||
       operator == JsonbGeOperator;
So I think this is correct. A comment explaining this would be nice.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/5/22 21:13, Thomas Munro wrote:
On Sat, Jan 1, 2022 at 11:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
0006-Add-jsonb-statistics-20211230.patch
Hi Tomas,
-CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
+CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
https://cirrus-ci.com/task/6405547984420864
It looks like there is a Unicode BOM sequence in there, which is
upsetting our Windows testing but not the 3 Unixes (not sure why).
Probably added by an editor.
Thanks, fixed along with some whitespace issues.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Add-pg_operator.oprstat-for-derived-operato-20220106.patch
0002-Add-stats_form_tuple-20220106.patch
0003-Add-symbolic-names-for-some-jsonb-operators-20220106.patch
0004-Add-helper-jsonb-functions-and-macros-20220106.patch
0005-Export-scalarineqsel-20220106.patch
0006-Add-jsonb-statistics-20220106.patch
On 1/5/22 21:22, Simon Riggs wrote:
On Fri, 31 Dec 2021 at 22:07, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
[...]
The user interface for designing filters sounds hard, so I'd hope we
can ignore that for now.
Not sure I understand. I wasn't suggesting any user-defined filtering,
but something done by default, similarly to what we do for regular MCV
lists, based on frequency. We'd include frequent paths while excluding
rare ones.
So no need for a user interface.
That might not work for documents with a stable schema and a lot of
top-level paths, because all the top-level paths have frequency 1.0. But
for documents with a dynamic schema (different documents having
different schemas/paths) it might help.
Similarly for the non-scalar values - I don't think we can really keep
regular statistics on such values (for the same reason it's not enough
for whole JSONB columns), so why build/store them at all?
Nikita did implement a way to specify custom filters using jsonpath, but
I did not include that in this patch series. And questions regarding
the interface were one of the reasons.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi!
I am glad that you found my very old patch interesting and started to
work on it. We failed to post it in 2016 mostly because we were not
satisfied with JSONB storage. Also we decided to wait for completion
of work on extended statistics as we thought that it could help us.
But in early 2017 we switched to SQL/JSON and forgot about this patch.
I think a custom datatype is necessary for better performance. With
plain JSONB we need to do a lot of work to extract path stats:
- iterate through MCV/Histogram JSONB arrays
- cast numeric values to float, string to text etc.
- build PG arrays from extracted datums
- form pg_statistic tuple.
With a custom data type we could store pg_statistic tuple unmodified
and use it without any overhead. But then we'd need to modify
VariableStatData and several functions a bit, to pass additional
nullfrac corrections.
Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
would be enough.
Also there is an idea to store per-path stats separately in
pg_statistic_ext rows, using expressions like (jb #> '{foo,bar}') as
stxexprs (a sketch follows the list below). This could also help the
user select which paths to analyze, simply by using some sort of
CREATE STATISTICS. But it is really unclear how to:
* store pg_statistic_ext rows from typanalyze
* form path expressions for array elements (maybe add new jsonpath
operator)
* match various -> operator chains to stxexprs
* jsonpath-like languages need simple API for searching by stxexprs
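For reference, a fixed-path variant of this already works from SQL today
(the open questions above are about doing it automatically from
typanalyze); a hypothetical example, for a jsonb column "doc" in a table
"docs":

    CREATE STATISTICS docs_foobar_stats ON (doc #> '{foo,bar}')
      FROM docs;
    ANALYZE docs;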
Per-path statistics should only be collected for scalars. This can be
enabled by the flag JsonAnalyzeContext.scalarsOnly. But there is a
problem: the computed MCVs and histograms will be broken, and we will
not be able to use them for queries like (jb > const) in the general
case. Also, we will not be able to use them internally in
scalarineqsel() and var_eq_const() (see jsonSelectivity()). Instead, we
would have to implement our own estimator functions for the JSONB
comparison operators that correctly use our hacked MCVs and histograms
(and maybe not all cases will be supported; for example, comparison to
scalars only).
It's possible to replace objects and arrays with empty ones when
scalarsOnly is set to keep correct frequencies of non-scalars.
But there is an old bug in JSONB comparison: empty arrays are placed
before other values in the JSONB sort order, although they should go
after all scalars. So we also need to distinguish empty and non-empty
arrays here.
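(A quick way to probe the sort order issue described above - I'm
deliberately not asserting the output here:)

    SELECT v
    FROM (VALUES ('[]'::jsonb), ('null'), ('1'), ('"a"'), ('[1]')) t(v)
    ORDER BY v;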
I tried to fix a major part of the places marked as XXX and FIXME; the
new version of the patches is attached. There are a lot of changes, and
you can see them in step-by-step form in the corresponding branch
jsonb_stats_20220122 in our GitHub repo [1].
Below is the explanation of fixed XXXs:
Patch 0001
src/backend/commands/operatorcmds.c:
XXX Maybe not the right name, because "estimator" implies we're
calculating selectivity. But we're actually deriving statistics for
an expression.
Renamed to "Derivator".
XXX perhaps full "statistics" wording would be better
Renamed to "statistics".
src/backend/utils/adt/selfuncs.c:
examine_operator_expression():
XXX Not sure why this returns bool - we ignore the return value
anyway. We might as well return the calculated vardata (or NULL).
Oprstat was changed to return void.
XXX Not sure what to do about recursion - there can be another OpExpr
in one of the arguments, and we should call this recursively from the
oprstat procedure. But that's not possible, because it's marked as
static.
Oprstat call chain: get_restriction_variable() => examine_variable() =>
examine_operator_expression().
The main thing here is that an OpExpr with oprstat acts like an ordinary Var.
examine_variable():
XXX Shouldn't this put more restrictions on the OpExpr? E.g. that
one of the arguments has to be a Const or something?
This is simply a responsibility of oprstat.
Patch 0004
XXX Not sure we want to add these functions to jsonb.h, which is the
public API. Maybe it belongs rather to jsonb_typanalyze.c or
elsewhere, closer to how it's used?
Maybe it needs to be moved to the new file jsonb_utils.h. I think these
functions become very useful when we start to build JSONBs with a
predefined structure.
pushJsonbValue():
XXX I'm not quite sure why we actually do this? Why do we need to
change how JsonbValue is converted to Jsonb for the statistics patch?
Scalars in JSONB are encoded as one-element pseudo-array containers.
So when we insert binary JsonbValues that were initialized directly
from JSONB datums into another non-empty JSONB in pushJsonbValue(),
all values except scalars are inserted as expected, but scalars become
[scalar]. So we need to extract scalar values either in the caller
functions or in pushJsonbValue(). I think auto-extraction in
pushJsonbValue() is better. This case simply was not used before the
introduction of JSONB stats.
Patch 0006
src/backend/utils/adt/jsonb_selfuncs.c
jsonPathStatsGetSpecialStats()
XXX This does not really extract any stats, it merely allocates the
struct?
Renamed with "Alloc" suffix.
jsonPathStatsGetArrayLengthStats()
XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to
what jsonPathStatsGetLengthStats does?
"length" stats were collected inside parent paths, but "array_length"
and "avg_array_length" stats were collected inside child array paths.
This resulted in inconsistencies in TypeFreq checks.
I have removed "length" stats, moved "array_length" and
"avg_array_length" to the parent path, and added separate
"object_length" stats. TypeFreq checks become consistent.
XXX Seems to do essentially what jsonStatsFindPath, except that it
also considers jsdata->prefix. Seems fairly easy to combine those
into a single function.
I don't think it would be better to combine those functions into one
with a considerPrefix flag, because jsonStatsFindPathStats() is a kind
of low-level function which is called in only two places. In all other
places considerPrefix would be true. Also, jsonStatsFindPathStatsStr()
is exported in jsonb_selfuncs.h to give external jsonpath-like query
operators the ability to use JSON statistics.
jsonPathStatsCompare()
XXX Seems a bit convoluted to first cast it to Datum, then Jsonb ...
Datum const *pdatum = pv2;
Jsonb *jsonb = DatumGetJsonbP(*pdatum);
The problem with simply using 'jsonb = *(Jsonb **) pv2' is that
DatumGetJsonbP() may deTOAST datums.
XXX Not sure about this? Does empty path mean global stats?
Empty "path" is simply a sign of invalid stats data.
jsonStatsConvertArray()
FIXME Does this actually work on all 32/64-bit systems? What if typid
is FLOAT8OID or something? Should look at TypeCache instead, probably.
Used get_typlenbyvalalign() instead of hardcoded values.
jsonb_stats()
XXX It might be useful to allow recursion, i.e.
get_restriction_variable might derive statistics too.
I don't think it does that now, right?
get_restriction_variable() already can derive statistics: it calls
examine_variable() on both of its operands, and examine_variable()
calls examine_operator_expression().
XXX Could we also get varonleft=false in useful cases?
All supported JSONB operators have signature jsonb OP arg.
If varonleft = false, then we would need to derive stats for an
expression like
'{"foo": "bar"}' -> text_column
having stats for text_column. That is possible to implement too, but it
is not related to JSONB stats and should be done in a separate patch.
jsonSelectivityContains():
XXX This really needs more comments explaining the logic.
I have refactored this function and added comments.
jsonGetFloat4():
XXX Not sure assert is a sufficient protection against different
types of JSONB values to be passed in.
I have refactored this function by passing a default value for the
non-numeric JSONB case.
jsonPathAppendEntryWithLen()
XXX Doesn't this need ecape_json too?
Comment is removed, because jsonPathAppendEntry() is called inside.
jsonPathStatsGetTypeFreq()
FIXME This is really hard to read/understand, with two nested ternary
operators.
I have refactored this place.
XXX Seems more like an error, no? Why ignore it?
It's not an error, it's a request for a non-numeric type. Length values
are always numeric, and the frequency of non-numeric values is 0.
A possible case where it could happen is the jsonpath
'$.size() == "foo"'. The estimator for operator == will check the
frequency of strings in .size(), and it will be 0.
jsonPathStatsFormTuple()
FIXME What does this mean?
There is no need to transform ordinary root path stats tuple, it can be
simply copied.
jsonb_typanalyze.c
XXX We need entry+lenth because JSON path elements may contain null
bytes, I guess?
'entry' may not be zero-terminated when it points into JSONB keys, so
the 'len' field is necessary. 'len' is also used for faster entry
comparison and to distinguish array entries ('len' == -1).
XXX Sould be JsonPathEntryMatch as it deals with JsonPathEntry nodes
not whole paths, no?
XXX Again, maybe JsonPathEntryHash would be a better name?
Functions renamed using JsonPathEntry prefix, JsonPath typedef removed.
JsonPathMatch()
XXX Seems a bit silly to return int, when the return statement only
really returns bool (because of how it compares paths). It's not really
a comparator for sorting, for example.
This function is an implementation of HashCompareFunc, and it needs to
return int.
jsonAnalyzeJson()
XXX The name seems a bit weird, with the two json bits.
Renamed to jsonAnalyzeCollectPaths(). The two similar functions were
renamed too.
jsonAnalyzeJson():
XXX The mix of break/return statements in this block is really
confusing.
I have refactored this place using only breaks.
XXX not sure why we're doing this?
Manual recursion into containers, by creating a child iterator together
with the skipNested=true flag, is used to give jsonAnalyzeJsonValue()
the ability to access jbvBinary containers.
compute_json_stats()
XXX Not sure what the first branch is doing (or supposed to)?
XXX It's not immediately clear why this is (-1) and not simply
NULL. It crashes, so presumably it's used to tweak the behavior,
but it's not clear why/how, and it affects place that is pretty
far away, and so not obvious. We should use some sort of flag
with a descriptive name instead.
XXX If I understand correctly, we simply collect all paths first,
without accumulating any Values. And then in the next step we
process each path independently, probably to save memory (we
don't want to accumulate all values for all paths, with a lot
of duplicities).
There are two variants of stats collection:
* single-pass - collect all values for all paths
* multi-pass - collect values for only one path per pass
The first variant can consume too much memory (jsonb iteration produces
a lot of garbage etc.), but it works faster than the second.
The first 'if (false)' is used for manual selection of one of these
variants. This selection should be controlled by some user-specified
option (maybe a GUC), or the first variant could simply be removed.
jsonAnalyzeJson()'s parameter of type JsonPathAnlStats * determines
which paths we need to consider for value collection:
* NULL - collect values for all paths
* -1 - do not collect any values
* stats - collect values only for the given path
The last variant is actually unused, because we already have another
function, jsonAnalyzeJsonPath(), which is optimized for selective path
value collection (it uses object-key accessor JSONB functions instead
of full JSONB iteration). I have replaced this strange parameter with a
simple boolean flag.
XXX Could the parameters be different on other platforms?
Used get_typlenbyvalalign(JSONBOID) instead of hardcoded values.
jsonAnalyzePathValues()
XXX Do we need to initialize all slots?
I have copied here the following comment from extended_stats.c:
/*
* The fields describing the stats->stavalues[n] element types default
* to the type of the data being analyzed, but the type-specific
* typanalyze function can change them if it wants to store something
* else.
*/
XXX Not sure why we divide it by the number of json values?
We divide the counts of lengths by the total number of json values to
compute the correct nullfrac, i.e. not all input jsons have a length
(the length of a scalar json is NULL).
[1]: https://github.com/postgrespro/postgres/tree/jsonb_stats_20220122
--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
0001-Add-pg_operator.oprstat-for-derived-operator-statist.patch
0002-Add-stats_form_tuple.patch
0003-Add-symbolic-names-for-some-jsonb-operators.patch
0004-Add-helper-jsonb-functions-and-macros.patch
0005-Export-scalarineqsel.patch
0006-Add-jsonb-statistics.patch
On 1/23/22 01:24, Nikita Glukhov wrote:
Hi!
I am glad that you found my very old patch interesting and started to
work on it. We failed to post it in 2016 mostly because we were not
satisfied with JSONB storage. Also we decided to wait for completion
of work on extended statistics as we thought that it could help us.
But in early 2017 we switched to SQL/JSON and forgot about this patch.
Understood. Let's see how feasible this idea is and if we can move this
forward.
I think a custom datatype is necessary for better performance. With
plain JSONB we need to do a lot of work to extract path stats:
- iterate through MCV/Histogram JSONB arrays
- cast numeric values to float, string to text etc.
- build PG arrays from extracted datums
- form pg_statistic tuple.
With a custom data type we could store pg_statistic tuple unmodified
and use it without any overhead. But then we need modify a bit
VariableStatData and several functions to pass additional nullfrac
corrections.
I'm not against evaluating/exploring alternative storage formats, but my
feeling is the real impact on performance will be fairly low. At least I
haven't seen this as very expensive while profiling the patch so far. Of
course, I may be wrong, and it may be more significant in some cases.
Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
would be enough.
Maybe, but then you still need to store a bunch of those, right? So
either an array (likely toasted) or a 1:M table. I'm not sure it's going
to be much cheaper than JSONB.
I'd suggest we focus on what we need to store first, which seems like
the primary question, and worry about the exact storage format later.
Also there is an idea to store per-path stats separately in
pg_statistic_ext rows, using expressions like (jb #> '{foo,bar}') as
stxexprs. This could also help the user select which paths to analyze,
simply by using some sort of CREATE STATISTICS. But it is really
unclear how to:
* store pg_statistic_ext rows from typanalyze
* form path expressions for array elements (maybe add new jsonpath
operator)
* match various -> operator chains to stxexprs
* jsonpath-like languages need simple API for searching by stxexprs
Sure, you can do statistics on expressions, right? Of course, if that
expression produces a JSONB value, that's not very useful at the moment.
Maybe we could have two typanalyze functions - one for regular analyze,
one for extended statistics?
That being said, I'm not sure extended stats are a good match for this.
My feeling was we should collect these stats for all JSONB columns,
which is why I argued for putting that in pg_statistic.
Per-path statistics should only be collected for scalars. This can be
enabled by the flag JsonAnalyzeContext.scalarsOnly. But there is a
problem: the computed MCVs and histograms will be broken, and we will
not be able to use them for queries like (jb > const) in the general
case. Also, we will not be able to use them internally in
scalarineqsel() and var_eq_const() (see jsonSelectivity()). Instead, we
would have to implement our own estimator functions for the JSONB
comparison operators that correctly use our hacked MCVs and histograms
(and maybe not all cases will be supported; for example, comparison to
scalars only).
Yeah, but maybe that's an acceptable trade-off? I mean, if we can
improve the estimates for most clauses, and there's some number of
clauses that are estimated just as without stats, that's still an
improvement, right?
It's possible to replace objects and arrays with empty ones when
scalarsOnly is set to keep correct frequencies of non-scalars.
But there is an old bug in JSONB comparison: empty arrays are placed
before other values in the JSONB sort order, although they should go
after all scalars. So we also need to distinguish empty and non-empty
arrays here.
Hmmm ...
I tried to fix a major part of the places marked as XXX and FIXME; the
new version of the patches is attached. There are a lot of changes, and
you can see them in step-by-step form in the corresponding branch
jsonb_stats_20220122 in our GitHub repo [1].
Thanks! I'll go through the changes soon.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 25 Jan 2022 at 03:50, Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
On 1/23/22 01:24, Nikita Glukhov wrote:
Hi!
I am glad that you found my very old patch interesting and started to
work on it. We failed to post it in 2016 mostly because we were not
satisfied with JSONB storage. Also we decided to wait for completion
of work on extended statistics as we thought that it could help us.
But in early 2017 we switched to SQL/JSON and forgot about this patch.Understood. Let's see how feasible this idea is and if we can move this
forward.I think custom datatype is necessary for better performance. With a
plain JSONB we need to do a lot of work for extraction of path stats:
- iterate through MCV/Histogram JSONB arrays
- cast numeric values to float, string to text etc.
- build PG arrays from extracted datums
- form pg_statistic tuple.With a custom data type we could store pg_statistic tuple unmodified
and use it without any overhead. But then we need modify a bit
VariableStatData and several functions to pass additional nullfrac
corrections.I'm not against evaluating/exploring alternative storage formats, but my
feeling is the real impact on performance will be fairly low. At least I
haven't seen this as very expensive while profiling the patch so far. Of
course, I may be wrong, and it may be more significant in some cases.Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
would be enough.Maybe, but then you still need to store a bunch of those, right? So
either an array (likely toasted) or 1:M table. I'm not sure it's goiing
to be much cheaper than JSONB.I'd suggest we focus on what we need to store first, which seems like
tha primary question, and worry about the exact storage format then.Also there is an idea to store per-path separately in pg_statistic_ext
rows using expression like (jb #> '{foo,bar}') as stxexprs. This could
also help user to select which paths to analyze simply by using some
sort of CREATE STATISTICS. But it is really unclear how to:
* store pg_statistic_ext rows from typanalyze
* form path expressions for array elements (maybe add new jsonpath
operator)
* match various -> operator chains to stxexprs
* jsonpath-like languages need simple API for searching by stxexprsSure, you can do statistics on expressions, right? Of course, if that
expression produces JSONB value, that's not very useful at the moment.
Maybe we could have two typanalyze functions - one for regular analyze,
one for extended statistics?

That being said, I'm not sure extended stats are a good match for this.
My feeling was we should collect these stats for all JSONB columns,
which is why I argued for putting that in pg_statistic.

Per-path statistics should only be collected for scalars. This can be
enabled by the flag JsonAnalyzeContext.scalarsOnly. But there is a
problem: the computed MCVs and histograms will be broken and we will not
be able to use them for queries like (jb > const) in the general case.
Also we will not be able to use them internally in scalarineqsel() and
var_eq_const() (see jsonSelectivity()). Instead, we will have to implement our own
estimator functions for JSONB comparison operators that will correctly
use our hacked MCVs and histograms (and maybe not all cases will be
supported; for example, comparison to scalars only).

Yeah, but maybe that's an acceptable trade-off? I mean, if we can
improve estimates for most clauses, and there's some number of clauses
that are estimated just like without stats, that's still an improvement,
right?

It's possible to replace objects and arrays with empty ones when
scalarsOnly is set to keep correct frequencies of non-scalars.
But there is an old bug in JSONB comparison: empty arrays are placed
before other values in the JSONB sort order, although they should go
after all scalars. So we also need to distinguish empty and non-empty
arrays here.

Hmmm ...
I tried to fix most of the places marked as XXX and FIXME; the new
version of the patches is attached. There are a lot of changes, you
can see them in a step-by-step form in the corresponding branch
jsonb_stats_20220122 in our GitHub repo [1].

Thanks! I'll go through the changes soon.

Thanks, Nikita and Tomas for these patches.
For the last few days, I was trying to understand these patches, and based
on Tomas's suggestion, I was doing some performance tests.
With the attached .SQL file, I can see that analyze is taking more time
with these patches.
*Setup:*
autovacuum=off
all other settings are default.
Load the attached file with and without the patch to compare the time taken
by analyze.
*With json patches:*
postgres=# analyze test ;
ANALYZE
Time: *28464.062 ms (00:28.464)*
postgres=#
postgres=# SELECT pg_size_pretty(
pg_total_relation_size('pg_catalog.pg_statistic') );
pg_size_pretty
----------------
328 kB
(1 row)
--
*Without json patches:*
postgres=# analyze test ;
ANALYZE
*Time: 120.864* ms
postgres=# SELECT pg_size_pretty(
pg_total_relation_size('pg_catalog.pg_statistic') );
pg_size_pretty
----------------
272 kB
I haven't found the root cause of this, but I feel that this time is due
to looping over all the paths.
In my test data, there is a total of 951 distinct paths. While doing
the analysis, first we check all the sample rows (30000) and collect all
the distinct paths (here 951), and after that, for every single path, we
loop over all the sample rows again to collect stats for that particular
path. I feel that these loops might be taking time.
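For illustration, here is a toy, self-contained C sketch (not the patch
code; the names and constants are invented) of the loop shape described
above, which is what makes the per-path pass cost O(paths * rows):

#include <stdio.h>

#define NROWS  30000                /* sample rows */
#define NPATHS 951                  /* distinct paths in the test data */

static void
collect_stats_for_path(int path, int row)
{
    (void) path;                    /* stand-in for the real per-path work */
    (void) row;
}

int
main(void)
{
    long    walks = 0;

    for (int p = 0; p < NPATHS; p++)        /* one pass per extracted path */
        for (int r = 0; r < NROWS; r++)     /* ... re-scanning every sample row */
        {
            collect_stats_for_path(p, r);
            walks++;
        }

    printf("%ld document walks\n", walks);  /* 951 * 30000 = 28530000 */
    return 0;
}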
I will run perf and will try to find out the root cause of this.
Apart from this performance issue, I haven't found any crashes or issues.
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On Thu, 6 Jan 2022 at 14:56, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Not sure I understand. I wasn't suggesting any user-defined filtering,
but something done by default, similarly to what we do for regular MCV
lists, based on frequency. We'd include frequent paths while excluding
rare ones. So no need for a user interface.
Not sure, but I think he was agreeing with you: that we should figure
out the baseline behaviour and get it as useful as possible first, then
later look at adding some way to customize it. I agree -- I don't
think the user interface will be hard technically but I think it will
require some good ideas and there could be lots of bikeshedding. And a
lot of users will never even use it anyways so it's important to get
the defaults as useful as possible.
Similarly for the non-scalar values - I don't think we can really keep
regular statistics on such values (for the same reason why it's not
enough for whole JSONB columns), so why build/store that anyway.
For a default behaviour I wonder if it wouldn't be better to just
flatten and extract all the scalars. So if there's no matching path
then at least we have some way to estimate how often a scalar appears
anywhere in the json document.
That amounts to assuming the user knows the right path to find a given
scalar and there isn't a lot of overlap between keys. So it would at
least do something useful if you have something like {gender: female,
name: {first: nancy, last: reagan}, state: california, country: usa}.
It might get things slightly wrong if you have some people named
"georgia" or have names that can be first or last names.
But it would generally be doing something more or less useful as long
as they look for "usa" in the country field and "male" in the gender
field. If they looked for "male" in $.name.first path it would give
bad estimates but assuming they know their data structure they won't
be doing that.
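For illustration, here is a toy, self-contained C sketch of that fallback
(the Node type and the tiny document are invented, not from the patch):
walk the document and count every scalar, ignoring its path.

#include <stdio.h>

typedef struct Node
{
    const char  *scalar;        /* non-NULL for leaf (scalar) nodes */
    struct Node *children[3];   /* toy fan-out, unused slots are NULL */
} Node;

static void
collect_scalars(const Node *node)
{
    if (node == NULL)
        return;
    if (node->scalar != NULL)
        printf("scalar: %s\n", node->scalar);   /* feed MCV counting here */
    for (int i = 0; i < 3; i++)
        collect_scalars(node->children[i]);
}

int
main(void)
{
    /* {name: {first: "nancy"}, country: "usa"} as a toy tree */
    Node    first   = {"nancy", {NULL, NULL, NULL}};
    Node    name    = {NULL, {&first, NULL, NULL}};
    Node    country = {"usa", {NULL, NULL, NULL}};
    Node    doc     = {NULL, {&name, &country, NULL}};

    collect_scalars(&doc);      /* prints "nancy" and "usa", paths ignored */
    return 0;
}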
--
greg
On 1/25/22 17:56, Mahendra Singh Thalor wrote:
...
For the last few days, I was trying to understand these patches, and
based on Tomas's suggestion, I was doing some performance tests.

With the attached .SQL file, I can see that analyze is taking more time
with these patches.

*Setup:*
autovacuum=off
all other settings are default.

Load the attached file with and without the patch to compare the time
taken by analyze.

*With json patches:*
postgres=# analyze test ;
ANALYZE
Time: *28464.062 ms (00:28.464)*
postgres=#
postgres=# SELECT pg_size_pretty(
pg_total_relation_size('pg_catalog.pg_statistic') );
pg_size_pretty
----------------
328 kB
(1 row)
--

*Without json patches:*
postgres=# analyze test ;
ANALYZE
*Time: 120.864* ms
postgres=# SELECT pg_size_pretty(
pg_total_relation_size('pg_catalog.pg_statistic') );
pg_size_pretty
----------------
272 kB

I haven't found the root cause of this, but I feel that this time is due
to looping over all the paths.
In my test data, there is a total of 951 distinct paths. While doing
the analysis, first we check all the sample rows (30000) and collect all
the distinct paths (here 951), and after that, for every single path,
we loop over all the sample rows again to collect stats for that
particular path. I feel that these loops might be taking time.

I will run perf and will try to find out the root cause of this.
Thanks, I've been doing some performance tests too, and you're right it
takes quite a bit of time. I wanted to compare how the timing changes
with complexity of the JSON documents (nesting, number of keys, ...) so
I wrote a simple python script to generate random JSON documents with
different parameters - see the attached json-generate.py script.
It's a bit crude, but it generates synthetic documents with a chosen
number of levels, keys per level, distinct key values, etc. The
generated documents are loaded directly into a "json_test" database,
into a table "test_table" with a single jsonb column called "v".
Tweaking this to connect to a different database, or just dump the
generated documents to a file, should be trivial.
The attached bash script runs the data generator for a couple of
combinations, and then measures how long it takes to analyze the table,
how large the statistics are (in a rather crude way), etc.
The results look like this (the last two columns are analyze duration in
milliseconds, for master and with the patch):
levels keys unique keys paths master patched
----------------------------------------------------------
1 1 1 2 153 122
1 1 1000 1001 134 1590
1 8 8 9 157 367
1 8 1000 1001 155 1838
1 64 64 65 189 2298
1 64 1000 1001 46 9322
2 1 1 3 237 197
2 1 1000 30580 152 46468
2 8 8 73 245 1780
So yeah, it's significantly slower - in most cases not as much as you
observed, but an order of magnitude slower than master. For size of the
statistics, it's similar:
levels keys unique keys paths table size master patched
------------------------------------------------------------------
1 1 1 2 1843200 16360 24325
1 1 1000 1001 1843200 16819 1425400
1 8 8 9 4710400 28948 88837
1 8 1000 1001 6504448 42694 3915802
1 64 64 65 30154752 209713 689842
1 64 1000 1001 49086464 1093 7755214
2 1 1 3 2572288 24883 48727
2 1 1000 30580 2572288 11422 26396365
2 8 8 73 23068672 164785 862258
This is measured by dumping pg_statistic for the column, so in the
database it might be compressed etc. It's larger, but that's somewhat
expected because we simply store more detailed stats. The size grows
with the number of paths extracted - which is expected, of course.
If you're wondering why this doesn't show data for additional combinations
(e.g. 2 levels, 8 keys and 1000 distinct key values), that's the bad
news - that takes ages (multiple minutes) and then it gets killed by the OOM
killer because it eats gigabytes of memory.
I agree the slowness is largely due to extracting all paths and then
processing them one by one - which means we have to loop over the tuples
over and over. In this case there's about 850k distinct paths extracted,
so we do ~850k loops over 30k tuples. That's gotta take time.
I don't know what exactly to do about this, but I already mentioned we
may need to pick a subset of paths to keep, similarly to how we pick
items for MCV. I mean, if we only saw a path once or twice, it's
unlikely to be interesting enough to build stats for it. I haven't
tried, but I'd bet most of the 850k paths might be ignored like this.
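As a rough, self-contained C illustration of such a frequency cut-off
(the threshold, paths and counts are made up; this is not the patch code):

#include <stdio.h>

typedef struct
{
    const char *path;
    int         count;      /* occurrences among the sampled documents */
} PathCount;

#define MIN_COUNT 2         /* invented threshold: ignore one-off paths */

int
main(void)
{
    PathCount   paths[] = {
        {"$.country", 29500},
        {"$.name.first", 29500},
        {"$.tmp.debug", 1},
    };
    int         npaths = sizeof(paths) / sizeof(paths[0]);

    for (int i = 0; i < npaths; i++)
    {
        if (paths[i].count >= MIN_COUNT)
            printf("build stats for %s (%d occurrences)\n",
                   paths[i].path, paths[i].count);
        else
            printf("skip rare path %s\n", paths[i].path);
    }
    return 0;
}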
The other thing we might do is make the loops more efficient. For
example, we might track which documents contain each path (by a small
bitmap or something), so that in the loop we can skip rows that don't
contain the path we're currently processing. Or something like that.
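Something along these lines, as a self-contained toy (the row/path layout
is invented; it only demonstrates the skip-rows-by-bitmap idea):

#include <stdio.h>

#define NROWS 8                                 /* toy sample size */

typedef struct
{
    const char     *path;
    unsigned char   rowmap[(NROWS + 7) / 8];    /* 1 bit per sample row */
} PathEntry;

static void
path_mark_row(PathEntry *p, int row)
{
    p->rowmap[row / 8] |= (unsigned char) (1 << (row % 8));
}

static int
path_has_row(const PathEntry *p, int row)
{
    return (p->rowmap[row / 8] >> (row % 8)) & 1;
}

int
main(void)
{
    PathEntry   p = {"$.user.name", {0}};

    /* pass 1: while extracting paths, remember the rows containing them */
    path_mark_row(&p, 1);
    path_mark_row(&p, 5);

    /* pass 2: the per-path loop visits only rows that contain the path */
    for (int row = 0; row < NROWS; row++)
        if (path_has_row(&p, row))
            printf("collect stats for %s from row %d\n", p.path, row);

    return 0;
}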
Of course, this can't eliminate all the overhead - we're building more
stats and that has a cost. In the "common" case of a stable "fixed" schema
with the same paths in all documents, we'll still need to do a loop for
each of them. So it's bound to be slower than master.
Which probably means it's a bad idea to do this for all JSONB columns,
because in many cases the extra stats are not needed so the extra
analyze time would be a waste. So I guess we'll need some way to enable
this only for selected columns ... I argued against the idea to
implement this as extended statistics in the first message, but it's a
reasonably nice way to do such stuff (expression stats are a precedent).
Apart from this performance issue, I haven't found any crashes or issues.
Well, I haven't seen any crashes either, but as I mentioned for complex
documents (2 levels, many distinct keys) the ANALYZE starts consuming a
lot of memory and may get killed by OOM. For example if you generate
documents like this
./json-generate.py 30000 2 8 1000 6 1000
and then run ANALYZE, that'll take ages and it very quickly gets into a
situation like this (generated from gdb by calling MemoryContextStats on
TopMemoryContext):
-------------------------------------------------------------------------
TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784
used
...
TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536
used: <unnamed>
Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks);
469000720 used
Analyze Column: 921177696 total in 120 blocks; 5123256 free
(238 chunks); 916054440 used
Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free
(1 chunks); 2472 used
Json Analyze Pass Context: 8192 total in 1 blocks; 7928
free (0 chunks); 264 used
JSON analyze path table: 1639706040 total in 25084 blocks;
1513640 free (33 chunks); 1638192400 used
Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
...
Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352
chunks); 3024345064 used
-------------------------------------------------------------------------
Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the "analyze
path table" context, 400MB in "analyze" and 900MB in "analyze column"
contexts. I mean, that seems a bit excessive. And it grows over time, so
after a while my laptop gives up and kills the backend.
I'm not sure if it's a memory leak (which would be fixable), or it's due
to keeping stats for all the extracted paths. I mean, in this particular
case we have 850k paths - even if stats are just 1kB per path, that's
850MB. This requires more investigation.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/4/22 03:47, Tomas Vondra wrote:
./json-generate.py 30000 2 8 1000 6 1000
Sorry, this should be (different order of parameters):
./json-generate.py 30000 2 1000 8 6 1000
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04.02.2022 05:47, Tomas Vondra wrote:
On 1/25/22 17:56, Mahendra Singh Thalor wrote:
...
For the last few days, I was trying to understand these patches, and
based on Tomas's suggestion, I was doing some performance tests.

With the attached .SQL file, I can see that analyze is taking more
time with these patches.

I haven't found the root cause of this, but I feel that this time is
due to looping over all the paths.
In my test data, there is a total of 951 distinct paths. While
doing the analysis, first we check all the sample rows (30000) and
collect all the distinct paths (here 951), and after that, for
every single path, we loop over all the sample rows again to collect
stats for that particular path. I feel that these loops might be taking
time.

Thanks, I've been doing some performance tests too, and you're right
it takes quite a bit of time.
That is absolutely not surprising; I have warned about poor performance
in cases with a large number of paths.
I agree the slowness is largely due to extracting all paths and then
processing them one by one - which means we have to loop over the
tuples over and over. In this case there's about 850k distinct paths
extracted, so we do ~850k loops over 30k tuples. That's gotta take time.

I don't know what exactly to do about this, but I already mentioned we
may need to pick a subset of paths to keep, similarly to how we pick
items for MCV. I mean, if we only saw a path once or twice, it's
unlikely to be interesting enough to build stats for it. I haven't
tried, but I'd bet most of the 850k paths might be ignored like this.

The other thing we might do is make the loops more efficient. For
example, we might track which documents contain each path (by a small
bitmap or something), so that in the loop we can skip rows that don't
contain the path we're currently processing. Or something like that.

Apart from this performance issue, I haven't found any crashes or
issues.

Well, I haven't seen any crashes either, but as I mentioned for
complex documents (2 levels, many distinct keys) the ANALYZE starts
consuming a lot of memory and may get killed by OOM. For example if
you generate documents like this

./json-generate.py 30000 2 8 1000 6 1000

and then run ANALYZE, that'll take ages and it very quickly gets into
a situation like this (generated from gdb by calling
MemoryContextStats on TopMemoryContext):

-------------------------------------------------------------------------
TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks);
66784 used
...
TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536
used
PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536
used: <unnamed>
Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks);
469000720 used
Analyze Column: 921177696 total in 120 blocks; 5123256 free
(238 chunks); 916054440 used
Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free
(1 chunks); 2472 used
Json Analyze Pass Context: 8192 total in 1 blocks; 7928
free (0 chunks); 264 used
JSON analyze path table: 1639706040 total in 25084 blocks;
1513640 free (33 chunks); 1638192400 used
Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
...
Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352
chunks); 3024345064 used
-------------------------------------------------------------------------

Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the "analyze
path table" context, 400MB in "analyze" and 900MB in "analyze column"
contexts. I mean, that seems a bit excessive. And it grows over time,
so after a while my laptop gives up and kills the backend.

I'm not sure if it's a memory leak (which would be fixable), or it's
due to keeping stats for all the extracted paths. I mean, in this
particular case we have 850k paths - even if stats are just 1kB per
path, that's 850MB. This requires more investigation.
Thank you for the tests and investigation.
I have tried to reduce memory consumption and speed up row scanning:
1. "JSON analyze path table" context contained ~1KB JsonPathAnlStats
structure per JSON path in the global hash table. I have moved
JsonPathAnlStats to the stack of compute_json_stats(), and hash
table now consumes ~70 bytes per path.
2. I have fixed the copying of the resulting JSONB stats into the
context, which reduced the size of the "Analyze Column" context.
3. I have optimized memory consumption of the single-pass algorithm by
storing only value lists in the non-temporary context. That helped to
execute the "2 64 64" test case in 30 seconds. Single-pass is a
bit faster in non-TOASTed cases, and much faster in TOASTed ones.
But it consumes much more memory and still goes to OOM in the
cases with more than ~100k paths.
4. I have implemented per-path document lists/bitmaps, which really
speed up the case "2 8 1000". The list is converted into a bitmap when
it becomes larger than the bitmap would be (see the sketch after this
list).
5. Also I have fixed some bugs.
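A toy, self-contained C illustration of the switch-over rule in point 4
(the layout is assumed, not the patch code): a list of row numbers is kept
only while it is smaller than the equivalent bitmap.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int     nrows = 30000;                  /* sample rows */
    size_t  bitmap_bytes = (nrows + 7) / 8; /* 1 bit per row = 3750 bytes */

    /* a list of uint32 row numbers wins only while it stays short */
    for (int nmatches = 0; nmatches <= 1500; nmatches += 500)
    {
        size_t  list_bytes = (size_t) nmatches * sizeof(uint32_t);

        printf("%5d matching rows: keep %s\n", nmatches,
               list_bytes <= bitmap_bytes ? "list" : "bitmap");
    }
    return 0;
}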
You can find all these changes in commit form in our GitHub repository
on the branch jsonb_stats_20220310 [1].
Updated results of the test:
levels  keys  uniq keys   paths   master   multi-pass   single-pass
                                    (ms)    (ms)  (MB)    (ms)  (MB)
-------------------------------------------------------------------
1 1 1 2 153 122 10 82 14
1 1 1000 1001 134 105 11 78 38
1 8 8 9 157 384 19 328 32
1 8 1000 1001 155 454 23 402 72
1 64 64 65 129 2889 45 2386 155
1 64 1000 1001 158 3990 94 1447 177
2 1 1 3 237 147 10 91 16
2 1 1000 30577 152 264 32 394 234
2 8 8 72 245 1943 37 1692 139
2 8 1000 852333 152 9175 678 OOM
2 64 64 4161 1784 ~1 hour 53 30018 1750
2 64 1000 1001001 4715 ~4 hours 1600 OOM
The last two multi-pass results are too slow because the JSONBs become
TOASTed. For measuring master in these tests, I disabled the
WIDTH_THRESHOLD check, which skips TOASTed values > 1kB.
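For context, that check lives in the stock analyze code; roughly
paraphrased from src/backend/commands/analyze.c (not verbatim, consult
the source):

/* Paraphrased: values wider than WIDTH_THRESHOLD bytes are counted but
 * excluded from the input to the MCV/histogram computations. */
#define WIDTH_THRESHOLD 1024

if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
{
    toowide_cnt++;
    continue;       /* skip the value, don't feed it to the stats */
}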
Next, I am going to try to disable all-paths collection and implement
collection of most common paths (and/or hashed paths maybe).
[1]: https://github.com/postgrespro/postgres/tree/jsonb_stats_20220310
--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
0001-Add-pg_operator.oprstat-for-derived-operator-statist-20220310.patch
0002-Add-stats_form_tuple-20220310.patch
0003-Add-symbolic-names-for-some-jsonb-operators-20220310.patch
0004-Add-helper-jsonb-functions-and-macros-20220310.patch
0005-Export-scalarineqsel-20220310.patch
0006-Add-jsonb-statistics-20220310.patch
On Fri, 4 Feb 2022 at 08:30, Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
On 2/4/22 03:47, Tomas Vondra wrote:
./json-generate.py 30000 2 8 1000 6 1000
Sorry, this should be (different order of parameters):
./json-generate.py 30000 2 1000 8 6 1000
Thanks, Tomas for this test case.
Hi Hackers,
For the last few days, I was testing on top of these JSON optimizer
patches and was taking help from Tomas Vondra to understand the patches
and the test results.
Thanks, Tomas, for your feedback and suggestions.
Below is the summary:
*Point 1)* analyze takes a very long time for large documents:
For large JSON documents, analyze takes a very long time compared to the
current head. For reference, I am attaching the test file (./json-generate.py
30000 2 1000 8 6 1000)
Head: analyze test ; Time: 120.864 ms
With patch: analyze test ; Time: more than 2 hours
analyze takes a very long time because with these patches, we first
iterate over all sample rows (in the above case 30000) and store all the
paths (here around 850k paths).
In another pass, we take one path at a time and collect stats for that
particular path by analyzing all the sample rows, and we continue this
process for all 850k paths. In other words, we do 850k loops, and in each
loop we extract values for a single path.
*Point 2)* memory consumption increases rapidly for large documents:
In the above test case, there are in total 851k paths, and to keep stats
for one path, we allocate 1120 bytes.
Total paths : 852689 ~ 852k
Memory for 1 path to keep stats: 1120 ~ 1 KB
(sizeof(JsonValueStats) = 1120 from “Analyze Column”)
Total memory for all paths: 852689 * 1120 = 955011680 ~ 955 MB
Each path needs some additional memory on top of this: while analyzing a
path, we allocate some more memory depending on frequency and other factors.
To keep all entries (851k paths) in the hash table, we use around 1GB of
memory, which is also very large.
*Point 3*) Review comment noticed by Tomas Vondra:
+ oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);
Above should be:
+ oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);
*Response from Tomas Vondra:*
The problem is "anl_context" is actually "Analyze", i.e. the context for
the whole ANALYZE command, for all the columns. But we only want to keep
those path stats while processing a particular column. At the end, after
processing all paths from a column, we need to "build" the final stats for
the column, and this result needs to go into the "Analyze" context. But all
the partial results need to go into the "Analyze Column" context.
*Point 4)*
+/*
+ * jsonAnalyzeCollectPath
+ * Extract a single path from JSON documents and collect its
values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+ JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+ JsonbValue jbvtmp;
+ JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+ JsonPathEntry *path;
+ JsonPathEntry **entries;
+ int i;
+
+ entries = palloc(sizeof(*entries) * pstats->depth);
+
+ /* Build entry array in direct order */
+ for (path = &pstats->path, i = pstats->depth - 1;
+ path->parent && i >= 0;
+ path = path->parent, i--)
+ entries[i] = path;
+
+ jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+ pfree(entries);
---- Many times we try to palloc with zero size, and entries then points
to invalid memory (because pstats->depth = 0), so I think we should not
try to palloc with 0.
*Fix:*
+ if (pstats->depth)
+ entries = palloc(sizeof(*entries) * pstats->depth);
From these points, we can say that we should rethink our design to collect
stats for all paths.
We can set limits (like MCV) for paths, or we can let the user give an
explicit path to collect stats for that path only, or we can pass a
subset of the JSON values.
In the above case, there are in total 851k paths, but we could collect
stats for only the 1000 most common paths; this way we minimize both time
and memory, and we might even keep at least frequencies for the
non-analyzed paths.
Next, I will take the latest patches from Nikita's last email and I will do
more tests.
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
This patch has bitrotted, presumably after the other JSON patchset was
applied. It looks like it's failing in the json header file so it may
be as simple as additional functions added on nearby lines.
Please rebase. Reminder, it's the last week of the commitfest so time
is of the essence....
I noticed some typos.
diff --git a/src/backend/utils/adt/jsonb_selfuncs.c b/src/backend/utils/adt/jsonb_selfuncs.c
index f5520f88a1d..d98cd7020a1 100644
--- a/src/backend/utils/adt/jsonb_selfuncs.c
+++ b/src/backend/utils/adt/jsonb_selfuncs.c
@@ -1342,7 +1342,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
path->stats = jsonStatsFindPath(stats, pathstr.data,
pathstr.len);
- /* Appeend path string entry for array elements, get stats. */
+ /* Append path string entry for array elements, get stats. */
jsonPathAppendEntry(&pathstr, NULL);
pstats = jsonStatsFindPath(stats, pathstr.data, pathstr.len);
freq = jsonPathStatsGetFreq(pstats, 0.0);
@@ -1367,7 +1367,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
case WJB_END_ARRAY:
{
struct Path *p = path;
- /* Absoulte selectivity of the path with its all subpaths */
+ /* Absolute selectivity of the path with its all subpaths */
Selectivity abs_sel = p->sel * p->freq;
/* Pop last path entry */
diff --git a/src/backend/utils/adt/jsonb_typanalyze.c b/src/backend/utils/adt/jsonb_typanalyze.c
index 7882db23a87..9a759aadafb 100644
--- a/src/backend/utils/adt/jsonb_typanalyze.c
+++ b/src/backend/utils/adt/jsonb_typanalyze.c
@@ -123,10 +123,9 @@ typedef struct JsonScalarStats
/*
* Statistics calculated for a set of values.
*
- *
* XXX This seems rather complicated and needs simplification. We're not
* really using all the various JsonScalarStats bits, there's a lot of
- * duplication (e.g. each JsonScalarStats contains it's own array, which
+ * duplication (e.g. each JsonScalarStats contains its own array, which
* has a copy of data from the one in "jsons").
*/
typedef struct JsonValueStats
@@ -849,7 +848,7 @@ jsonAnalyzePathValues(JsonAnalyzeContext *ctx, JsonScalarStats *sstats,
stats->stanullfrac = (float4)(1.0 - freq);
/*
- * Similarly, we need to correct the MCV frequencies, becuse those are
+ * Similarly, we need to correct the MCV frequencies, because those are
* also calculated only from the non-null values. All we need to do is
* simply multiply that with the non-NULL frequency.
*/
@@ -1015,7 +1014,7 @@ jsonAnalyzeBuildPathStats(JsonPathAnlStats *pstats)
/*
* We keep array length stats here for queries like jsonpath '$.size() > 5'.
- * Object lengths stats can be useful for other query lanuages.
+ * Object lengths stats can be useful for other query languages.
*/
if (vstats->arrlens.values.count)
jsonAnalyzeMakeScalarStats(&ps, "array_length", &vstats->arrlens.stats);
@@ -1069,7 +1068,7 @@ jsonAnalyzeCalcPathFreq(JsonAnalyzeContext *ctx, JsonPathAnlStats *pstats,
* We're done with accumulating values for this path, so calculate the
* statistics for the various arrays.
*
- * XXX I wonder if we could introduce some simple heuristict on which
+ * XXX I wonder if we could introduce some simple heuristic on which
* paths to keep, similarly to what we do for MCV lists. For example a
* path that occurred just once is not very interesting, so we could
* decide to ignore it and not build the stats. Although that won't
@@ -1414,7 +1413,7 @@ compute_json_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
/*
* Collect and analyze JSON path values in single or multiple passes.
- * Sigle-pass collection is faster but consumes much more memory than
+ * Single-pass collection is faster but consumes much more memory than
* collecting and analyzing by the one path at pass.
*/
if (ctx.single_pass)
On Fri, 1 Apr 2022 at 20:21, Greg Stark <stark@mit.edu> wrote:
This patch has bitrotted, presumably after the other JSON patchset was
applied. It looks like it's failing in the json header file so it may
be as simple as additional functions added on nearby lines.

Please rebase. Reminder, it's the last week of the commitfest so time
is of the essence....
Thanks, Greg for the report.
Here, I am attaching re-based patches of the v05 series. These patches
are re-based on the commit 7dd3ee508432730d15c5.
I noticed some typos.
diff --git a/src/backend/utils/adt/jsonb_selfuncs.c b/src/backend/utils/adt/jsonb_selfuncs.c
index f5520f88a1d..d98cd7020a1 100644

Thanks, Justin, for the review. We will fix these comments in the next version.
Next, I am going to try to disable all-paths collection and implement
collection of most common paths (and/or hashed paths maybe).
Thanks, Nikita, for the v04 series of patches. I tested on top of
your patches and verified that the time taken by analyze is reduced for
large complex JSON docs.
With the v03 patches, it was more than 2 hours, and with the v04 patches,
it is only 39 sec (time for Tomas's test case).
I am waiting for your patches (disable all-paths collection and
implement collection of most common paths)
Just for testing purposes, I am posting re-based patches here.
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com