Skip partition tuple routing with constant partition key

houzj.fnst@fujitsu.com

about 5 years ago

Hi,

When loading some data into a partitioned table for testing purpose,

I found even if I specified constant value for the partition key[1]--------------------, it still do

the tuple routing for each row.

[1]: --------------------

UPDATE partitioned set part_key = 2 , …

INSERT into partitioned(part_key, ...) select 1, …

---------------------

I saw such SQLs automatically generated by some programs,

So , personally, It’d be better to skip the tuple routing for this case.

IMO, we can use the following steps to skip the tuple routing:

1) collect the column that has constant value in the targetList.

2) compare the constant column with the columns used in partition key.

3) if all the columns used in key are constant then we cache the routed partition

　and do not do the tuple routing again.

In this approach, I did some simple and basic performance tests:

----For plain single column partition key.(partition by range(col)/list(a)...)

When loading 100000000 rows into the table, I can see about 5-7% performance gain

for both cross-partition UPDATE and INSERT if specified constant for the partition key.

----For more complicated expression partition key(partition by range(UDF_func(col)+x)…)

When loading 100000000 rows into the table, it will bring more performance gain.

About > 20% performance gain

Besides, I did not see noticeable performance degradation for other cases(small data set).

Attaching a POC patch about this improvement.

Thoughts ?

Best regards,

houzj

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#1)

Re: Skip partition tuple routing with constant partition key

On Mon, May 17, 2021 at 8:37 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

When loading some data into a partitioned table for testing purpose,

I found even if I specified constant value for the partition key[1], it still do

the tuple routing for each row.

[1]---------------------

UPDATE partitioned set part_key = 2 , …

INSERT into partitioned(part_key, ...) select 1, …

---------------------

I saw such SQLs automatically generated by some programs,

Hmm, does this seem common enough for the added complexity to be worthwhile?

For an example of what's previously been considered worthwhile for a
project like this, see what 0d5f05cde0 did. The cases it addressed
are common enough -- a file being loaded into a (time range-)
partitioned table using COPY FROM tends to have lines belonging to the
same partition consecutively placed.

--
Amit Langote
EDB: http://www.enterprisedb.com

dgrowleyml@gmail.com

about 5 years ago

In reply to: Amit Langote (#2)

Re: Skip partition tuple routing with constant partition key

On Tue, 18 May 2021 at 01:31, Amit Langote <amitlangote09@gmail.com> wrote:

Hmm, does this seem common enough for the added complexity to be worthwhile?

I'd also like to know if there's some genuine use case for this. For
testing purposes does not seem to be quite a good enough reason.

A slightly different optimization that I have considered and even
written patches before was to have ExecFindPartition() cache the last
routed to partition and have it check if the new row can go into that
one on the next call. I imagined there might be a use case for
speeding that up for RANGE partitioned tables since it seems fairly
likely that most use cases, at least for time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it
seems most likely that we'll be routing to a different partition each
time and wouldn't save much since routing to hash partitions are
cheaper than other types. LIST partitioning I'm not so sure about. It
seems much less likely than RANGE to hit the same partition twice in a
row.

IIRC, the patch did something like call ExecPartitionCheck() on the
new tuple with the previously routed to ResultRelInfo. I think the
last used partition was cached somewhere like relcache (which seems a
bit questionable). Likely this would speed up the example case here
a bit. Not as much as the proposed patch, but it would likely apply in
many more cases.

I don't think I ever posted the patch to the list, and if so I no
longer have access to it, so it would need to be done again.

David

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: David Rowley (#3)

RE: Skip partition tuple routing with constant partition key

Hmm, does this seem common enough for the added complexity to be

worthwhile?

I'd also like to know if there's some genuine use case for this. For testing
purposes does not seem to be quite a good enough reason.

Thanks for the response.

For some big data scenario, we sometimes transfer data from one table(only store not expired data)
to another table(historical data) for future analysis.
In this case, we import data into historical table regularly(could be one day or half a day),
And the data is likely to be imported with date label specified, then all of the data to be
imported this time belong to the same partition which partition by time range.

So, personally, It will be nice if postgres can skip tuple routing for each row in this scenario.

A slightly different optimization that I have considered and even written
patches before was to have ExecFindPartition() cache the last routed to
partition and have it check if the new row can go into that one on the next call.
I imagined there might be a use case for speeding that up for RANGE
partitioned tables since it seems fairly likely that most use cases, at least for
time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it seems most
likely that we'll be routing to a different partition each time and wouldn't save
much since routing to hash partitions are cheaper than other types. LIST
partitioning I'm not so sure about. It seems much less likely than RANGE to hit
the same partition twice in a row.

I think your approach looks good too,
and it seems does not conflict with the approach proposed here.

Best regards,
houzj

Michael Paquier

michael@paquier.xyz

about 5 years ago

In reply to: David Rowley (#3)

Re: Skip partition tuple routing with constant partition key

On Tue, May 18, 2021 at 01:27:48PM +1200, David Rowley wrote:

A slightly different optimization that I have considered and even
written patches before was to have ExecFindPartition() cache the last
routed to partition and have it check if the new row can go into that
one on the next call. I imagined there might be a use case for
speeding that up for RANGE partitioned tables since it seems fairly
likely that most use cases, at least for time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it
seems most likely that we'll be routing to a different partition each
time and wouldn't save much since routing to hash partitions are
cheaper than other types. LIST partitioning I'm not so sure about. It
seems much less likely than RANGE to hit the same partition twice in a
row.

It depends a lot on the schema used and the load pattern, but I'd like
to think that a similar argument can be made in favor of LIST
partitioning here.
--
Michael

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: David Rowley (#3)

Re: Skip partition tuple routing with constant partition key

On Tue, May 18, 2021 at 10:28 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 18 May 2021 at 01:31, Amit Langote <amitlangote09@gmail.com> wrote:

Hmm, does this seem common enough for the added complexity to be worthwhile?

I'd also like to know if there's some genuine use case for this. For
testing purposes does not seem to be quite a good enough reason.

A slightly different optimization that I have considered and even
written patches before was to have ExecFindPartition() cache the last
routed to partition and have it check if the new row can go into that
one on the next call. I imagined there might be a use case for
speeding that up for RANGE partitioned tables since it seems fairly
likely that most use cases, at least for time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it
seems most likely that we'll be routing to a different partition each
time and wouldn't save much since routing to hash partitions are
cheaper than other types. LIST partitioning I'm not so sure about. It
seems much less likely than RANGE to hit the same partition twice in a
row.

IIRC, the patch did something like call ExecPartitionCheck() on the
new tuple with the previously routed to ResultRelInfo. I think the
last used partition was cached somewhere like relcache (which seems a
bit questionable). Likely this would speed up the example case here
a bit. Not as much as the proposed patch, but it would likely apply in
many more cases.

I don't think I ever posted the patch to the list, and if so I no
longer have access to it, so it would need to be done again.

I gave a shot to implementing your idea and ended up with the attached
PoC patch, which does pass make check-world.

I do see some speedup:

-- creates a range-partitioned table with 1000 partitions
create unlogged table foo (a int) partition by range (a);
select 'create unlogged table foo_' || i || ' partition of foo for
values from (' || (i-1)*100000+1 || ') to (' || i*100000+1 || ');'
from generate_series(1, 1000) i;
\gexec

-- generates a 100 million record file
copy (select generate_series(1, 100000000)) to '/tmp/100m.csv' csv;

Times for loading that file compare as follows:

HEAD:

postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 31813.964 ms (00:31.814)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 31972.942 ms (00:31.973)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 32049.046 ms (00:32.049)

Patched:

postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 26151.158 ms (00:26.151)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 28161.082 ms (00:28.161)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 26700.908 ms (00:26.701)

I guess it would be nice if we could fit in a solution for the use
case that houjz mentioned as a special case. BTW, houjz, could you
please check if a patch like this one helps the case you mentioned?

--
Amit Langote
EDB: http://www.enterprisedb.com

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#4)

Re: Skip partition tuple routing with constant partition key

On Tue, May 18, 2021 at 11:11 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

Hmm, does this seem common enough for the added complexity to be

worthwhile?

I'd also like to know if there's some genuine use case for this. For testing
purposes does not seem to be quite a good enough reason.

Thanks for the response.

For some big data scenario, we sometimes transfer data from one table(only store not expired data)
to another table(historical data) for future analysis.
In this case, we import data into historical table regularly(could be one day or half a day),
And the data is likely to be imported with date label specified, then all of the data to be
imported this time belong to the same partition which partition by time range.

Is directing that data directly into the appropriate partition not an
acceptable solution to address this particular use case? Yeah, I know
we should avoid encouraging users to perform DML directly on
partitions, but...

--
Amit Langote
EDB: http://www.enterprisedb.com

about 5 years ago

In reply to: Amit Langote (#7)

RE: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09@gmail.com>

On Tue, May 18, 2021 at 11:11 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

For some big data scenario, we sometimes transfer data from one table(only

store not expired data)

to another table(historical data) for future analysis.
In this case, we import data into historical table regularly(could be one day or

half a day),

And the data is likely to be imported with date label specified, then all of the

data to be

imported this time belong to the same partition which partition by time range.

Is directing that data directly into the appropriate partition not an
acceptable solution to address this particular use case? Yeah, I know
we should avoid encouraging users to perform DML directly on
partitions, but...

Yes, I want to make/keep it possible that application developers can be unaware of partitions. I believe that's why David-san, Alvaro-san, and you have made great efforts to improve partitioning performance. So, I'm +1 for what Hou-san is trying to achieve.

Is there something you're concerned about? The amount and/or complexity of added code?

Regards
Takayuki Tsunakawa

dgrowleyml@gmail.com

about 5 years ago

In reply to: Amit Langote (#6)

Re: Skip partition tuple routing with constant partition key

On Thu, 20 May 2021 at 01:17, Amit Langote <amitlangote09@gmail.com> wrote:

I gave a shot to implementing your idea and ended up with the attached
PoC patch, which does pass make check-world.

I only had a quick look at this.

+ if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+ dispatch->key->strategy == PARTITION_STRATEGY_RANGE))
+ dispatch->lastPartInfo = rri;

I think you must have meant to have one of these as PARTITION_STRATEGY_LIST?

Wondering what your thoughts are on, instead of caching the last used
ResultRelInfo from the last call to ExecFindPartition(), to instead
cached the last looked up partition index in PartitionDescData? That
way we could cache lookups between statements. Right now your caching
is not going to help for single-row INSERTs, for example.

For multi-level partition hierarchies that would still require looping
and checking the cached value at each level.

I've not studied the code that builds and rebuilds PartitionDescData,
so there may be some reason that we shouldn't do that. I know that's
changed a bit recently with DETACH CONCURRENTLY. However, providing
the cached index is not outside the bounds of the oids array, it
shouldn't really matter if the cached value happens to end up pointing
to some other partition. If that happens, we'll just fail the
ExecPartitionCheck() and have to look for the correct partition.

David

#10

dgrowleyml@gmail.com

about 5 years ago

In reply to: tsunakawa.takay@fujitsu.com (#8)

Re: Skip partition tuple routing with constant partition key

On Thu, 20 May 2021 at 12:20, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:

Yes, I want to make/keep it possible that application developers can be unaware of partitions. I believe that's why David-san, Alvaro-san, and you have made great efforts to improve partitioning performance. So, I'm +1 for what Hou-san is trying to achieve.

Is there something you're concerned about? The amount and/or complexity of added code?

It would be good to see how close Amit's patch gets to the performance
of the original patch on this thread. As far as I can see, the
difference is, aside from the setup code to determine if the partition
is constant, that Amit's patch just requires an additional
ExecPartitionCheck() call per row. That should be pretty cheap when
compared to the binary search to find the partition for a RANGE or
LIST partitioned table.

Houzj didn't mention how the table in the test was partitioned, so
it's hard to speculate how many comparisons would be done during a
binary search. Or maybe it was HASH partitioned and there was no
binary search.

David

#11

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: David Rowley (#9)

Re: Skip partition tuple routing with constant partition key

On Thu, May 20, 2021 at 9:31 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 20 May 2021 at 01:17, Amit Langote <amitlangote09@gmail.com> wrote:

I gave a shot to implementing your idea and ended up with the attached
PoC patch, which does pass make check-world.

I only had a quick look at this.
+ if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+ dispatch->key->strategy == PARTITION_STRATEGY_RANGE))
+ dispatch->lastPartInfo = rri;
I think you must have meant to have one of these as PARTITION_STRATEGY_LIST?

Oops, of course. Fixed in the attached.

Wondering what your thoughts are on, instead of caching the last used
ResultRelInfo from the last call to ExecFindPartition(), to instead
cached the last looked up partition index in PartitionDescData? That
way we could cache lookups between statements. Right now your caching
is not going to help for single-row INSERTs, for example.

Hmm, addressing single-row INSERTs with something like you suggest
might help time-range partitioning setups, because each of those
INSERTs are likely to be targeting the same partition most of the
time. Is that case what you had in mind? Although, in the cases
where that doesn't help, we'd end up making a ResultRelInfo for the
cached partition to check the partition constraint, only then to be
thrown away because the new row belongs to a different partition.
That overhead would not be free for sure.

For multi-level partition hierarchies that would still require looping
and checking the cached value at each level.

Yeah, there's no getting around that, though maybe that's not a big problem.

I've not studied the code that builds and rebuilds PartitionDescData,
so there may be some reason that we shouldn't do that. I know that's
changed a bit recently with DETACH CONCURRENTLY. However, providing
the cached index is not outside the bounds of the oids array, it
shouldn't really matter if the cached value happens to end up pointing
to some other partition. If that happens, we'll just fail the
ExecPartitionCheck() and have to look for the correct partition.

Yeah, as long as ExecFindPartition performs ExecPartitionCheck() on
before returning a given cached partition, there's no need to worry
about the cached index getting stale for whatever reason.

--
Amit Langote
EDB: http://www.enterprisedb.com

#12

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: tsunakawa.takay@fujitsu.com (#8)

Re: Skip partition tuple routing with constant partition key

On Thu, May 20, 2021 at 9:20 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>

On Tue, May 18, 2021 at 11:11 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

For some big data scenario, we sometimes transfer data from one table(only

store not expired data)

to another table(historical data) for future analysis.
In this case, we import data into historical table regularly(could be one day or

half a day),

And the data is likely to be imported with date label specified, then all of the

data to be

imported this time belong to the same partition which partition by time range.

Is directing that data directly into the appropriate partition not an
acceptable solution to address this particular use case? Yeah, I know
we should avoid encouraging users to perform DML directly on
partitions, but...

Yes, I want to make/keep it possible that application developers can be unaware of partitions. I believe that's why David-san, Alvaro-san, and you have made great efforts to improve partitioning performance. So, I'm +1 for what Hou-san is trying to achieve.

I'm very glad to see such discussions on the list, because it means
the partitioning feature is being stretched to cover wider set of use
cases.

Is there something you're concerned about? The amount and/or complexity of added code?

IMHO, a patch that implements caching more generally would be better
even if it adds some complexity. Hou-san's patch seemed centered
around the use case where all rows being loaded in a given command
route to the same partition, a very specialized case I'd say.

Maybe we can extract the logic in Hou-san's patch to check the
constant-ness of the targetlist producing the rows to insert and find
a way to add it to the patch I posted such that the generality of the
latter's implementation is not lost.

--
Amit Langote
EDB: http://www.enterprisedb.com

#13

dgrowleyml@gmail.com

about 5 years ago

In reply to: Amit Langote (#11)

Re: Skip partition tuple routing with constant partition key

On Thu, 20 May 2021 at 20:49, Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, May 20, 2021 at 9:31 AM David Rowley <dgrowleyml@gmail.com> wrote:

Wondering what your thoughts are on, instead of caching the last used
ResultRelInfo from the last call to ExecFindPartition(), to instead
cached the last looked up partition index in PartitionDescData? That
way we could cache lookups between statements. Right now your caching
is not going to help for single-row INSERTs, for example.

Hmm, addressing single-row INSERTs with something like you suggest
might help time-range partitioning setups, because each of those
INSERTs are likely to be targeting the same partition most of the
time. Is that case what you had in mind?

Yeah, I thought it would possibly be useful for RANGE partitioning. I
was a bit undecided with LIST. There seemed to be bigger risk there
that the usage pattern would route to a different partition each time.
In my imagination, RANGE partitioning seems more likely to see
subsequent tuples heading to the same partition as the last tuple.

Although, in the cases
where that doesn't help, we'd end up making a ResultRelInfo for the
cached partition to check the partition constraint, only then to be
thrown away because the new row belongs to a different partition.
That overhead would not be free for sure.

Yeah, there's certainly above zero overhead to getting it wrong. It
would be good to see benchmarks to find out what that overhead is.

David

#14

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: Amit Langote (#6)

RE: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09@gmail.com>
Sent: Wednesday, May 19, 2021 9:17 PM

I gave a shot to implementing your idea and ended up with the attached PoC
patch, which does pass make check-world.

I do see some speedup:

-- creates a range-partitioned table with 1000 partitions create unlogged table
foo (a int) partition by range (a); select 'create unlogged table foo_' || i || '
partition of foo for values from (' || (i-1)*100000+1 || ') to (' || i*100000+1 || ');'
from generate_series(1, 1000) i;
\gexec

-- generates a 100 million record file
copy (select generate_series(1, 100000000)) to '/tmp/100m.csv' csv;

Times for loading that file compare as follows:

HEAD:

postgres=# copy foo from '/tmp/100m.csv' csv; COPY 100000000
Time: 31813.964 ms (00:31.814)
postgres=# copy foo from '/tmp/100m.csv' csv; COPY 100000000
Time: 31972.942 ms (00:31.973)
postgres=# copy foo from '/tmp/100m.csv' csv; COPY 100000000
Time: 32049.046 ms (00:32.049)

Patched:

postgres=# copy foo from '/tmp/100m.csv' csv; COPY 100000000
Time: 26151.158 ms (00:26.151)
postgres=# copy foo from '/tmp/100m.csv' csv; COPY 100000000
Time: 28161.082 ms (00:28.161)
postgres=# copy foo from '/tmp/100m.csv' csv; COPY 100000000
Time: 26700.908 ms (00:26.701)

I guess it would be nice if we could fit in a solution for the use case that houjz
mentioned as a special case. BTW, houjz, could you please check if a patch like
this one helps the case you mentioned?

Thanks for the patch!
I did some test on it(using the table you provided above):

1): Test plain column in partition key.
SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD:
Time: 5493.392 ms (00:05.493)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 4484.492 ms (00:04.484)

The test results of your patch in this case looks good.
It can fit many more cases and the performance gain is nice.

-----------
2) Test expression in partition key

create or replace function partition_func(i int) returns int as $$
begin
return i;
end;
$$ language plpgsql immutable parallel restricted;
create unlogged table foo (a int) partition by range (partition_func(a));

SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD
Time: 8595.120 ms (00:08.595)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 12829.800 ms (00:12.830)

If add a user defined function in the partition key, it seems have
performance degradation after the patch.

I did some analysis on it, for the above testcase , ExecPartitionCheck
executed three expression 1) key is null 2) key > low 3) key < top
In this case, the "key" contains a funcexpr and the funcexpr will be executed
three times for each row, so, it bring extra overhead which cause the performance degradation.

IMO, improving the ExecPartitionCheck seems a better solution to it, we can
Calculate the key value in advance and use the value to do the bound check.
Thoughts ?

------------

Besides, are we going to add a reloption or guc to control this cache behaviour if we more forward with this approach ?
Because, If most of the rows to be inserted are routing to a different partition each time, then I think the extra ExecPartitionCheck
will become the overhead. Maybe it's better to apply both two approaches(cache the last partition and skip constant partition key)
which can achieve the best performance results.

Best regards,
houzj

#15

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#14)

Re: Skip partition tuple routing with constant partition key

Hou-san,

On Thu, May 20, 2021 at 7:35 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Wednesday, May 19, 2021 9:17 PM

I guess it would be nice if we could fit in a solution for the use case that houjz
mentioned as a special case. BTW, houjz, could you please check if a patch like
this one helps the case you mentioned?

Thanks for the patch!
I did some test on it(using the table you provided above):

Thanks a lot for doing that.

1): Test plain column in partition key.
SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD:
Time: 5493.392 ms (00:05.493)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 4484.492 ms (00:04.484)

The test results of your patch in this case looks good.
It can fit many more cases and the performance gain is nice.

Hmm yeah, not too bad.

2) Test expression in partition key

create or replace function partition_func(i int) returns int as $$
begin
return i;
end;
$$ language plpgsql immutable parallel restricted;
create unlogged table foo (a int) partition by range (partition_func(a));

SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD
Time: 8595.120 ms (00:08.595)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 12829.800 ms (00:12.830)

If add a user defined function in the partition key, it seems have
performance degradation after the patch.

Oops.

I did some analysis on it, for the above testcase , ExecPartitionCheck
executed three expression 1) key is null 2) key > low 3) key < top
In this case, the "key" contains a funcexpr and the funcexpr will be executed
three times for each row, so, it bring extra overhead which cause the performance degradation.

IMO, improving the ExecPartitionCheck seems a better solution to it, we can
Calculate the key value in advance and use the value to do the bound check.
Thoughts ?

This one seems bit tough. ExecPartitionCheck() uses the generic
expression evaluation machinery like a black box, which means
execPartition.c can't really tweal/control the time spent evaluating
partition constraints. Given that, we may have to disable the caching
when key->partexprs != NIL, unless we can reasonably do what you are
suggesting.

Besides, are we going to add a reloption or guc to control this cache behaviour if we more forward with this approach ?
Because, If most of the rows to be inserted are routing to a different partition each time, then I think the extra ExecPartitionCheck
will become the overhead. Maybe it's better to apply both two approaches(cache the last partition and skip constant partition key)
which can achieve the best performance results.

A reloption will have to be a last resort is what I can say about this
at the moment.

--
Amit Langote
EDB: http://www.enterprisedb.com

#16

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: Amit Langote (#15)

RE: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 20, 2021 8:23 PM

Hou-san,

On Thu, May 20, 2021 at 7:35 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

2) Test expression in partition key

create or replace function partition_func(i int) returns int as $$
begin
return i;
end;
$$ language plpgsql immutable parallel restricted; create unlogged
table foo (a int) partition by range (partition_func(a));

SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD
Time: 8595.120 ms (00:08.595)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 12829.800 ms (00:12.830)

If add a user defined function in the partition key, it seems have
performance degradation after the patch.

Oops.

I did some analysis on it, for the above testcase , ExecPartitionCheck
executed three expression 1) key is null 2) key > low 3) key < top In
this case, the "key" contains a funcexpr and the funcexpr will be
executed three times for each row, so, it bring extra overhead which cause

the performance degradation.

IMO, improving the ExecPartitionCheck seems a better solution to it,
we can Calculate the key value in advance and use the value to do the bound

check.

Thoughts ?

This one seems bit tough. ExecPartitionCheck() uses the generic expression
evaluation machinery like a black box, which means execPartition.c can't really
tweal/control the time spent evaluating partition constraints. Given that, we
may have to disable the caching when key->partexprs != NIL, unless we can
reasonably do what you are suggesting.[]

I did some research on the CHECK expression that ExecPartitionCheck() execute.
Currently for a normal RANGE partition key it will first generate a CHECK expression
like : [Keyexpression IS NOT NULL AND Keyexpression > lowboud AND Keyexpression < lowboud].
In this case, Keyexpression will be re-executed which will bring some overhead.

Instead, I think we can try to do the following step:
1)extract the Keyexpression from the CHECK expression
2)evaluate the key expression in advance
3)pass the result of key expression to do the partition CHECK.
In this way ,we only execute the key expression once which looks more efficient.

Attaching a POC patch about this approach.
I did some performance test with my laptop for this patch:

------------------------------------test cheap partition key expression

create unlogged table test_partitioned_inner (a int) partition by range ((abs(a) + a/50));
create unlogged table test_partitioned_inner_1 partition of test_partitioned_inner for values from (1) to (50);
create unlogged table test_partitioned_inner_2 partition of test_partitioned_inner for values from ( 50 ) to (100);
insert into test_partitioned_inner_1 select (i%48)+1 from generate_series(1,10000000,1) t(i);

BEFORE patch:
Execution Time: 6120.706 ms

AFTER patch:
Execution Time: 5705.967 ms

------------------------------------test expensive partition key expression
create or replace function partfunc(i int) returns int as
$$
begin
return i;
end;
$$ language plpgsql IMMUTABLE;

create unlogged table test_partitioned_inner (a int) partition by range (partfunc (a));
create unlogged table test_partitioned_inner_1 partition of test_partitioned_inner for values from (1) to (50);
create unlogged table test_partitioned_inner_2 partition of test_partitioned_inner for values from ( 50 ) to (100);

I think this can be a independent improvement for partitioncheck.

before patch:
Execution Time: 14048.551 ms

after patch:
Execution Time: 8810.518 ms

I think this patch can solve the performance degradation of key expression
after applying the [Save the last partition] patch.
Besides, this could be a separate patch which can improve some more cases.
Thoughts ?

Best regards,
houzj

#17

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#16)

RE: Skip partition tuple routing with constant partition key

From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>

I think this patch can solve the performance degradation of key expression after
applying the [Save the last partition] patch.
Besides, this could be a separate patch which can improve some more cases.
Thoughts ?

Thank you for proposing an impressive improvement so quickly! Yes, I'm in the mood for adopting Amit-san's patch as a base because it's compact and readable, and plus add this patch of yours to complement the partition key function case.

But ...

* Applying your patch alone produced a compilation error. I'm sorry I mistakenly deleted the compile log, but it said something like "There's a redeclaration of PartKeyContext in partcache.h; the original definition is in partdef.h"

* Hmm, this may be too much to expect, but I wonder if we can make the patch more compact...

Regards
Takayuki Tsunakawa

#18

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: tsunakawa.takay@fujitsu.com (#17)

RE: Skip partition tuple routing with constant partition key

From: Tsunakawa, Takayuki <tsunakawa.takay@fujitsu.com>
Sent: Monday, May 24, 2021 3:34 PM

From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>

I think this patch can solve the performance degradation of key
expression after applying the [Save the last partition] patch.
Besides, this could be a separate patch which can improve some more cases.
Thoughts ?

Thank you for proposing an impressive improvement so quickly! Yes, I'm in
the mood for adopting Amit-san's patch as a base because it's compact and
readable, and plus add this patch of yours to complement the partition key
function case.

Thanks for looking into this.

But ...

* Applying your patch alone produced a compilation error. I'm sorry I
mistakenly deleted the compile log, but it said something like "There's a
redeclaration of PartKeyContext in partcache.h; the original definition is in
partdef.h"

It seems a little strange, I have compiled it alone in two different linux machine and did
not find such an error. Did you compile it on a windows machine ?

* Hmm, this may be too much to expect, but I wonder if we can make the patch
more compact...

Of course, I will try to simplify the patch.

Best regards,
houzj

#19

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#18)

RE: Skip partition tuple routing with constant partition key

From: Hou, Zhijie/侯志杰 <houzj.fnst@fujitsu.com>

It seems a little strange, I have compiled it alone in two different linux machine
and did
not find such an error. Did you compile it on a windows machine ?

On Linux, it produces:

gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-s\
tatement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-securit\
y -fno-strict-aliasing -fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE -\
c -o heap.o heap.c -MMD -MP -MF .deps/heap.Po
In file included from heap.c:86:
../../../src/include/utils/partcache.h:54: error: redefinition of typedef 'Part\
KeyContext'
../../../src/include/partitioning/partdefs.h:26: note: previous declaration of \
'PartKeyContext' was here

Regards
Takayuki Tsunakawa

#20

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#18)

RE: Skip partition tuple routing with constant partition key

From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
Sent: Monday, May 24, 2021 3:58 PM

From: Tsunakawa, Takayuki <mailto:tsunakawa.takay@fujitsu.com>
Sent: Monday, May 24, 2021 3:34 PM

From: mailto:houzj.fnst@fujitsu.com <mailto:houzj.fnst@fujitsu.com>

I think this patch can solve the performance degradation of key
expression after applying the [Save the last partition] patch.
Besides, this could be a separate patch which can improve some more

cases.

Thoughts ?

Thank you for proposing an impressive improvement so quickly! Yes,
I'm in the mood for adopting Amit-san's patch as a base because it's
compact and readable, and plus add this patch of yours to complement
the partition key function case.

Thanks for looking into this.

But ...

* Applying your patch alone produced a compilation error. I'm sorry I
mistakenly deleted the compile log, but it said something like
"There's a redeclaration of PartKeyContext in partcache.h; the
original definition is in partdef.h"

It seems a little strange, I have compiled it alone in two different linux machine
and did not find such an error. Did you compile it on a windows machine ?

Ah, Maybe I found the issue.
Attaching a new patch, please have a try on this patch.

Best regards,
houzj

#21

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#16)

#22

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#20)

#23

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: Amit Langote (#21)

#24

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#23)

#25

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: Amit Langote (#24)

#26

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#25)

#27

zyu@yugabyte.com

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#25)

#28

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: Amit Langote (#26)

#29

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhihong Yu (#27)

#30

zyu@yugabyte.com

about 5 years ago

In reply to: Amit Langote (#29)

#31

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhihong Yu (#30)

#32

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#28)

#33

houzj.fnst@fujitsu.com

about 5 years ago

In reply to: Amit Langote (#32)

#34

Langote_Amit_f8@lab.ntt.co.jp

about 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#33)

#35

houzj.fnst@fujitsu.com

almost 5 years ago

In reply to: Amit Langote (#34)

#36

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#35)

#37

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Amit Langote (#36)

#38

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Amit Langote (#37)

#39

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Amit Langote (#38)

#40

houzj.fnst@fujitsu.com

almost 5 years ago

In reply to: Amit Langote (#39)

#41

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Zhijie Hou (Fujitsu) (#40)

#42

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Amit Langote (#41)

#43

zyu@yugabyte.com

almost 5 years ago

In reply to: Amit Langote (#42)

#44

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Zhihong Yu (#43)

#45

zyu@yugabyte.com

almost 5 years ago

In reply to: Amit Langote (#44)

#46

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Zhihong Yu (#45)

#47

Langote_Amit_f8@lab.ntt.co.jp

almost 5 years ago

In reply to: Amit Langote (#46)

#48

Bruce Momjian

bruce@momjian.us

about 4 years ago

In reply to: Amit Langote (#47)

#49

Langote_Amit_f8@lab.ntt.co.jp

about 4 years ago

In reply to: Bruce Momjian (#48)

#50

zyu@yugabyte.com

about 4 years ago

In reply to: Amit Langote (#49)

#51

Langote_Amit_f8@lab.ntt.co.jp

about 4 years ago

In reply to: Zhihong Yu (#50)

#52

Bruce Momjian

bruce@momjian.us

about 4 years ago

In reply to: Amit Langote (#51)

#53

Langote_Amit_f8@lab.ntt.co.jp

about 4 years ago

In reply to: Bruce Momjian (#52)

#54

Tom Lane

tgl@sss.pgh.pa.us

about 4 years ago

In reply to: Amit Langote (#53)

#55

Andres Freund

andres@anarazel.de

about 4 years ago

In reply to: Tom Lane (#54)

#56

Masahiko Sawada

sawada.mshk@gmail.com

about 4 years ago

In reply to: Andres Freund (#55)

#57

Amit Kapila

amit.kapila16@gmail.com

about 4 years ago

In reply to: Masahiko Sawada (#56)

#58

Masahiko Sawada

sawada.mshk@gmail.com

about 4 years ago

In reply to: Amit Kapila (#57)

#59

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Amit Langote (#51)

#60

Langote_Amit_f8@lab.ntt.co.jp

almost 4 years ago

In reply to: David Rowley (#59)

#61

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Amit Langote (#60)

#62

zyu@yugabyte.com

almost 4 years ago

In reply to: David Rowley (#61)

#63

Langote_Amit_f8@lab.ntt.co.jp

almost 4 years ago

In reply to: David Rowley (#61)

#64

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Amit Langote (#63)

#65

Langote_Amit_f8@lab.ntt.co.jp

almost 4 years ago

In reply to: David Rowley (#64)

#66

houzj.fnst@fujitsu.com

almost 4 years ago

In reply to: David Rowley (#64)

#67

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Amit Langote (#65)

#68