Define hash partition for certain column values

Started by Голубева Янаover 5 years ago10 messagesgeneral

ishsha@yandex.ru

over 5 years ago

<div>Hello,</div><div> </div><div>I've found in source code that there is a function satisfies_hash_partition(oid, modulus, remainder, column_values[]) which allows to check if the certain column value will be placed in the certain partition.</div><div>I' d like to know if there is an opportunity not to check the certain partition but to define which partition will be the certain column value placed in. Something like define_hash_partition(oid, column_values[]) which returns the remainder value.</div><div>It is not very convenient to use satisfies_hash_partition for this purpose because there could be a lot of partitions and it takes time to check each one of them.</div><div>If there isn't such opportunity for now I think it would be useful to have such feature in the next releases. What do you think about it?</div><div> </div><div>Thank you!</div><div>Iana Golubeva</div>

Tom Lane

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Голубева Яна (#1)

Re: Define hash partition for certain column values

=?utf-8?B?0JPQvtC70YPQsdC10LLQsCDQr9C90LA=?= <ishsha@yandex.ru> writes:

Hello, I've found in source code that there is a function satisfies_hash_partition(oid, modulus, remainder, column_values[]) which allows to check if the certain column value will be placed in the certain partition. I' d like to know if there is an opportunity not to check the certain partition but to define which partition will be the certain column value placed in.

If you want to control what goes where, use list partitioning (or,
perhaps, range partitioning). Hash is only suitable if you do not
care which partition any particular row goes to.

Personally, I think hash partitioning is mostly academic, precisely
because of that. If the partitioning doesn't line up with application
requirements, you give up too much of the benefit of using partitions.

regards, tom lane

Ron

ronljohnsonjr@gmail.com

over 5 years ago

In reply to: Tom Lane (#2)

Re: Define hash partition for certain column values

On 1/11/21 12:36 AM, Tom Lane wrote:

=?utf-8?B?0JPQvtC70YPQsdC10LLQsCDQr9C90LA=?= <ishsha@yandex.ru> writes:

Hello, I've found in source code that there is a function satisfies_hash_partition(oid, modulus, remainder, column_values[]) which allows to check if the certain column value will be placed in the certain partition. I' d like to know if there is an opportunity not to check the certain partition but to define which partition will be the certain column value placed in.

If you want to control what goes where, use list partitioning (or,
perhaps, range partitioning). Hash is only suitable if you do not
care which partition any particular row goes to.

Personally, I think hash partitioning is mostly academic, precisely
because of that. If the partitioning doesn't line up with application
requirements, you give up too much of the benefit of using partitions.

In non-MBCC systems, hash partitioning minimizes lock conflicts because the
writes aren't all going into the same page. OLTP systems can use this
feature to distribute writes across pages; some also allow for "mixed
pages", where records from multiple tables get written to the same page.
(This then means that one DIO is used to read a parent and all it's child
records. Naturally, range reports are *very* slow, but sometimes OLTP
performance is paramount.)

--
Angular momentum makes the world go 'round.

Голубева Яна

ishsha@yandex.ru

over 5 years ago

In reply to: Tom Lane (#2)

Re: Define hash partition for certain column values

<div><div>Thank you for the answer!</div><div> </div><div>List or range partitioning isn't suitable for my case.</div><div>I am using a column of numeric(20) type as a base for partitioning. The values of the column are generated randomly. </div><div>So there will be too many partitions if I use list partitioning as is.</div><div> </div><div>I've tried to use an expression with this column for list partitioning in order to restrict the number of partitions. But in that case I can't create an index because of</div><div>ERROR: unsupported UNIQUE constraint with partition key definition</div><div>DETAIL: UNIQUE constraints cannot be used when partition keys include expressions.</div><div> </div><div>So that the only option I see is to use hash partitioning by this column.</div><div> </div><div>I do not need to control which partition any particular row goes to. But I would like to have an opportunity to prepare pieces of data related to the certain partition. satisfies_hash_partition function is pretty good for this purpose but not for my case.</div><div> </div><div>My partitioning table is filling by insert with a query. The query is pretty complicated and contains several joins of different tables. Data replication into partitioning table occurs regularly.</div><div> </div><div>So, I'd like to create an index which allows me to align data in a basic table with data in the partitioning table based on partitions the data will be placed in. This way I'll be able to manipulate data in different partitions independently. Also I suppose that it could help to improve performance since data refill occurs regularly.</div><div> </div><div>For now I've created a custom C-function define_hash_partition for test and it seems to be working well for my case although I haven't finished my tests yet.</div><div> </div><div>Thank you!</div><div>Iana Golubeva</div></div><div> </div><div><br /></div><div><br /></div><div>11.01.2021, 09:36, "Tom Lane" <tgl@sss.pgh.pa.us>:</div><blockquote><p>=?utf-8?B?0JPQvtC70YPQsdC10LLQsCDQr9C90LA=?= <<a href="mailto:ishsha@yandex.ru">ishsha@yandex.ru</a>> writes:<br /></p><blockquote class="210e7a848e8fcb45wmi-quote"> Hello, I've found in source code that there is a function satisfies_hash_partition(oid, modulus, remainder, column_values[]) which allows to check if the certain column value will be placed in the certain partition. I' d like to know if there is an opportunity not to check the certain partition but to define which partition will be the certain column value placed in.<br /></blockquote><p><br />If you want to control what goes where, use list partitioning (or,<br />perhaps, range partitioning). Hash is only suitable if you do not<br />care which partition any particular row goes to.<br /><br />Personally, I think hash partitioning is mostly academic, precisely<br />because of that. If the partitioning doesn't line up with application<br />requirements, you give up too much of the benefit of using partitions.<br /><br /> regards, tom lane<br /></p></blockquote>

Michael Lewis

mlewis@entrata.com

over 5 years ago

In reply to: Голубева Яна (#4)

Re: Define hash partition for certain column values

On Tue, Jan 12, 2021 at 1:21 AM Голубева Яна <ishsha@yandex.ru> wrote:

List or range partitioning isn't suitable for my case.
I am using a column of numeric(20) type as a base for partitioning. The
values of the column are generated randomly.
So there will be too many partitions if I use list partitioning as is.

Sorry, but why is range not suited for this? It would seem fairly trivial
to create 50 or 1000 partitions to break up the range of values allowed by
your field definition.

Голубева Яна

ishsha@yandex.ru

over 5 years ago

In reply to: Michael Lewis (#5)

Re: Define hash partition for certain column values

<div>Values for the key partitioning column are generated randomly and I can't predict their distribution between ranges.</div><div>If I just create some ranges I won't have any guarantee that partitions will have similar amount of data. It is possible that I will have 2 or 3 extremely big partitions and a bit of data in others.</div><div> </div><div>Thank you,</div><div>Iana Golubeva</div><div><br /></div><div><br /></div><div>12.01.2021, 17:55, "Michael Lewis" <mlewis@entrata.com>:</div><blockquote><div dir="ltr"><div dir="ltr">On Tue, Jan 12, 2021 at 1:21 AM Голубева Яна <<a href="mailto:ishsha@yandex.ru">ishsha@yandex.ru</a>> wrote:<br /></div><div class="f13ca48719c8a60033905b23b39675agmail_quote"><blockquote class="f13ca48719c8a60033905b23b39675agmail_quote" style="border-left-color:rgb( 204 , 204 , 204 );border-left-style:solid;border-left-width:1px;margin:0px 0px 0px 0.8ex;padding-left:1ex"><div>List or range partitioning isn't suitable for my case.<br /></div><div>I am using a column of numeric(20) type as a base for partitioning. The values of the column are generated randomly. </div><div>So there will be too many partitions if I use list partitioning as is.<br /></div></blockquote><div><br /></div><div>Sorry, but why is range not suited for this? It would seem fairly trivial to create 50 or 1000 partitions to break up the range of values allowed by your field definition.</div></div></div>
</blockquote>

Rob Sargent

robjsargent@gmail.com

over 5 years ago

In reply to: Голубева Яна (#6)

Re: Define hash partition for certain column values

On 1/12/21 8:51 AM, Голубева Яна wrote:

Values for the key partitioning column are generated randomly and I
can't predict their distribution between ranges.
If I just create some ranges I won't have any guarantee that
partitions will have similar amount of data. It is possible that I
will have 2 or 3 extremely big partitions and a bit of data in others.
Thank you,
Iana Golubeva

12.01.2021, 17:55, "Michael Lewis" <mlewis@entrata.com>:

On Tue, Jan 12, 2021 at 1:21 AM Голубева Яна <ishsha@yandex.ru
<mailto:ishsha@yandex.ru>> wrote:

List or range partitioning isn't suitable for my case.
I am using a column of numeric(20) type as a base for
partitioning. The values of the column are generated randomly.
So there will be too many partitions if I use list
partitioning as is.

Sorry, but why is range not suited for this? It would seem fairly
trivial to create 50 or 1000 partitions to break up the range of
values allowed by your field definition.

Doesn't sound like a good definition of random.

Alban Hertroys

haramrae@gmail.com

over 5 years ago

In reply to: Голубева Яна (#6)

Re: Define hash partition for certain column values

On 12 Jan 2021, at 16:51, Голубева Яна <ishsha@yandex.ru> wrote:

Values for the key partitioning column are generated randomly and I can't predict their distribution between ranges.
If I just create some ranges I won't have any guarantee that partitions will have similar amount of data. It is possible that I will have 2 or 3 extremely big partitions and a bit of data in others.

A hash of a random number is also random, so when using hashes for partitioning you will get the same problem.

If you want to distribute values equally over a fixed number of partitions, I suggest you partition on a modulo of a monotonously increasing number (a sequence for example), instead of relying on a random number.

12.01.2021, 17:55, "Michael Lewis" <mlewis@entrata.com>:
On Tue, Jan 12, 2021 at 1:21 AM Голубева Яна <ishsha@yandex.ru> wrote:
List or range partitioning isn't suitable for my case.
I am using a column of numeric(20) type as a base for partitioning. The values of the column are generated randomly.
So there will be too many partitions if I use list partitioning as is.

Sorry, but why is range not suited for this? It would seem fairly trivial to create 50 or 1000 partitions to break up the range of values allowed by your field definition.

Alban Hertroys
--
There is always an exception to always.

Michael Lewis

mlewis@entrata.com

over 5 years ago

In reply to: Alban Hertroys (#8)

Re: Define hash partition for certain column values

On Tue, Jan 12, 2021 at 9:37 AM Alban Hertroys <haramrae@gmail.com> wrote:

On 12 Jan 2021, at 16:51, Голубева Яна <ishsha@yandex.ru> wrote:

Values for the key partitioning column are generated randomly and I

can't predict their distribution between ranges.

If I just create some ranges I won't have any guarantee that partitions

will have similar amount of data. It is possible that I will have 2 or 3
extremely big partitions and a bit of data in others.

A hash of a random number is also random, so when using hashes for
partitioning you will get the same problem.

If you want to distribute values equally over a fixed number of
partitions, I suggest you partition on a modulo of a monotonously
increasing number (a sequence for example), instead of relying on a random
number.

12.01.2021, 17:55, "Michael Lewis" <mlewis@entrata.com>:
On Tue, Jan 12, 2021 at 1:21 AM Голубева Яна <ishsha@yandex.ru> wrote:
List or range partitioning isn't suitable for my case.
I am using a column of numeric(20) type as a base for partitioning. The

values of the column are generated randomly.

So there will be too many partitions if I use list partitioning as is.

Sorry, but why is range not suited for this? It would seem fairly

trivial to create 50 or 1000 partitions to break up the range of values
allowed by your field definition.

Alban Hertroys

That said, there is no reason you should need near-perfectly-even
distribution anyway. You can also split partitions later, or do another
level of partitioning on large partitions if they somehow end up
significantly unbalanced.

How many rows are we talking about initially/over time? Do you plan to drop
old data at all? Perhaps the initial decision to partition was decided on a
bit too hastily.

#10

Голубева Яна

ishsha@yandex.ru

over 5 years ago

In reply to: Michael Lewis (#9)

Re: Define hash partition for certain column values

<div><div><div>Sorry, I think I 've described my case not precisely enough.</div><div> </div><div>"Randomly" is not pure random in my case.</div><div> </div><div>My solution is planning to be used on different servers with different DBs. The initial data of the base table depends on the DB. But I know that the key value of the new rows is increasing. Not monotonously, but still.</div><div>I need a common solution for all DBs.</div><div> </div><div>The size of a base table could be very different (from millions to hundreds of billions). For tests I've used 2 different dumps.</div><div> </div><div>Ranges that were suitable for the first dump show for the second the situation like I've described (2-3 partitions with 95% of data). And vice versa.</div><div>Besides, permanent increasing of key value of new rows means that some ranges will be permanently increasing </div><div>meanwhile others will have the same amount of data or even less (outdated data is clearing).</div><div> </div><div>Hash partitioning shows that I will have partitions with not exactly the same size of data but similar enough. And this result is actual for both dumps.</div><div> </div><div>So that I've decided to use hash partitioning.</div><div> </div><div>Thank you,</div><div>Iana Golubeva</div></div></div><div><br /></div><div><br /></div><div>12.01.2021, 19:41, "Michael Lewis" <mlewis@entrata.com>:</div><blockquote><div dir="ltr"><div dir="ltr">On Tue, Jan 12, 2021 at 9:37 AM Alban Hertroys <<a href="mailto:haramrae@gmail.com">haramrae@gmail.com</a>> wrote:<br /></div><div class="f13ca48719c8a60033905b23b39675agmail_quote"><blockquote class="f13ca48719c8a60033905b23b39675agmail_quote" style="border-left-color:rgb( 204 , 204 , 204 );border-left-style:solid;border-left-width:1px;margin:0px 0px 0px 0.8ex;padding-left:1ex"><br />
> On 12 Jan 2021, at 16:51, Голубева Яна <<a href="mailto:ishsha@yandex.ru" target="_blank">ishsha@yandex.ru</a>> wrote:<br />
> <br />
> Values for the key partitioning column are generated randomly and I can't predict their distribution between ranges.<br />
> If I just create some ranges I won't have any guarantee that partitions will have similar amount of data. It is possible that I will have 2 or 3 extremely big partitions and a bit of data in others.<br />
<br />
A hash of a random number is also random, so when using hashes for partitioning you will get the same problem.<br />
<br />
If you want to distribute values equally over a fixed number of partitions, I suggest you partition on a modulo of a monotonously increasing number (a sequence for example), instead of relying on a random number.<br />
<br />
> 12.01.2021, 17:55, "Michael Lewis" <<a href="mailto:mlewis@entrata.com" target="_blank">mlewis@entrata.com</a>>:<br />
> On Tue, Jan 12, 2021 at 1:21 AM Голубева Яна <<a href="mailto:ishsha@yandex.ru" target="_blank">ishsha@yandex.ru</a>> wrote:<br />
> List or range partitioning isn't suitable for my case.<br />
> I am using a column of numeric(20) type as a base for partitioning. The values of the column are generated randomly. <br />
> So there will be too many partitions if I use list partitioning as is.<br />
> <br />
> Sorry, but why is range not suited for this? It would seem fairly trivial to create 50 or 1000 partitions to break up the range of values allowed by your field definition.<br />
<br />
Alban Hertroys</blockquote><div><br /></div><div>That said, there is no reason you should need near-perfectly-even distribution anyway. You can also split partitions later, or do another level of partitioning on large partitions if they somehow end up significantly unbalanced.</div><div><br /></div><div>How many rows are we talking about initially/over time? Do you plan to drop old data at all? Perhaps the initial decision to partition was decided on a bit too hastily.</div></div></div>
</blockquote>