WTF with hash index?

Started by Олег Самойловover 7 years ago10 messagesgeneral

Олег Самойлов

splarv@ya.ru

over 7 years ago

CentOS 7

$ rpm -q postgresql10
postgresql10-10.6-1PGDG.rhel7.x86_64

SQL script for psql:

\set table_size 1000000
begin;
create table gender (gender varchar);

insert into gender (gender) select case when random<0.50 then 'female' when random<0.99 then 'male' else 'other' end from (select random() as random, generate_series(1,:table_size)) as subselect;

create index gender_btree on gender using btree (gender);
create index gender_hash on gender using hash (gender);
commit;
vacuum full analyze;

Vacuum full is not necessary here, just a little vodoo programming. I expected that the hash index will be much smaller and quicker than the btree index, because it doesn’t keep values inside itself, only hashes. But:

=> \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | gender | table | olleg | 35 MB |
(1 row)

=> \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+-------+--------+-------+-------------
public | gender_btree | index | olleg | gender | 21 MB |
public | gender_hash | index | olleg | gender | 47 MB |
(2 rows)

The hash index not only is more than the btree index, but also is bigger than the table itself. What is wrong with the hash index?

laurenz.albe@cybertec.at

over 7 years ago

In reply to: Олег Самойлов (#1)

Re: WTF with hash index?

Олег Самойлов wrote:

\set table_size 1000000
begin;
create table gender (gender varchar);

insert into gender (gender) select case when random<0.50 then 'female' when random<0.99 then 'male' else 'other' end from (select random() as random, generate_series(1,:table_size)) as subselect;

create index gender_btree on gender using btree (gender);
create index gender_hash on gender using hash (gender);
commit;
vacuum full analyze;

Vacuum full is not necessary here, just a little vodoo programming. I expected that the hash index will be much smaller and quicker than the btree index, because it doesn’t keep values inside itself, only hashes. But:

=> \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | gender | table | olleg | 35 MB |
(1 row)

=> \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+-------+--------+-------+-------------
public | gender_btree | index | olleg | gender | 21 MB |
public | gender_hash | index | olleg | gender | 47 MB |
(2 rows)

The hash index not only is more than the btree index, but also is bigger than the table itself. What is wrong with the hash index?

I guess the problem here is that there are so few distinct values, so
all the index items end up in only three hash buckets, forming large
linked lists.

I can't tell off-hand why that would make the index so large though.

Anyway, indexes are pretty useless in such a case.

Is the behavior the same if you have many distinct values?

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Andreas Kretschmer

andreas@a-kretschmer.de

over 7 years ago

In reply to: Олег Самойлов (#1)

Re: WTF with hash index?

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then 'female'
when random<0.99 then 'male' else 'other' end from (select random() as
random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

Regards, Andreas
--

2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com

ronljohnsonjr@gmail.com

over 7 years ago

In reply to: Andreas Kretschmer (#3)

Re: WTF with hash index?

On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then 'female'
when random<0.99 then 'male' else 'other' end from (select random() as
random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

select case when random<0.50 then 'female'
when random<0.99 then 'male'
else 'other' end
from (select random() as random, generate_series(1,:table_size)) as subselect;

Shouldn't that make 49% male?

--
Angular momentum makes the world go 'round.

Andreas Kretschmer

andreas@a-kretschmer.de

over 7 years ago

In reply to: Ron (#4)

Re: WTF with hash index?

Am 13.11.2018 um 19:12 schrieb Ron:

On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then
'female' when random<0.99 then 'male' else 'other' end from (select
random() as random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

select case when random<0.50 then 'female'
when random<0.99 then 'male'
else 'other' end
from (select random() as random, generate_series(1,:table_size)) as
subselect;

Shouldn't that make 49% male?

you are right, my fault :-(.
the case - statement will be left if the first condition is true.

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com

ronljohnsonjr@gmail.com

over 7 years ago

In reply to: Andreas Kretschmer (#5)

Re: WTF with hash index?

On 11/13/2018 12:39 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 19:12 schrieb Ron:

On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then 'female'
when random<0.99 then 'male' else 'other' end from (select random() as
random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

select case when random<0.50 then 'female'
when random<0.99 then 'male'
else 'other' end
from (select random() as random, generate_series(1,:table_size)) as
subselect;

Shouldn't that make 49% male?

you are right, my fault :-(.
the case - statement will be left if the first condition is true.

I had to read it thrice. :) It's an example of why I like white space and
indenting...

--
Angular momentum makes the world go 'round.

Олег Самойлов

splarv@ya.ru

over 7 years ago

In reply to: Laurenz Albe (#2)

Re: WTF with hash index?

I am just doing experiment what a type a most suitable for enumeration in PostgreSQL. And what index. And this effect looked for me very strange. There is in the PostgreSQL one another hash index. This is gin(jsonb_path_ops) for the jsob type. It is also use hash internally, but it is much better.
Example based on the previous example.

create table jender (jdoc jsonb);

insert into jender (jdoc) select ('{"gender": "'||gender||'"}')::jsonb from gender;

create index jender_hash on jender using gin (jdoc jsonb_path_ops);

=> \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | gender | table | olleg | 35 MB |
public | jender | table | olleg | 54 MB |
(2 rows)

=> \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+-------+--------+---------+-------------
public | gender_btree | index | olleg | gender | 21 MB |
public | gender_hash | index | olleg | gender | 47 MB |
public | jender_hash | index | olleg | jender | 1104 kB |
(3 rows)

Very much better. What about to copy paste algorithm from gin(jsonb_path_ops) to the hash index?

alvherre@2ndquadrant.com

over 7 years ago

In reply to: Олег Самойлов (#7)

Re: WTF with hash index?

On 2018-Nov-13, Олег Самойлов wrote:

Very much better. What about to copy paste algorithm from
gin(jsonb_path_ops) to the hash index?

You're welcome to submit patches.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Олег Самойлов

splarv@ya.ru

over 7 years ago

In reply to: Alvaro Herrera (#8)

Re: WTF with hash index?

Ah, thanks. I am not a developer of PostgreSQL. I am a developer in PostgreSQL. :) And I see two hash indexes on the same data and one of them 43 times bigger then other, this looked like something terribly wrong. Just free idea how to considerably improve your product.

Show quoted text

13 нояб. 2018 г., в 22:37, Alvaro Herrera <alvherre@2ndquadrant.com> написал(а):

On 2018-Nov-13, Олег Самойлов wrote:

Very much better. What about to copy paste algorithm from
gin(jsonb_path_ops) to the hash index?

You're welcome to submit patches.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

sfrost@snowman.net

over 7 years ago

In reply to: Олег Самойлов (#9)

Re: WTF with hash index?

Greetings,

* Олег Самойлов (splarv@ya.ru) wrote:

Ah, thanks. I am not a developer of PostgreSQL. I am a developer in PostgreSQL. :) And I see two hash indexes on the same data and one of them 43 times bigger then other, this looked like something terribly wrong. Just free idea how to considerably improve your product.

Ultimately, it might be less improvement overall and more of a
regression though, in the common cases.

This use-case isn't common for a hash index. If we changed all hash
indexes to instead be gin indexes of jsonb values, we'd end up with the
common hash index use-cases being much, much worse.

In the end, this comes down to choosing the right index for your data
and your queries, which is actually a rather difficult problem to handle
in some kind of automated way as there are lots of different trade-offs
to consider.

Thanks!

Stephen