WTF with hash index?

Started by Олег Самойловover 7 years ago10 messagesgeneral
Jump to latest

CentOS 7

$ rpm -q postgresql10
postgresql10-10.6-1PGDG.rhel7.x86_64

SQL script for psql:

\set table_size 1000000
begin;
create table gender (gender varchar);

insert into gender (gender) select case when random<0.50 then 'female' when random<0.99 then 'male' else 'other' end from (select random() as random, generate_series(1,:table_size)) as subselect;

create index gender_btree on gender using btree (gender);
create index gender_hash on gender using hash (gender);
commit;
vacuum full analyze;

Vacuum full is not necessary here, just a little vodoo programming. I expected that the hash index will be much smaller and quicker than the btree index, because it doesn’t keep values inside itself, only hashes. But:

=> \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | gender | table | olleg | 35 MB |
(1 row)

=> \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+-------+--------+-------+-------------
public | gender_btree | index | olleg | gender | 21 MB |
public | gender_hash | index | olleg | gender | 47 MB |
(2 rows)

The hash index not only is more than the btree index, but also is bigger than the table itself. What is wrong with the hash index?

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Олег Самойлов (#1)
Re: WTF with hash index?

Олег Самойлов wrote:

\set table_size 1000000
begin;
create table gender (gender varchar);

insert into gender (gender) select case when random<0.50 then 'female' when random<0.99 then 'male' else 'other' end from (select random() as random, generate_series(1,:table_size)) as subselect;

create index gender_btree on gender using btree (gender);
create index gender_hash on gender using hash (gender);
commit;
vacuum full analyze;

Vacuum full is not necessary here, just a little vodoo programming. I expected that the hash index will be much smaller and quicker than the btree index, because it doesn’t keep values inside itself, only hashes. But:

=> \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | gender | table | olleg | 35 MB |
(1 row)

=> \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+-------+--------+-------+-------------
public | gender_btree | index | olleg | gender | 21 MB |
public | gender_hash | index | olleg | gender | 47 MB |
(2 rows)

The hash index not only is more than the btree index, but also is bigger than the table itself. What is wrong with the hash index?

I guess the problem here is that there are so few distinct values, so
all the index items end up in only three hash buckets, forming large
linked lists.

I can't tell off-hand why that would make the index so large though.

Anyway, indexes are pretty useless in such a case.

Is the behavior the same if you have many distinct values?

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#3Andreas Kretschmer
andreas@a-kretschmer.de
In reply to: Олег Самойлов (#1)
Re: WTF with hash index?

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then 'female'
when random<0.99 then 'male' else 'other' end from (select random() as
random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

Regards, Andreas
--

2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com

#4Ron
ronljohnsonjr@gmail.com
In reply to: Andreas Kretschmer (#3)
Re: WTF with hash index?

On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then 'female'
when random<0.99 then 'male' else 'other' end from (select random() as
random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

select case when random<0.50 then 'female'
when random<0.99 then 'male'
            else 'other' end
from (select random() as random, generate_series(1,:table_size)) as subselect;

Shouldn't that make 49% male?

--
Angular momentum makes the world go 'round.

#5Andreas Kretschmer
andreas@a-kretschmer.de
In reply to: Ron (#4)
Re: WTF with hash index?

Am 13.11.2018 um 19:12 schrieb Ron:

On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then
'female' when random<0.99 then 'male' else 'other' end from (select
random() as random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

select case when random<0.50 then 'female'
when random<0.99 then 'male'
            else 'other' end
from (select random() as random, generate_series(1,:table_size)) as
subselect;

Shouldn't that make 49% male?

you are right, my fault :-(.
the case - statement will be left if the first condition is true.

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com

#6Ron
ronljohnsonjr@gmail.com
In reply to: Andreas Kretschmer (#5)
Re: WTF with hash index?

On 11/13/2018 12:39 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 19:12 schrieb Ron:

On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:

insert into gender (gender) select case when random<0.50 then 'female'
when random<0.99 then 'male' else 'other' end from (select random() as
random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

select case when random<0.50 then 'female'
when random<0.99 then 'male'
            else 'other' end
from (select random() as random, generate_series(1,:table_size)) as
subselect;

Shouldn't that make 49% male?

you are right, my fault :-(.
the case - statement will be left if the first condition is true.

I had to read it thrice. :)  It's an example of why I like white space and
indenting...

--
Angular momentum makes the world go 'round.

In reply to: Laurenz Albe (#2)
Re: WTF with hash index?

I am just doing experiment what a type a most suitable for enumeration in PostgreSQL. And what index. And this effect looked for me very strange. There is in the PostgreSQL one another hash index. This is gin(jsonb_path_ops) for the jsob type. It is also use hash internally, but it is much better.
Example based on the previous example.

create table jender (jdoc jsonb);

insert into jender (jdoc) select ('{"gender": "'||gender||'"}')::jsonb from gender;

create index jender_hash on jender using gin (jdoc jsonb_path_ops);

=> \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------+-------+-------+-------+-------------
public | gender | table | olleg | 35 MB |
public | jender | table | olleg | 54 MB |
(2 rows)

=> \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+--------------+-------+-------+--------+---------+-------------
public | gender_btree | index | olleg | gender | 21 MB |
public | gender_hash | index | olleg | gender | 47 MB |
public | jender_hash | index | olleg | jender | 1104 kB |
(3 rows)

Very much better. What about to copy paste algorithm from gin(jsonb_path_ops) to the hash index?

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Олег Самойлов (#7)
Re: WTF with hash index?

On 2018-Nov-13, Олег Самойлов wrote:

Very much better. What about to copy paste algorithm from
gin(jsonb_path_ops) to the hash index?

You're welcome to submit patches.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In reply to: Alvaro Herrera (#8)
Re: WTF with hash index?

Ah, thanks. I am not a developer of PostgreSQL. I am a developer in PostgreSQL. :) And I see two hash indexes on the same data and one of them 43 times bigger then other, this looked like something terribly wrong. Just free idea how to considerably improve your product.

Show quoted text

13 нояб. 2018 г., в 22:37, Alvaro Herrera <alvherre@2ndquadrant.com> написал(а):

On 2018-Nov-13, Олег Самойлов wrote:

Very much better. What about to copy paste algorithm from
gin(jsonb_path_ops) to the hash index?

You're welcome to submit patches.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Stephen Frost
sfrost@snowman.net
In reply to: Олег Самойлов (#9)
Re: WTF with hash index?

Greetings,

* Олег Самойлов (splarv@ya.ru) wrote:

Ah, thanks. I am not a developer of PostgreSQL. I am a developer in PostgreSQL. :) And I see two hash indexes on the same data and one of them 43 times bigger then other, this looked like something terribly wrong. Just free idea how to considerably improve your product.

Ultimately, it might be less improvement overall and more of a
regression though, in the common cases.

This use-case isn't common for a hash index. If we changed all hash
indexes to instead be gin indexes of jsonb values, we'd end up with the
common hash index use-cases being much, much worse.

In the end, this comes down to choosing the right index for your data
and your queries, which is actually a rather difficult problem to handle
in some kind of automated way as there are lots of different trade-offs
to consider.

Thanks!

Stephen