Improvement for query planner? (no, not about count(*) again ;-))

Started by Tobias Völkover 5 years ago5 messagesbugsgeneral
Jump to latest
#1Tobias Völk
tobias.voelk@t-online.de
bugsgeneral

Hello Postgres-Community,

I’ve got a table games(name1 text, name2 text) with 1.3x10^9 rows consisting of two two text columns for the names of players who’ve played a game, duplicate rows are possible, there’s no primary key since this table was just intended as a temporary storage for my data until further processing.

The length of a name is usually not more than 20 characters, shorter most of the time.
I’ve asked postgres to make an unlogged newtable(name text primary key) consisting of the unqiue names and executed:

Insert into newtable(name) select name1 from games on conflict do nothing;
(and later on intended to do the same for the second column)

However after hours it still wasn’t done, used only 1 cpu core to the max and read with 5 MB/s from my fast SSD.
So I stopped it.
I’ve also tried inserting (select name1 from games union select name2 from games) but it always wanted to do it using sorting.
But either the sorting or the preperations for the sorting were again only done using 1 core to the max and reading with 5 MB/s.

Couldn’t find a fast query for my problem.

So I wrote a java-program which read the whole table at a fetchsize of about 4 million and inserted the names into a HashSet.
And surprisingly after only a few minutes the program was already 25% done o.O

My question is, why isn’t postgres nearly this fast? Why doesn’t it just create a HashSet in RAM and read full speed from the disk?
I even created a hash index but it kept using it’s primary key b-tree and then I read that hash indices somehow don’t support checking for uniqueness.

Best regards, Tobi

--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

#2Francisco Olarte
folarte@peoplecall.com
In reply to: Tobias Völk (#1)
bugsgeneral
Re: Improvement for query planner? (no, not about count(*) again ;-))

Tobias:

On Mon, Jul 20, 2020 at 12:09 PM Tobias Völk <tobias.voelk@t-online.de> wrote:
...

Insert into newtable(name) select name1 from games on conflict do nothing;
(and later on intended to do the same for the second column)
However after hours it still wasn’t done, used only 1 cpu core to the max and read with 5 MB/s from my fast SSD.
I’ve also tried inserting (select name1 from games union select name2 from games) but it always wanted to do it using sorting.
But either the sorting or the preperations for the sorting were again only done using 1 core to the max and reading with 5 MB/s.
Couldn’t find a fast query for my problem.
So I wrote a java-program which read the whole table at a fetchsize of about 4 million and inserted the names into a HashSet.
And surprisingly after only a few minutes the program was already 25% done o.O

Not surprising, 1.3E9 rows, lets say 400M for 25%, SSD, if your net is
fast enough pg should be able to send you a million rows a second to
do that.

Have you tried doing a similar thing in postgres, like "select
distinct name1", or select distinct name1 union select distinct
name2". The distinct part is the equivalent of putting everything in a
hash set.

The part of doing it using sorting, maybe you do not have enough
work_mem or other things, but it is probably the right way to do it
under the constraints you have put on the engine, but I would not
bother with that before timing. I routinely sort huge files ( not in
pg ) spilling to disk ( they are about a hundred times available RAM )
sort a few gigabytes, spill, read and merge in multi megabytes chunks,
and it only makes a constant factor (2, IIRC, ram sort is
read+sort+write, spill is read+write chunks + read chunks + write )
when testing against pure ram ( with a file which fit in ram ).

My question is, why isn’t postgres nearly this fast? Why doesn’t it just create a HashSet in RAM and read full speed from the disk?

The on conflict version may be slow because it is not optimized for
this kind of things, and is doing nothing but testing for conflict on
every row ( which may be needed ). Also, you have set a primary key
before a bulk load, which is a big no-no as pg has to build the index
as it loads.

I would try to do the equivalent of the hash set, create the table
without the PK, then try something like "select distint name1 union
select distinct name2", which is similar to build two hashsets and
collapse them, then add the primary key afterwards. Test it in steps.

I even created a hash index but it kept using it’s primary key b-tree and then I read that hash indices somehow don’t support checking for uniqueness.

Also more indexes => slower loading.

Francisco Olarte.

#3Francisco Olarte
folarte@peoplecall.com
In reply to: Francisco Olarte (#2)
bugsgeneral
Re: Improvement for query planner? (no, not about count(*) again ;-))

Tobias, 1st some etiquette stuff.

- You have replied just to me, directly. I'm CCing the list. Remember
to use reply all. Usual practice in the postgres lists is to reply to
the list and everyone involved in the thread ( doing reply all
achieves this normally ).

- It's not a biggie in this particular mail, but please do not
top-post, specially if you want to get answers on any complex
question. Trim unnecessary parts from the quoted text and reply below.
Having to scroll to a big chunk which includes even my signature is
not a thing I like.

On Mon, Jul 20, 2020 at 2:23 PM <tobias.voelk@t-online.de> wrote:

I have tried the queries
select name1 from games union select name2 from games
but not
select distinct name1 from games union
select distinct name2 from games
since it's just the same and easy for the optimizer to realize (I thought?)

There are several other things you have not done. You have not
provided any info ( statistics ) on your table, just the cardinality.
You have not provided any explain output, which is normally needed if
you really want people to help you.

On the subject, I'm not really sure both queries are identical or can
be optimized, but when one does bulk queries like that it is better to
help it. Your query is a corner case, a one of loading, and many
optimizers do not catch that things properly, as putting code for them
means increasing bug surface on a feature of dubious utility ( I,
personally, would prefer having to put your sample query manually than
risking optimizer bugs OR paying the price of the optimizer trying to
catch that on every query I send ).

Also, the optimizer may be catching many things, a explain output may
help to see what it's doing.

I've given Postgres a few Gigs (I think 4?) as work_mem, having 16 GB in total. Still it's not using them.

This does not seem correct. work_mem is per operation, it can be used
several times on a single query. See
https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
. Again, some explain/show combos may clear up things.

Seems like my mistake was creating that table with a primary key. But the query itself without inserting into anything should've been fast then, which it wasn't. I'll remember your trick of creating the primary key afterwards and will try just select name1 from games to see how it goes.

The PK stuff is bulk-loading 101. Try explain AND explain analyze of
some variants, remember to analyze your tables ( seems redundant, but
the PK & redundant hash key stuff leads me to think you are not too
experienced on postgres usage ).

Francisco Olarte.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tobias Völk (#1)
bugsgeneral
Re: Improvement for query planner? (no, not about count(*) again ;-))

=?utf-8?Q?Tobias_V=C3=B6lk?= <tobias.voelk@t-online.de> writes:

I’ve asked postgres to make an unlogged newtable(name text primary key) consisting of the unqiue names and executed:

Insert into newtable(name) select name1 from games on conflict do nothing;

ON CONFLICT is a really, really expensive way to eliminate duplicates.
It's meant to handle situations where two or more sessions might
concurrently insert duplicate keys, which means that (a) there's not
really any way to detect the situation in advance or optimize it,
and (b) we don't expect it to happen that much anyhow.

You'd be better off with something like

insert into newtable(name) select distinct name1 from games;

or

insert into newtable(name) select name1 from games group by name1;

regards, tom lane

#5Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
bugsgeneral
Re: Improvement for query planner? (no, not about count(*) again ;-))

Hi,

On 2020-07-20 13:58:19 -0400, Tom Lane wrote:

=?utf-8?Q?Tobias_V=C3=B6lk?= <tobias.voelk@t-online.de> writes:

I’ve asked postgres to make an unlogged newtable(name text primary key) consisting of the unqiue names and executed:

Insert into newtable(name) select name1 from games on conflict do nothing;

ON CONFLICT is a really, really expensive way to eliminate duplicates.
It's meant to handle situations where two or more sessions might
concurrently insert duplicate keys, which means that (a) there's not
really any way to detect the situation in advance or optimize it,
and (b) we don't expect it to happen that much anyhow.

And it's explicitly not about handling conflicts between rows inserted
in the same statement. In fact, one gets an error when using ON
CONFLICT .. DO UPDATE affects a row modified in the same statement:

CREATE TABLE conflict(key text primary key, data text not null);
INSERT INTO conflict VALUES ('a', 'a1'),('a', 'a2'),('b', 'b2') ON CONFLICT (key) DO UPDATE set data = excluded.data;
ERROR: 21000: ON CONFLICT DO UPDATE command cannot affect row a second time
HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
LOCATION: ExecOnConflictUpdate, nodeModifyTable.c:1590
Time: 1.174 ms

Greetings,

Andres Freund