OOM in hash join

Started by Konstantin Knizhnik · over 2 years ago · 5 messages
#1 Konstantin Knizhnik
knizhnik@garret.ru

Hi hackers,

Too small a value of work_mem causes memory overflow in parallel hash join
because of too large a number of batches.
Here is the plan:

explain SELECT * FROM solixschema.MIG_50GB_APR04_G1_H a
  join solixschema.MIG_50GB_APR04_G2_H b on a.seq_pk = b.seq_pk
  join solixschema.MIG_50GB_APR04_G3_H c on b.seq_pk = c.seq_pk
  join solixschema.MIG_50GB_APR04_G4_H d on c.seq_pk = d.seq_pk
  join solixschema.MIG_50GB_APR04_G5_H e on d.seq_pk = e.seq_pk;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=205209076.76..598109290.40 rows=121319744 width=63084)
   Workers Planned: 8
   ->  Parallel Hash Join  (cost=205208076.76..585976316.00 rows=15164968 width=63084)
         Hash Cond: (b.seq_pk = a.seq_pk)
         ->  Parallel Hash Join  (cost=55621683.59..251148173.17 rows=14936978 width=37851)
               Hash Cond: (b.seq_pk = c.seq_pk)
               ->  Parallel Hash Join  (cost=27797595.68..104604780.40 rows=15346430 width=25234)
                     Hash Cond: (b.seq_pk = d.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g2_h b  (cost=0.00..4021793.90 rows=15783190 width=12617)
                     ->  Parallel Hash  (cost=3911716.30..3911716.30 rows=15346430 width=12617)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g4_h d  (cost=0.00..3911716.30 rows=15346430 width=12617)
               ->  Parallel Hash  (cost=3913841.85..3913841.85 rows=15362085 width=12617)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g3_h c  (cost=0.00..3913841.85 rows=15362085 width=12617)
         ->  Parallel Hash  (cost=102628306.07..102628306.07 rows=15164968 width=25233)
               ->  Parallel Hash Join  (cost=27848049.61..102628306.07 rows=15164968 width=25233)
                     Hash Cond: (a.seq_pk = e.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g1_h a  (cost=0.00..3877018.68 rows=15164968 width=12617)
                     ->  Parallel Hash  (cost=3921510.05..3921510.05 rows=15382205 width=12616)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g5_h e  (cost=0.00..3921510.05 rows=15382205 width=12616)

work_mem is 4MB, and the leader plus two parallel workers consume about 10GB each.
There are 262144 batches:

(gdb) p *hjstate->hj_HashTable
$2 = {nbuckets = 1024, log2_nbuckets = 10, nbuckets_original = 1024,
  nbuckets_optimal = 1024, log2_nbuckets_optimal = 10,
  buckets = {unshared = 0x7fa5d5211000, shared = 0x7fa5d5211000},
  keepNulls = false, skewEnabled = false, skewBucket = 0x0,
  skewBucketLen = 0, nSkewBuckets = 0, skewBucketNums = 0x0,
  nbatch = 262144, curbatch = 86506, nbatch_original = 262144,
  nbatch_outstart = 262144, growEnabled = true, totalTuples = 122600000,
  partialTuples = 61136408, skewTuples = 0, innerBatchFile = 0x0,
  outerBatchFile = 0x0, outer_hashfunctions = 0x55ce086a3288,
  inner_hashfunctions = 0x55ce086a32d8, hashStrict = 0x55ce086a3328,
  collations = 0x55ce086a3340, spaceUsed = 0, spaceAllowed = 8388608,
  spacePeak = 204800, spaceUsedSkew = 0, spaceAllowedSkew = 167772,
  hashCxt = 0x55ce086a3170, batchCxt = 0x55ce086a5180, chunks = 0x0,
  current_chunk = 0x7fa5d5283000, area = 0x55ce085b56d8,
  parallel_state = 0x7fa5ee993520, batches = 0x7fa5d3ff8048,
  current_chunk_shared = 1099512193024}

The biggest memory contexts are:

ExecutorState: 1362623568
   HashTableContext: 102760280
    HashBatchContext: 7968
   HashTableContext: 178257752
    HashBatchContext: 7968
   HashTableContext: 5306745728
    HashBatchContext: 7968

There is still some gap between the size reported by the memory context dump
and the actual size of the backend.
But it seems obvious that, in trying to fit into work_mem, the shared
tuplestore creates so many batches that they consume much more memory than
work_mem.
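
For a rough sense of scale (illustrative arithmetic only, assuming each batch
keeps roughly one BLCKSZ-sized buffer per participant for each of the inner
and outer batch files; the real per-batch cost may differ):

/* Back-of-the-envelope estimate only; the constants are assumptions. */
#include <stdio.h>

int
main(void)
{
    const long long nbatch        = 262144;  /* from the gdb dump above */
    const long long buf_per_batch = 8192;    /* ~one BLCKSZ buffer, assumed */
    const long long filesets      = 2;       /* inner + outer batch files */

    long long bytes = nbatch * buf_per_batch * filesets;
    printf("~%lld MB per participant just for batch buffers\n",
           bytes / (1024 * 1024));           /* prints ~4096 MB */
    return 0;
}

That alone is three orders of magnitude over the 4MB work_mem, which matches
the shape of the numbers above.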

#2 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: OOM in hash join

On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik <knizhnik@garret.ru> wrote:

Hi hackers,

Too small a value of work_mem causes memory overflow in parallel hash join
because of too large a number of batches.
Here is the plan:

[...]

There is still some gap between the size reported by the memory context dump
and the actual size of the backend.
But it seems obvious that, in trying to fit into work_mem, the shared
tuplestore creates so many batches that they consume much more memory than
work_mem.

The same issue [0] was reported a few weeks ago, with the same
diagnosis here [1]. I think it's being worked on over there.

Kind regards,

Matthias van de Meent

[0]: /messages/by-id/20230228190643.1e368315@karst
[1]: /messages/by-id/3013398b-316c-638f-2a73-3783e8e2ef02@enterprisedb.com

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: OOM in hash join

On Fri, Apr 14, 2023 at 10:59 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:

Too small a value of work_mem causes memory overflow in parallel hash join
because of too large a number of batches.

Yeah. Not only in parallel hash join, but in any hash join
(admittedly parallel hash join has higher per-batch overheads; that is
perhaps something we could improve). That's why we tried to invent an
alternative strategy where you loop over batches N times, instead of
making more batches, at some point:

/messages/by-id/CA+hUKGKWWmf=WELLG=aUGbcugRaSQbtm0tKYiBut-B2rVKX63g@mail.gmail.com
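
Roughly, the idea there is: once a batch's inner side no longer fits, load it
in fragments and rescan that batch's outer side once per fragment, instead of
splitting into ever more batches. A self-contained toy sketch of the concept
(integer keys and a plain linear probe stand in for the real hash table; this
is not the actual patch):

/* Toy illustration of looping over one oversized batch in several passes. */
#include <stdio.h>

#define INNER_N  8
#define OUTER_N  6
#define FRAGMENT 3              /* pretend only 3 inner tuples fit in memory */

int
main(void)
{
    int inner[INNER_N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int outer[OUTER_N] = {2, 4, 4, 7, 9, 8};

    /* Load the inner side fragment by fragment; rescan the whole outer
     * batch once per fragment instead of creating more batches. */
    for (int start = 0; start < INNER_N; start += FRAGMENT)
    {
        int end = start + FRAGMENT < INNER_N ? start + FRAGMENT : INNER_N;

        for (int o = 0; o < OUTER_N; o++)       /* rescan outer batch */
            for (int i = start; i < end; i++)   /* probe loaded fragment */
                if (outer[o] == inner[i])
                    printf("match: %d\n", outer[o]);
    }
    return 0;
}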

That thread starts out talking about 'extreme skew' etc but the more
general problem is that, at some point, even with perfectly evenly
distributed keys, adding more batches requires more memory than you
can save by doing so. Sure, it's a problem that we don't account for
that memory properly, as complained about here:

/messages/by-id/20190504003414.bulcbnge3rhwhcsh@development
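
To make that crossover concrete (again illustrative arithmetic with assumed
constants, not measurements): if each extra batch costs a roughly fixed amount
of buffer memory, total memory behaves like inner_size/nbatch plus nbatch
times that per-batch cost, and past a certain nbatch the second term dominates:

/* Toy model: memory(n) ~= inner_bytes/n + n * per_batch, minimal near
 * n = sqrt(inner_bytes / per_batch). Constants are assumptions.
 * Compile with -lm. */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double inner_bytes = 50e9;        /* ~50 GB inner side, assumed */
    double per_batch   = 16 * 1024;   /* assumed per-batch buffer cost */

    printf("model is minimal around nbatch ~= %.0f\n",
           sqrt(inner_bytes / per_batch));   /* roughly 1747 */
    return 0;
}

Past that point, every further doubling of nbatch costs more in buffers than
it saves in hash table size, which is why 262144 batches ends so badly here.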

If you did have perfect prediction of every byte you will need, maybe
you could say, oh, well, we just don't have enough memory for a hash
join, so let's do a sort/merge instead. But you can't, because (1)
some types aren't merge-joinable, and (2) in reality sometimes you've
already started the hash join due to imperfect stats so it's too late
to change strategies.

#4 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Matthias van de Meent (#2)
Re: OOM in hash join

On Fri, 14 Apr 2023 13:21:05 +0200
Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik <knizhnik@garret.ru> wrote:

Hi hackers,

Too small a value of work_mem causes memory overflow in parallel hash join
because of too large a number of batches.
Here is the plan:

[...]

There is still some gap between the size reported by the memory context dump
and the actual size of the backend.
But it seems obvious that, in trying to fit into work_mem, the shared
tuplestore creates so many batches that they consume much more memory than
work_mem.

Indeed. The memory consumed by batches is not accounted for, and the
consumption reported in explain analyze is wrong.
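
Conceptually (a sketch of the accounting idea only, not the referenced
patchset; the constant and helper below are hypothetical), the grow decision
could charge the per-batch buffers against the same budget the hash table
itself must respect:

/* Hypothetical illustration, not PostgreSQL code: refuse to double nbatch
 * once the batch buffers alone would blow the memory budget. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define PER_BATCH_BUFFER_BYTES (8 * 1024)   /* assumed rough per-batch cost */

static bool
worth_doubling_nbatch(size_t space_allowed, int nbatch)
{
    /* Inner and outer batch files, after doubling. */
    size_t buffer_bytes = (size_t) nbatch * 2 * 2 * PER_BATCH_BUFFER_BYTES;

    return buffer_bytes < space_allowed;
}

int
main(void)
{
    /* With the 8MB spaceAllowed seen in the gdb dump, the answer already
     * flips to "no" at around 256 existing batches. */
    printf("grow past 262144 batches? %s\n",
           worth_doubling_nbatch(8 * 1024 * 1024, 262144) ? "yes" : "no");
    return 0;
}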

Would you be able to test the latest patchset posted [1]? This does not fix
the work_mem overflow, but it helps to keep the number of batches
balanced and acceptable. Any feedback, comment or review would be useful.

[1]: /messages/by-id/20230408020119.32a0841b@karst

Regards,

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#4)
Re: OOM in hash join

On Fri, Apr 14, 2023 at 11:43 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

Would you be able to test the latest patchset posted [1]? This does not fix
the work_mem overflow, but it helps to keep the number of batches
balanced and acceptable. Any feedback, comment or review would be useful.

[1] /messages/by-id/20230408020119.32a0841b@karst

Hi Jehan-Guillaume. I hadn't paid attention to that thread before
probably due to timing and the subject and erm ETOOMANYTHREADS.
Thanks for all the work you've done to study this area and also review
and summarise the previous writing/patches/ideas.