BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

Started by PG Bug reporting formalmost 8 years ago4 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15217
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 10.4
Operating system: Ubuntu 16.04
Description:

Hello,

When running PostgreSQL (REL_10_STABLE) under valgrind I am getting the
following error messages sporadically:

==00:03:26:22.952 1623== Syscall param write(buf) points to uninitialised
byte(s)
==00:03:26:22.952 1623== at 0x4E4A4A0: __write_nocancel
(syscall-template.S:84)
==00:03:26:22.952 1623== by 0x72F0EF: FileWrite (fd.c:1773)
==00:03:26:22.952 1623== by 0x7310D1: BufFileDumpBuffer (buffile.c:324)
==00:03:26:22.952 1623== by 0x7313F1: BufFileFlush (buffile.c:466)
==00:03:26:22.952 1623== by 0x7313F1: BufFileRead (buffile.c:372)
==00:03:26:22.952 1623== by 0x894958: ltsReadBlock (logtape.c:248)
==00:03:26:22.952 1623== by 0x894A34: ltsReadFillBuffer (logtape.c:275)
==00:03:26:22.952 1623== by 0x89CB9F: mergeruns (tuplesort.c:2683)
==00:03:26:22.952 1623== by 0x89CB9F: tuplesort_performsort
(tuplesort.c:1823)
==00:03:26:22.952 1623== by 0x6414C2: ExecSort (nodeSort.c:117)
==00:03:26:22.952 1623== by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623== by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623== by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623== by 0x638AE6: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x638AE6: gather_merge_readnext
(nodeGatherMerge.c:634)
==00:03:26:22.952 1623== by 0x638E5C: gather_merge_init
(nodeGatherMerge.c:468)
==00:03:26:22.952 1623== by 0x638E5C: gather_merge_getnext
(nodeGatherMerge.c:536)
==00:03:26:22.952 1623== by 0x638E5C: ExecGatherMerge
(nodeGatherMerge.c:250)
==00:03:26:22.952 1623== Address 0xfa8a27e is 6,350 bytes inside a block of
size 8,256 client-defined
==00:03:26:22.952 1623== at 0x88FC20: palloc (mcxt.c:872)
==00:03:26:22.952 1623== by 0x7312CF: makeBufFile (buffile.c:107)
==00:03:26:22.952 1623== by 0x7312CF: BufFileCreateTemp (buffile.c:175)
==00:03:26:22.952 1623== by 0x894B14: LogicalTapeSetCreate
(logtape.c:390)
==00:03:26:22.952 1623== by 0x89B2F2: inittapes (tuplesort.c:2410)
==00:03:26:22.952 1623== by 0x89B2F2: puttuple_common
(tuplesort.c:1622)
==00:03:26:22.952 1623== by 0x89C4CE: tuplesort_putheaptuple
(tuplesort.c:1397)
==00:03:26:22.952 1623== by 0x6414A0: ExecSort (nodeSort.c:111)
==00:03:26:22.952 1623== by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623== by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623== by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623== by 0x638AE6: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x638AE6: gather_merge_readnext
(nodeGatherMerge.c:634)
==00:03:26:22.952 1623== by 0x638E5C: gather_merge_init
(nodeGatherMerge.c:468)
==00:03:26:22.952 1623== by 0x638E5C: gather_merge_getnext
(nodeGatherMerge.c:536)
==00:03:26:22.952 1623== by 0x638E5C: ExecGatherMerge
(nodeGatherMerge.c:250)
==00:03:26:22.952 1623== by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623== by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623== by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623== Uninitialised value was created by a heap
allocation
==00:03:26:22.952 1623== at 0x88FC20: palloc (mcxt.c:872)
==00:03:26:22.952 1623== by 0x894DC2: LogicalTapeWrite (logtape.c:476)
==00:03:26:22.952 1623== by 0x895D81: writetup_heap (tuplesort.c:3757)
==00:03:26:22.952 1623== by 0x89AEE9: dumpbatch (tuplesort.c:3100)
==00:03:26:22.952 1623== by 0x89AEE9: dumptuples (tuplesort.c:2972)
==00:03:26:22.952 1623== by 0x89CA0B: tuplesort_performsort
(tuplesort.c:1822)
==00:03:26:22.952 1623== by 0x6414C2: ExecSort (nodeSort.c:117)
==00:03:26:22.952 1623== by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623== by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623== by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623== by 0x638AE6: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x638AE6: gather_merge_readnext
(nodeGatherMerge.c:634)
==00:03:26:22.952 1623== by 0x638E5C: gather_merge_init
(nodeGatherMerge.c:468)
==00:03:26:22.952 1623== by 0x638E5C: gather_merge_getnext
(nodeGatherMerge.c:536)
==00:03:26:22.952 1623== by 0x638E5C: ExecGatherMerge
(nodeGatherMerge.c:250)
==00:03:26:22.952 1623== by 0x6296F8: ExecProcNode (executor.h:250)
==00:03:26:22.952 1623== by 0x6296F8: fetch_input_tuple (nodeAgg.c:695)
==00:03:26:22.952 1623== by 0x62BBBA: agg_retrieve_direct
(nodeAgg.c:2347)
==00:03:26:22.952 1623== by 0x62BBBA: ExecAgg (nodeAgg.c:2158)
==00:03:26:22.952 1623==
{
<insert_a_suppression_name_here>
Memcheck:Param
write(buf)
fun:__write_nocancel
fun:FileWrite
fun:BufFileDumpBuffer
fun:BufFileFlush
fun:BufFileRead
fun:ltsReadBlock
fun:ltsReadFillBuffer
fun:mergeruns
fun:tuplesort_performsort
fun:ExecSort
fun:ExecProcNode
fun:fetch_input_tuple
fun:agg_retrieve_direct
fun:ExecAgg
fun:ExecProcNode
fun:gather_merge_readnext
fun:gather_merge_init
fun:gather_merge_getnext
fun:ExecGatherMerge
}

Is this something that needs further investigation (and fix) or it just
looks like a false positive (and should be added to valgrind.supp)?

In reply to: PG Bug reporting form (#1)
Re: BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

On Tue, May 29, 2018 at 9:02 PM, PG Bug reporting form
<noreply@postgresql.org> wrote:

Is this something that needs further investigation (and fix) or it just
looks like a false positive (and should be added to valgrind.supp)?

This looks like the same harmless warning that we suppressed for
parallel sort in Postgres 11 (see commits 9fafa413ac6 and
de6428afe13). This happens to be a serial sort in Postgres 10, but the
fact that it's still possible there isn't surprising.

What's your work_mem setting? Can you show EXPLAIN ANALYZE output for the query?

--
Peter Geoghegan

#3Alexander Lakhin
exclusion@gmail.com
In reply to: Peter Geoghegan (#2)
Re: BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

30.05.2018 08:54, Peter Geoghegan wrote:

This looks like the same harmless warning that we suppressed for
parallel sort in Postgres 11 (see commits 9fafa413ac6 and
de6428afe13). This happens to be a serial sort in Postgres 10, but the
fact that it's still possible there isn't surprising.

What's your work_mem setting? Can you show EXPLAIN ANALYZE output for the query?

Thanks for the answer!

The work_mem setting is 4MB. To get EXPLAIN ANALYZE is rather difficult
as I get these errors while running pg_tcpds for hours, but I'll try to
catch the query. I will also try to play with work_mem and to apply the
aforementioned patches to REL_10_STABLE.

Best regards,

------
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#4Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#3)
Re: BUG #15217: Valgrind - writing uninitialised byte(s) in FileWrite

30.05.2018 09:35, Alexander Lakhin wrote:

30.05.2018 08:54, Peter Geoghegan wrote:

This looks like the same harmless warning that we suppressed for
parallel sort in Postgres 11 (see commits 9fafa413ac6 and
de6428afe13). This happens to be a serial sort in Postgres 10, but the
fact that it's still possible there isn't surprising.

What's your work_mem setting? Can you show EXPLAIN ANALYZE output for the query?

Thanks for the answer!

The work_mem setting is 4MB. To get EXPLAIN ANALYZE is rather
difficult as I get these errors while running pg_tcpds for hours, but
I'll try to catch the query. I will also try to play with work_mem and
to apply the aforementioned patches to REL_10_STABLE.

I've managed to reproduce this warning with pg_tpch.
The "problematic" query:
https://github.com/tvondra/pg_tpch/blob/master/dss/templates/10.sql
EXPLAIN ANALYZE:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=178881.96..178882.01 rows=20 width=202) (actual
time=41708.427..41708.994 rows=20 loops=1)
   ->  Sort  (cost=178881.96..179020.08 rows=55249 width=202) (actual
time=41706.775..41706.828 rows=20 loops=1)
         Sort Key: (sum((lineitem.l_extendedprice * ('1'::numeric -
lineitem.l_discount)))) DESC
         Sort Method: top-N heapsort  Memory: 34kB
         ->  Finalize GroupAggregate (cost=170313.58..177411.81
rows=55249 width=202) (actual time=27586.165..40109.437 rows=37948 loops=1)
               Group Key: customer.c_custkey, nation.n_name
               ->  Gather Merge (cost=170313.58..176260.79 rows=46040
width=202) (actual time=27570.718..33645.648 rows=37948 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Partial GroupAggregate
(cost=169313.56..169946.61 rows=23020 width=202) (actual
time=26845.343..34327.273 rows=12649 loops=3)
                           Group Key: customer.c_custkey, nation.n_name
                           ->  Sort (cost=169313.56..169371.11
rows=23020 width=182) (actual time=26801.917..27406.584 rows=38258 loops=3)
                                 Sort Key: customer.c_custkey,
nation.n_name
                                 Sort Method: external sort  Disk: 7016kB
                                 ->  Hash Join
(cost=35074.91..165598.19 rows=23020 width=182) (actual
time=8882.471..24071.174 rows=38258 loops=3)
                                       Hash Cond: (customer.c_nationkey
= nation.n_nationkey)
                                       ->  Nested Loop
(cost=35073.35..165314.18 rows=23020 width=164) (actual
time=8833.355..23146.172 rows=38258 loops=3)
                                             ->  Merge Join 
(cost=35072.92..42628.04 rows=23461 width=156) (actual
time=8823.262..13619.941 rows=19073 loops=3)
                                                   Merge Cond:
(customer.c_custkey = orders.o_custkey)
                                                   -> Parallel Index
Scan using customer_pkey on customer (cost=0.42..6742.42 rows=62500
width=152) (actual time=13.463..1101.789 rows=50000 loops=3)
                                                   -> Materialize 
(cost=35072.48..35354.01 rows=56306 width=12) (actual
time=8795.265..10510.505 rows=57105 loops=3)
->  Sort  (cost=35072.48..35213.24 rows=56306 width=12) (actual
time=8790.529..9667.322 rows=57105 loops=3)
Sort Key: orders.o_custkey
Sort Method: external merge  Disk: 1256kB
->  Bitmap Heap Scan on orders  (cost=1197.56..29667.15 rows=56306
width=12) (actual time=273.899..6733.885 rows=57218 loops=3)
Recheck Cond: ((o_orderdate >= '1993-07-01'::date) AND (o_orderdate <
'1993-10-01 00:00:00'::timestamp without time zone))
Rows Removed by Index Recheck: 430095
Heap Blocks: exact=15937 lossy=8327
->  Bitmap Index Scan on idx_orders_orderdate (cost=0.00..1183.49
rows=56306 width=0) (actual time=225.114..225.114 rows=57218 loops=3)
Index Cond: ((o_orderdate >= '1993-07-01'::date) AND (o_orderdate <
'1993-10-01 00:00:00'::timestamp without time zone))
                                             ->  Index Scan using
idx_lineitem_orderkey on lineitem  (cost=0.43..5.19 rows=4 width=20)
(actual time=0.372..0.448 rows=2 loops=57218)
                                                   Index Cond:
(l_orderkey = orders.o_orderkey)
                                                   Filter:
(l_returnflag = 'R'::bpchar)
                                                   Rows Removed by
Filter: 2
                                       ->  Hash (cost=1.25..1.25
rows=25 width=30) (actual time=31.890..31.890 rows=25 loops=3)
                                             Buckets: 1024 Batches: 1 
Memory Usage: 10kB
                                             ->  Seq Scan on nation 
(cost=0.00..1.25 rows=25 width=30) (actual time=6.582..8.616 rows=25
loops=3)
 Planning time: 689.897 ms
 Execution time: 41835.301 ms
(39 rows)

(The work_mem setting was decreased to 1MB.)
Valgrind logs of the three backends affected are attached.

After applying the changes from 9fafa413ac6 and de6428afe13 I don't see
this warning.

Best regards,

------
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

postmaster-86.logtext/x-log; name=postmaster-86.logDownload
postmaster-88.logtext/x-log; name=postmaster-88.logDownload
postmaster-89.logtext/x-log; name=postmaster-89.logDownload