any suggestions to detect memory corruption

Started by Alex over 6 years ago, 6 messages
#1 Alex
zhihui.fan1213@gmail.com

I get the following log output at random, and I am not sure which commit
caused it. I spent a day on it without success.

2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index
info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
2019-05-08 21:37:46.692 CST [60110] WARNING: idx: 2 problem in alloc set
index info: bad single-chunk 0x2a33a78 in block 0x2a33a18, chsize: 1408,
chunkLimit: 1024, chunkHeaderSize: 24, block_used: 768 request size: 2481
2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index
info: found inconsistent memory block 0x2a33a18

It looks like memory managed by the "index info" memory context is being
overwritten by some other, faulty code.

I didn't change any AllocSetXXX-related code, so I think I am just using
it incorrectly somewhere.

Thanks

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex (#1)
Re: any suggestions to detect memory corruption

Alex <zhihui.fan1213@gmail.com> writes:

> I get the following log output at random, and I am not sure which commit
> caused it.
>
> 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index
> info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18

I've had success in finding memory stomp causes fairly quickly by setting
a hardware watchpoint in gdb on the affected location. Then you just let
it run to see when the value changes, and check whether that's a "legit"
or "not legit" modification point.

The hard part of that, of course, is to know in advance where the affected
location is. You may be able to make things sufficiently repeatable by
doing the problem query in a fresh session each time.
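
As an illustration that is not part of the original message, here is a
minimal C sketch of the kind of memory stomp a hardware watchpoint can
catch. Everything in it is hypothetical: the two-chunk block layout and the
names `chunk_header`, `buggy_fill` and `stomp_demo` are invented for the
example and are not PostgreSQL's real AllocSet structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy layout, NOT PostgreSQL's real AllocSet: one block holding two chunks,
 * each chunk being a size_t "requested size" header followed by a payload. */
#define HDR_SIZE   sizeof(size_t)
#define DATA_SIZE  16
#define CHUNK_SIZE (HDR_SIZE + DATA_SIZE)

static unsigned char block[2 * CHUNK_SIZE];

static unsigned char *chunk_header(int i) { return block + (size_t) i * CHUNK_SIZE; }
static unsigned char *chunk_data(int i)   { return chunk_header(i) + HDR_SIZE; }

static void set_requested(int i, size_t n) { memcpy(chunk_header(i), &n, sizeof n); }

static size_t get_requested(int i)
{
    size_t n;
    memcpy(&n, chunk_header(i), sizeof n);
    return n;
}

/* The bug: fills chunk 0's payload without checking len against DATA_SIZE. */
static void buggy_fill(size_t len)
{
    assert(len <= sizeof block - HDR_SIZE);  /* keep the demo inside the toy block */
    memset(chunk_data(0), 0xFF, len);
}

/* Returns 1 if chunk 1's header was stomped, 0 otherwise. */
int stomp_demo(void)
{
    unsigned char *victim = chunk_header(1); /* the "affected location" */

    set_requested(0, 8);
    set_requested(1, 8);

    /* In gdb, stop here and run:  watch -l *(size_t *) victim
     * then "continue"; gdb should stop at the write that changes the header. */
    buggy_fill(DATA_SIZE + HDR_SIZE);        /* overruns payload 0 into header 1 */

    return victim == chunk_header(1) && get_requested(1) != 8;
}
```

To try it under gdb, add a `main` that calls `stomp_demo()`, compile with
`-g -O0`, and set a breakpoint on `stomp_demo`. When the watchpoint fires,
the backtrace should lead back to `buggy_fill`, which is the "not legit"
modification point.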

regards, tom lane

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: any suggestions to detect memory corruption

On Wed, May 8, 2019 at 10:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alex <zhihui.fan1213@gmail.com> writes:
> > I get the following log output at random, and I am not sure which commit
> > caused it.
> >
> > 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index
> > info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
>
> I've had success in finding memory stomp causes fairly quickly by setting
> a hardware watchpoint in gdb on the affected location. Then you just let
> it run to see when the value changes, and check whether that's a "legit"
> or "not legit" modification point.
>
> The hard part of that, of course, is to know in advance where the affected
> location is. You may be able to make things sufficiently repeatable by
> doing the problem query in a fresh session each time.

valgrind might also be a possibility, although that has a lot of overhead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Alex
zhihui.fan1213@gmail.com
In reply to: Robert Haas (#3)
Re: any suggestions to detect memory corruption

Thank you, Tom and Robert! I tried valgrind, and it looks like it helped
me fix the issue.

Someone added code during backend initialization that calls palloc while
CurrentMemoryContext is still PostmasterContext. At the end of backend
initialization, PostmasterContext is deleted, and then the error happens.
It happens only at random because some if clauses before the palloc may
skip it.

I still can't explain why PostmasterContext can sometimes affect the
"index info" MemoryContext, but I can no longer reproduce the problem
(before the fix, it happened in about 30% of cases).

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alex (#4)
Re: any suggestions to detect memory corruption

Alex <zhihui.fan1213@gmail.com> writes:

> Someone added code during backend initialization that calls palloc while
> CurrentMemoryContext is still PostmasterContext. At the end of backend
> initialization, PostmasterContext is deleted, and then the error happens.
> It happens only at random because some if clauses before the palloc may
> skip it.
>
> I still can't explain why PostmasterContext can sometimes affect the
> "index info" MemoryContext, but I can no longer reproduce the problem
> (before the fix, it happened in about 30% of cases).

Well, once the context is deleted, that memory is available for reuse.
Everything will seem fine until it *is* reused, and then boom!
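
As an illustration that is not part of the original message, the
"fine until it is reused" behaviour can be sketched with a toy bump
allocator. `ToyArena`, `toy_alloc`, `toy_reset` and `reuse_demo` are
hypothetical names invented here; this is not how PostgreSQL memory
contexts are implemented, only a model of a stale pointer outliving the
context it came from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a memory context: a fixed buffer with a bump pointer. */
typedef struct ToyArena
{
    unsigned char buf[64];
    size_t        used;
} ToyArena;

static void *toy_alloc(ToyArena *a, size_t n)
{
    assert(n <= sizeof a->buf - a->used);
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

/* "Deleting" the context: the storage becomes reusable, bytes are untouched. */
static void toy_reset(ToyArena *a) { a->used = 0; }

/* Returns 1 if the stale pointer looked fine at first and later corrupted
 * the new owner's data. */
int reuse_demo(void)
{
    static ToyArena arena;
    arena.used = 0;

    char *stale = toy_alloc(&arena, 16);   /* allocated in the "wrong" context */
    strcpy(stale, "startup data");
    toy_reset(&arena);                     /* context goes away; stale dangles */

    /* Phase 1: nobody has reused the space yet, so the old data is intact. */
    int looked_fine = strcmp(stale, "startup data") == 0;

    /* Phase 2: an unrelated owner is handed the same bytes. */
    char *owner = toy_alloc(&arena, 16);
    strcpy(owner, "index info");

    /* A late write through the stale pointer lands in the new owner's data. */
    strcpy(stale, "oops");
    int corrupted = strcmp(owner, "index info") != 0;

    return looked_fine && corrupted;
}
```

In this model the victim is whoever happens to receive the recycled bytes,
which is one plausible reading of why an unrelated context such as
"index info" reported the damage.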

The error would have been a lot more obvious if you'd enabled
MEMORY_CONTEXT_CHECKING, which would overwrite freed data with garbage.
That is normally turned on in --enable-cassert builds. Anybody who's been
hacking Postgres for more than a week does backend code development in
--enable-cassert mode as a matter of course; it turns on a *lot* of
helpful cross-checks.
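
As an illustration that is not part of the original message, the idea of
overwriting freed data with garbage can be sketched with the same kind of
toy arena. The names `ToyArena`, `toy_reset_checked`, `wipe_demo` and the
fill value `WIPE_BYTE` are assumptions made for this example and do not
describe the actual MEMORY_CONTEXT_CHECKING code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WIPE_BYTE 0x7F   /* illustrative fill value chosen for this sketch */

/* Toy stand-in for a memory context: a fixed buffer with a bump pointer. */
typedef struct ToyArena
{
    unsigned char buf[64];
    size_t        used;
} ToyArena;

static void *toy_alloc(ToyArena *a, size_t n)
{
    assert(n <= sizeof a->buf - a->used);
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

/* Checked variant of "delete": wipe everything handed out before releasing it. */
static void toy_reset_checked(ToyArena *a)
{
    memset(a->buf, WIPE_BYTE, a->used);
    a->used = 0;
}

/* Returns 1 if a read through the stale pointer sees garbage right away. */
int wipe_demo(void)
{
    static ToyArena arena;
    arena.used = 0;

    char *stale = toy_alloc(&arena, 16);
    strcpy(stale, "startup data");
    toy_reset_checked(&arena);             /* stale now points at wiped bytes */

    return (unsigned char) stale[0] == WIPE_BYTE
        && memcmp(stale, "startup data", 12) != 0;
}
```

With the wipe in place, the first use of the stale pointer after the reset
already misbehaves, instead of the failure depending on whether and when
the space gets reused.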

regards, tom lane

#6 Alex
zhihui.fan1213@gmail.com
In reply to: Tom Lane (#5)
Re: any suggestions to detect memory corruption

On Thu, May 9, 2019 at 9:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alex <zhihui.fan1213@gmail.com> writes:
> > Someone added code during backend initialization that calls palloc while
> > CurrentMemoryContext is still PostmasterContext. At the end of backend
> > initialization, PostmasterContext is deleted, and then the error happens.
> > It happens only at random because some if clauses before the palloc may
> > skip it.
> >
> > I still can't explain why PostmasterContext can sometimes affect the
> > "index info" MemoryContext, but I can no longer reproduce the problem
> > (before the fix, it happened in about 30% of cases).
>
> Well, once the context is deleted, that memory is available for reuse.
> Everything will seem fine until it *is* reused, and then boom!
>
> The error would have been a lot more obvious if you'd enabled
> MEMORY_CONTEXT_CHECKING, which would overwrite freed data with garbage.

Thanks! I didn't know that "once the context is deleted, that memory is
available for reuse. Everything will seem fine until it *is* reused".
I have now enabled --enable-cassert.
