BUG #2246: Bad malloc interactions: ecpg, openssl

Started by Andrew Klosterman about 20 years ago, 33 messages, bugs
#1Andrew Klosterman
andrew5@ece.cmu.edu

The following bug has been logged online:

Bug reference: 2246
Logged by: Andy Klosterman
Email address: andrew5@ece.cmu.edu
PostgreSQL version: 8.1.0
Operating system: Debian testing: Linux nc3 2.4.27-2-386 #1 Wed Nov 30
21:38:51 JST 2005 i686 GNU/Linux
Description: Bad malloc interactions: ecpg, openssl
Details:

Before going into a full description and figuring out some example code for
this situation, I'm fishing for interest in tracking it down and fixing
it (or not).

On a program that I (pre-)compile with ecpg and connect to a remote Postgres
instance over an SSL connection (as set up in pg_hba.conf with appropriate
certificates installed) my application prematurely terminates with the
following error:
*** glibc detected *** corrupted double-linked list: 0x0807c830 ***
Abort.

(Without an SSL connection (as set in pg_hba.conf) the program executes just
fine. This leads me to cast suspicion on SSL libraries.)

The back trace from gdb looks like this (which doesn't appear to be too
informative, but looks like an exception stack):
#0 0x401bc851 in kill () from /lib/libc.so.6
#1 0x4014a309 in pthread_kill () from /lib/libpthread.so.0
#2 0x4014a6c0 in raise () from /lib/libpthread.so.0
#3 0x401bc606 in raise () from /lib/libc.so.6
#4 0x401bd971 in abort () from /lib/libc.so.6
#5 0x401ef930 in __fsetlocking () from /lib/libc.so.6
#6 0x401f52b9 in malloc_usable_size () from /lib/libc.so.6
#7 0x401f5395 in malloc_usable_size () from /lib/libc.so.6
#8 0x401f5a43 in malloc_trim () from /lib/libc.so.6
#9 0x401f5d51 in free () from /lib/libc.so.6
#10 0x4052ce6c in zcfree () from /usr/lib/libz.so.1
#11 0x4052f83f in inflateEnd () from /usr/lib/libz.so.1
#12 0x4040f262 in COMP_rle () from
/usr/lib/i686/cmov/libcrypto.so.0.9.8
#13 0x0807e680 in ?? ()
#14 0x00000000 in ?? ()

After a bit of digging around online, I discovered the MALLOC_CHECK_
environment variable and how it changes the behavior of malloc (man 3
malloc). The above back trace was without MALLOC_CHECK_ in the environment
(e.g., unsetenv MALLOC_CHECK_).

Running with MALLOC_CHECK_ equal to 2 or 1 allows my program to run to
completion.

With MALLOC_CHECK_ set to 0 (which is supposed to ignore corruption), I get
a segfault. Running inside gdb gets me the following back trace:
#0 0x403d6f73 in ASN1_template_free ()
from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#1 0x403d6e0d in ASN1_primitive_free ()
from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#2 0x403d7023 in ASN1_item_free () from
/usr/lib/i686/cmov/libcrypto.so.0.9.8
#3 0x403d0c07 in X509_CERT_AUX_free ()
from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#4 0x403d077a in X509_CINF_free () from
/usr/lib/i686/cmov/libcrypto.so.0.9.8
#5 0x403d6e35 in ASN1_primitive_free ()
from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#6 0x403d7023 in ASN1_item_free () from
/usr/lib/i686/cmov/libcrypto.so.0.9.8
#7 0x403d0927 in X509_free () from
/usr/lib/i686/cmov/libcrypto.so.0.9.8
#8 0x402d16f3 in pqsecure_destroy () from /usr/lib/libpq.so.4
#9 0x402c387a in PQconninfoFree () from /usr/lib/libpq.so.4
#10 0x402c39c3 in PQfinish () from /usr/lib/libpq.so.4
#11 0x4002f41b in ECPGget_connection () from /usr/lib/libecpg.so.5
#12 0x40030223 in ECPGdisconnect () from /usr/lib/libecpg.so.5
#13 0x0804a113 in DBDisconnect (arg_connection=0x8054faf
"client_correctness")
at client_test.pgcc:215
#14 0x0804a64e in DoCorrectnessChecks () at client_test.pgcc:278
#15 0x0804aaa1 in main (argc=7, argv=0xbffffa84) at
client_test.pgcc:523

PURE SPECULATION: It looks like there is either trouble in the interaction
between Postgres and the SSL library or just a bit of trouble within the SSL
library.
SPECULATION: Another possibility is that I misunderstand some aspect of
multi-threaded interactions with Postgres (I open uniquely named connections
to the DB for each thread of my test program). Maybe I need to have a
"lock" around the code that makes DB connections and make sure that only one
happens at a time (might be better handled within Postgres/SSL if that is
the case).

PROCEEDING FURTHER: If there is any desire on the part of any developers to
pursue this further, I'm open. As things stand right now, I have
workarounds:
1. Don't use an SSL connection to the DB.
2. Do a "setenv MALLOC_CHECK_ 1" (or 2) and it works.
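For context on workaround 2: as far as I recall from the glibc malloc(3) man page of that era, setting MALLOC_CHECK_ switches malloc to a special, less efficient implementation that tolerates simple errors such as a double free() or a one-byte overrun. That would be consistent with the program running to completion under values 1 and 2. The sketch below only spells out the documented meaning of each value; `malloc_check_meaning` is a hypothetical helper name and is not part of glibc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Map a MALLOC_CHECK_ value to the behaviour described in the glibc
 * malloc(3) man page of that era (values 0, 1 and 2 only).
 * `value` is the raw environment string, or NULL when unset. */
const char *malloc_check_meaning(const char *value)
{
    if (value == NULL)
        return "unset: default allocator";
    if (strcmp(value, "0") == 0)
        return "checking allocator: detected corruption is silently ignored";
    if (strcmp(value, "1") == 0)
        return "checking allocator: a diagnostic is printed on stderr";
    if (strcmp(value, "2") == 0)
        return "checking allocator: abort() is called immediately";
    return "value not covered by this sketch";
}

/* Convenience wrapper that inspects the current process environment. */
const char *current_malloc_check_meaning(void)
{
    const char *meaning = malloc_check_meaning(getenv("MALLOC_CHECK_"));

    assert(meaning != NULL); /* the mapping above never returns NULL */
    return meaning;
}
```

Printing `current_malloc_check_meaning()` at program start is one way to record which allocator mode a given test run used.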

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Klosterman (#1)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

Andy Klosterman wrote:

Before going into a full description and figuring out some example code for
this situation, I'm fishing for interest in tracking it down and fixing
it (or not).

Whenever there is a bug that causes a crash, there is interest in
tracking it down and fixing it. Please do provide a test case.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Klosterman (#1)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

"Andy Klosterman" <andrew5@ece.cmu.edu> writes:

SPECULATION: Another possibility is that I misunderstand some aspect of
multi-threaded interactions with Postgres (I open uniquely named connections
to the DB for each thread of my test program). Maybe I need to have a
"lock" around the code that makes DB connections and make sure that only one
happens at a time (might be better handled within Postgres/SSL if that is
the case).

There could be some re-entrancy problem in the SSL connection startup
code --- if you add such a lock, does it get more reliable? Also, did
you remember to build PG with --enable-thread-safety ?

regards, tom lane

#4Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Tom Lane (#3)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Wed, 8 Feb 2006, Tom Lane wrote:

"Andy Klosterman" <andrew5@ece.cmu.edu> writes:

SPECULATION: Another possibility is that I misunderstand some aspect of
multi-threaded interactions with Postgres (I open uniquely named connections
to the DB for each thread of my test program). Maybe I need to have a
"lock" around the code that makes DB connections and make sure that only one
happens at a time (might be better handled within Postgres/SSL if that is
the case).

There could be some re-entrancy problem in the SSL connection startup
code --- if you add such a lock, does it get more reliable? Also, did
you remember to build PG with --enable-thread-safety ?

regards, tom lane

(I'm back after a bit of an illness. Much better now!)

I threw in a pthread mutex around the code making the database connections
for each of my threads. The problem is still there ("corrupted
double-linked list").
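A minimal sketch of that kind of lock, assuming the connection logic can be wrapped in a callback. The names `serialized_connect` and `fake_connect` are hypothetical; in the real program the callback would contain the EXEC SQL CONNECT (or libpq) call, which is not reproduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Callback type for whatever code actually opens the connection. */
typedef int (*connect_fn)(void *arg);

/* One process-wide lock shared by every thread that connects. */
static pthread_mutex_t connect_lock = PTHREAD_MUTEX_INITIALIZER;

/* Run do_connect(arg) while holding the lock, so that at most one
 * thread is inside the connection startup code at a time.
 * Returns the callback's result, or -1 if locking fails. */
int serialized_connect(connect_fn do_connect, void *arg)
{
    int rc;

    assert(do_connect != NULL);
    if (pthread_mutex_lock(&connect_lock) != 0)
        return -1;
    rc = do_connect(arg);
    pthread_mutex_unlock(&connect_lock);
    return rc;
}

/* Hypothetical stand-in for the real connect code: counts the calls. */
int fake_connect(void *arg)
{
    int *calls = arg;

    (*calls)++;
    return 0;
}
```

Each worker thread would call `serialized_connect()` instead of connecting directly. Since the failure persists with such a lock and even with a single thread, a race in connection startup looks less likely as the cause.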

Even tuning things down and instructing my code to only run a single
pthread manifests the problem over an SSL connection. Everything is just
fine without SSL. Other code I've written works just fine with (and
without) threads connecting to the database with (and without) SSL.
Tracking down exactly what's tickling the problem in this case could be
tricky...

I'm using the pre-built debian testing packages, not self-compiled code,
for my postgres installation. From the information I can gather from the
debian build logs (http://buildd.debian.org/build.php), everything was
configured and built with threads enabled.

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Klosterman (#4)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

Andrew Klosterman <andrew5@ece.cmu.edu> writes:

I threw in a pthread mutex around the code making the database connections
for each of my threads. The problem is still there ("corrupted
double-linked list").

Even tuning things down and instructing my code to only run a single
pthread manifests the problem over an SSL connection.

Hmm. Based on that, the problem is starting to smell more like a
garden-variety memory clobber, for instance malloc'ing a chunk smaller
than the data that's later stuffed into it. It might be worth running
the program under something like ElectricFence, which will catch the
offender on-the-spot rather than only later when corruption of malloc's
private data structures is detected.
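To illustrate the kind of clobber meant here (this is a generic example, not code from libpq, ecpg, or OpenSSL): the usual form is a buffer sized one byte short of what is later written into it. The hypothetical helper `copy_string` below shows the correct sizing, with the faulty variant noted in a comment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap copy of src, or NULL if allocation fails.
 * The caller frees the result. */
char *copy_string(const char *src)
{
    size_t len;
    char *dst;

    assert(src != NULL);
    len = strlen(src);

    /* The classic clobber is malloc(len): the terminating NUL then
     * lands one byte past the chunk, in malloc's own bookkeeping,
     * and the damage is only noticed on a later malloc() or free(). */
    dst = malloc(len + 1);
    if (dst == NULL)
        return NULL;

    memcpy(dst, src, len + 1); /* copies the NUL terminator too */
    return dst;
}
```

Tools such as ElectricFence place the allocation next to an inaccessible page, so the out-of-bounds write faults immediately instead of surfacing later as a "corrupted double-linked list".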

Looking back at your original message, I wonder if it could be the
combination of ecpg and SSL that triggers it? I'd have thought that
libpq/SSL alone would be pretty well wrung out, but ecpg is not so
widely used.

BTW, you did say this was i386 right? If it were a 64-bit architecture,
I'd be about ready to bet money on the wrong-malloc-size-calculation
theory.
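For readers unfamiliar with that theory, a generic sketch of the mistake (not taken from any of the libraries involved): sizing an array of pointers with `sizeof(int)` happens to work on i386, where both are 4 bytes, but under-allocates on typical 64-bit platforms with 8-byte pointers. The helper name `alloc_string_table` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a zero-filled table of `count` string pointers.
 * Returns NULL for count == 0 or on allocation failure. */
char **alloc_string_table(size_t count)
{
    char **table;

    if (count == 0)
        return NULL;

    /* Wrong: malloc(count * sizeof(int)). Tying the element size to
     * the pointer itself keeps the size right on every architecture,
     * and calloc() also checks the multiplication for overflow. */
    table = calloc(count, sizeof *table);
    return table;
}
```

Because the reporter's machine is i686, this particular class of bug would not show up there, which is why it is offered only as a would-be explanation for 64-bit systems.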

Tracking down exactly what's tickling the problem in this case could be
tricky...

Yeah :-(. If you aren't able to narrow it further by yourself, please
try to put together a self-contained test case.

regards, tom lane

#6Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Tom Lane (#5)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Mon, 13 Feb 2006, Tom Lane wrote:

Andrew Klosterman <andrew5@ece.cmu.edu> writes:

I threw in a pthread mutex around the code making the database connections
for each of my threads. The problem is still there ("corrupted
double-linked list").

Even tuning things down and instructing my code to only run a single
pthread manifests the problem over an SSL connection.

Hmm. Based on that, the problem is starting to smell more like a
garden-variety memory clobber, for instance malloc'ing a chunk smaller
than the data that's later stuffed into it. It might be worth running
the program under something like ElectricFence, which will catch the
offender on-the-spot rather than only later when corruption of malloc's
private data structures is detected.

Looking back at your original message, I wonder if it could be the
combination of ecpg and SSL that triggers it? I'd have thought that
libpq/SSL alone would be pretty well wrung out, but ecpg is not so
widely used.

BTW, you did say this was i386 right? If it were a 64-bit architecture,
I'd be about ready to bet money on the wrong-malloc-size-calculation
theory.

Tracking down exactly what's tickling the problem in this case could be
tricky...

Yeah :-(. If you aren't able to narrow it further by yourself, please
try to put together a self-contained test case.

regards, tom lane

I just did the "electric fence" thing for you and this is what I get in
gdb...

Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.

ElectricFence Aborting: Allocating 0 bytes, probably a bug.

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 16384 (LWP 24753)]
0x401c3851 in kill () from /lib/libc.so.6
(gdb) bt
#0 0x401c3851 in kill () from /lib/libc.so.6
#1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40139823 in memalign () from /usr/lib/libefence.so.0
#3 0x401399ad in malloc () from /usr/lib/libefence.so.0
#4 0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7 0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8 0x00000000 in ?? ()

Looks like something fishy going on between libpq and libkrb5. I'm
especially suspicious since I'm not using kerberos for authentication at
all.

I am developing on i386 (more or less).
# uname -m
i686

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Klosterman (#6)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

Andrew Klosterman <andrew5@ece.cmu.edu> writes:

(gdb) bt
#0 0x401c3851 in kill () from /lib/libc.so.6
#1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40139823 in memalign () from /usr/lib/libefence.so.0
#3 0x401399ad in malloc () from /usr/lib/libefence.so.0
#4 0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7 0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8 0x00000000 in ?? ()

Any chance of doing this with debug symbols? libpq does not call
krb5_set_default_tgs_ktypes directly, so I don't think I believe the
above backtrace. gdb is easily misled without debug symbols :-(

I'm not sure if Debian does things the way Red Hat does, but on RH
there are separate "debuginfo" RPMs corresponding to each regular
RPM --- if you install the ones matching your libpq and libkrb5
RPMs you should be able to get better info.

regards, tom lane

#8Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#7)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Andrew Klosterman <andrew5@ece.cmu.edu> writes:

(gdb) bt
#0 0x401c3851 in kill () from /lib/libc.so.6
#1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40139823 in memalign () from /usr/lib/libefence.so.0
#3 0x401399ad in malloc () from /usr/lib/libefence.so.0
#4 0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7 0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8 0x00000000 in ?? ()

Any chance of doing this with debug symbols? libpq does not call
krb5_set_default_tgs_ktypes directly, so I don't think I believe the
above backtrace. gdb is easily misled without debug symbols :-(

Hrmpf, I missed this bug-on-Debian report. I'll go check the archive
for the rest.

I'm not sure if Debian does things the way Red Hat does, but on RH
there are separate "debuginfo" RPMs corresponding to each regular
RPM --- if you install the ones matching your libpq and libkrb5
RPMs you should be able to get better info.

We do have debugging .debs for some things. We don't have them for
everything and unfortunately we don't yet have them for Postgres. I'll
talk to Martin about building some though so that in the future it's
easier to debug these problems.

Thanks,

Stephen

#9Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Tom Lane (#7)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Mon, 13 Feb 2006, Tom Lane wrote:

Andrew Klosterman <andrew5@ece.cmu.edu> writes:

(gdb) bt
#0 0x401c3851 in kill () from /lib/libc.so.6
#1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40139823 in memalign () from /usr/lib/libefence.so.0
#3 0x401399ad in malloc () from /usr/lib/libefence.so.0
#4 0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7 0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8 0x00000000 in ?? ()

Any chance of doing this with debug symbols? libpq does not call
krb5_set_default_tgs_ktypes directly, so I don't think I believe the
above backtrace. gdb is easily misled without debug symbols :-(

I'm not sure if Debian does things the way Red Hat does, but on RH
there are separate "debuginfo" RPMs corresponding to each regular
RPM --- if you install the ones matching your libpq and libkrb5
RPMs you should be able to get better info.

regards, tom lane

I thought about that and did some quick checks of how to get debug symbols
in libraries on Debian. I didn't come up with anything right away. I'll
poke around and see what I can come up with.

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#8)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

Stephen Frost <sfrost@snowman.net> writes:

We do have debugging .debs for some things. We don't have them for
everything and unfortunately we don't yet have them for Postgres. I'll
talk to Martin about building some though so that in the future it's
easier to debug these problems.

Hmm. Andrew, it seems your choices are to rebuild the relevant
libraries from source, or to concentrate on developing a test case
that other people can try.

regards, tom lane

#11Stephen Frost
sfrost@snowman.net
In reply to: Andrew Klosterman (#6)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:

(gdb) bt
#0 0x401c3851 in kill () from /lib/libc.so.6
#1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40139823 in memalign () from /usr/lib/libefence.so.0
#3 0x401399ad in malloc () from /usr/lib/libefence.so.0
#4 0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7 0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8 0x00000000 in ?? ()

Looks like something fishy going on between libpq and libkrb5. I'm
especially suspicious since I'm not using kerberos for authentication at
all.

Seems kind of unlikely... What exact (.deb) versions of libpq and
Postgres are you using? You originally posted w/ 8.1.0 but perhaps on
the client you had something more recent?

Thanks,

Stephen

#12Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Stephen Frost (#11)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Mon, 13 Feb 2006, Stephen Frost wrote:

* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:

(gdb) bt
#0 0x401c3851 in kill () from /lib/libc.so.6
#1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40139823 in memalign () from /usr/lib/libefence.so.0
#3 0x401399ad in malloc () from /usr/lib/libefence.so.0
#4 0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7 0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8 0x00000000 in ?? ()

Looks like something fishy going on between libpq and libkrb5. I'm
especially suspicious since I'm not using kerberos for authentication at
all.

Seems kind of unlikely... What exact (.deb) versions of libpq and
Postgres are you using? You originally posted w/ 8.1.0 but perhaps on
the client you had something more recent?

Thanks,

Stephen

Running "aptitude show X" where "X" is the package name, and applying
appropriate filtering gives the following results on my development
systems:

Package: libpq-dev
Version: 8.1.0-3

Package: libpq3
Version: 1:7.4.9-2

Package: libpq4
Version: 8.1.0-3

Package: postgresql-8.1
Version: 8.1.0-3

Package: postgresql-contrib-8.1
Version: 8.1.0-3

Package: postgresql-server-dev-8.1
Version: 8.1.0-3

Package: postgresql-client-8.1
Version: 8.1.0-3

Package: postgresql-common
Version: 39

(I frequently update and upgrade my installations...)

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#13Stephen Frost
sfrost@snowman.net
In reply to: Andrew Klosterman (#12)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:

Seems kind of unlikely... What exact (.deb) versions of libpq and
Postgres are you using? You originally posted w/ 8.1.0 but perhaps on
the client you had something more recent?

Running "aptitude show X" where "X" is the package name, and applying
appropriate filtering gives the following results on my development
systems:

Package: libpq-dev
Version: 8.1.0-3

Package: libpq3
Version: 1:7.4.9-2

Package: libpq4
Version: 8.1.0-3

Package: postgresql-8.1
Version: 8.1.0-3

Package: postgresql-contrib-8.1
Version: 8.1.0-3

Package: postgresql-server-dev-8.1
Version: 8.1.0-3

Package: postgresql-client-8.1
Version: 8.1.0-3

Package: postgresql-common
Version: 39

Hmm, alright, well, this is at least not the fault of the patch of mine
which was included in Debian's 8.1.2-2 Postgres release. :) You might
try compiling some debs with debugging enabled. This is (reasonably)
straight-forward:

(as root:)
aptitude install build-essential debhelper cdbs bison perl libperl-dev \
tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
gettext bzip2 fakeroot
(as user:)
apt-get source postgresql-8.1
cd postgresql-8.1-8.1.0
export DEB_BUILD_OPTIONS="nostrip"
dpkg-buildpackage -uc -us -rfakeroot

This should produce .debs in the parent directory which have debugging
information. Another useful build option is "noopt", i.e.:
export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the
error disappear. It'd be terribly nice if you could do this and
provide a gdb backtrace with debugging... :)

Thanks,

Stephen

#14Jens-Wolfhard Schicke
ml+pgsql-bug@asco.de
In reply to: Stephen Frost (#13)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

--On Monday, February 13, 2006 21:25:30 -0500 Stephen Frost
<sfrost@snowman.net> wrote:

* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:

Seems kind of unlikely... What exact (.deb) versions of libpq and
Postgres are you using? You originally posted w/ 8.1.0 but perhaps on
the client you had something more recent?

aptitude install build-essential debhelper cdbs bison perl libperl-dev \
tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
gettext bzip2 fakeroot

You might want to add valgrind to this list. It analyzes code at the
machine-instruction level and does a lot of memory checking and
undefined-value checking while the program runs. It has helped me fix
every SIGSEGV I ever encountered that was not an infinite recursion.

Best regards,
Jens Schicke
--
Jens Schicke j.schicke@asco.de
asco GmbH http://www.asco.de
Mittelweg 7 Tel 0531/3906-127
38106 Braunschweig Fax 0531/3906-400

#15Volkan YAZICI
yazicivo@ttnet.net.tr
In reply to: Andrew Klosterman (#6)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Feb 13 04:01, Andrew Klosterman wrote:

I threw in a pthread mutex around the code making the database connections
for each of my threads. The problem is still there ("corrupted
double-linked list").
...
Program received signal SIGILL, Illegal instruction.
[Switching to Thread 16384 (LWP 24753)]
0x401c3851 in kill () from /lib/libc.so.6
(gdb) bt
#0 0x401c3851 in kill () from /lib/libc.so.6
#1 0x40139dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40139823 in memalign () from /usr/lib/libefence.so.0
#3 0x401399ad in malloc () from /usr/lib/libefence.so.0
#4 0x40139a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a182f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c8b3f in ?? () from /usr/lib/libpq.so.4
#7 0x402ded88 in ?? () from /usr/lib/libpq.so.4
#8 0x00000000 in ?? ()

I have run into other thread-safety issues caused by the libc shipped in
the Debian repos. For instance, getpwuid_r() is broken in Debian's
current stable libc package, and this causes a similar memory leak
in the libpq code.

IMHO, testing the code with a newer libc version may be the solution.
Otherwise, for an exact answer (as Tom said) we need libpq symbols
in the backtrace.

Regards.

#16Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Tom Lane (#5)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

Tracking down exactly what's tickling the problem in this case could be
tricky...

Yeah :-(. If you aren't able to narrow it further by yourself, please
try to put together a self-contained test case.

regards, tom lane

Well, my attempt last night at putting together a test case that manifests
the error that I encountered was a total failure! The test code executes
flawlessly: no abnormal termination.

There must be something different between the two programs. But my
original is considerably more complex. I'll pursue other options for
debugging before returning to figuring out the difference between the
"real" code and the "test-case" code.

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#17Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Jens-Wolfhard Schicke (#14)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Tue, 14 Feb 2006, Jens-Wolfhard Schicke wrote:

--On Monday, February 13, 2006 21:25:30 -0500 Stephen Frost
<sfrost@snowman.net> wrote:

* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:

Seems kind of unlikely... What exact (.deb) versions of libpq and
Postgres are you using? You originally posted w/ 8.1.0 but perhaps on
the client you had something more recent?

aptitude install build-essential debhelper cdbs bison perl libperl-dev \
tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
gettext bzip2 fakeroot

You might want to add valgrind to this list. It analyzes code at the
machine-instruction level and does a lot of memory checking and
undefined-value checking while the program runs. It has helped me fix
every SIGSEGV I ever encountered that was not an infinite recursion.

Best regards,
Jens Schicke

I tried valgrind this morning. It detected problems in the depths of the
code behind ECPGconnect() down through SSL_read() and inflate(). Also,
there was trouble reported behind ECPGconnect() -> PQsetdbLogin() ->
pqGetpwuid() -> getpwuid_r() -> _dl_open() -> into the depths of
/lib/ld-2.3.5.so. Valgrind got so upset at the number of errors it found
that it gave up. Nothing bad seemed to show up in the code that I wrote.

But, while running under valgrind, the original program that manifests the
error condition runs just fine and to completion (maybe the errors are
just ignored in valgrind's replacement version of malloc as they are
with the MALLOC_CHECK_ environment variable set).

I'm moving on to try building the binaries without removing the symbols.
Hopefully that will give more useful information...

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#18Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Stephen Frost (#13)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Mon, 13 Feb 2006, Stephen Frost wrote:

Hmm, alright, well, this is at least not the fault of the patch of mine
which was included in Debian's 8.1.2-2 Postgres release. :) You might
try compiling some debs with debugging enabled. This is (reasonably)
straight-forward:

(as root:)
aptitude install build-essential debhelper cdbs bison perl libperl-dev \
tk8.4-dev flex libreadline5-dev libssl-dev zlib1g-dev \
libpam0g-dev libxml2-dev libkrb5-dev libxslt1-dev python-dev \
gettext bzip2 fakeroot
(as user:)
apt-get source postgresql-8.1
cd postgresql-8.1-8.1.0
export DEB_BUILD_OPTIONS="nostrip"
dpkg-buildpackage -uc -us -rfakeroot

This should produce .debs in the parent directory which have debugging
information. Another useful build option is "noopt", i.e.:
export DEB_BUILD_OPTIONS="nostrip noopt", though that could make the
error disappear. It'd be terribly nice if you could do this and
provide a gdb backtrace with debugging... :)

Thanks,

Stephen

Alright, I have built a system with the symbols left in the binaries.

It still crashes with the "corrupted double-linked list" error.

Running with ElectricFence the backtrace I get is:

Electric Fence 2.1 Copyright (C) 1987-1998 Bruce Perens.

ElectricFence Aborting: Allocating 0 bytes, probably a bug.

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 16384 (LWP 1895)]
0x401c4851 in kill () from /lib/libc.so.6
(gdb) bt
#0 0x401c4851 in kill () from /lib/libc.so.6
#1 0x40037dd5 in EF_Abort () from /usr/lib/libefence.so.0
#2 0x40037823 in memalign () from /usr/lib/libefence.so.0
#3 0x400379ad in malloc () from /usr/lib/libefence.so.0
#4 0x40037a10 in calloc () from /usr/lib/libefence.so.0
#5 0x404a282f in krb5_set_default_tgs_ktypes () from /usr/lib/libkrb5.so.3
#6 0x402c9b26 in pg_krb5_init (PQerrormsg=0x0) at fe-auth.c:119
#7 0x402ca304 in pg_fe_getauthname (PQerrormsg=0xbffff29c "l\031")
at fe-auth.c:176
#8 0x402cc861 in conninfo_parse (conninfo=<value optimized out>,
errorMessage=0x4057afe8) at fe-connect.c:2719
#9 0x402cc983 in connectOptions1 (conn=0x4057acdc, conninfo=0x0)
at fe-connect.c:362
#10 0x402cda11 in PQsetdbLogin (pghost=0x40574ffc "nc3", pgport=0x0,
pgoptions=0x0, pgtty=0x0, dbName=0x40576ff8 "andrew5",
login=0xbffffc31 "andrew5", pwd=0xbffffc3c "testbed") at fe-connect.c:568
#11 0x40030fe7 in ECPGconnect (lineno=191, c=0, name=0xbffffc22 "andrew5@nc3",
user=0xbffffc31 "andrew5", passwd=0x0,
connection_name=0xbffff8b0 "CorrectnessCheck", autocommit=0)
at connect.c:452
#12 0x08049ecb in DBConnect (arg_connection=0xbffff964 "CorrectnessCheck")
at client_test.pgcc:191
#13 0x0804a14f in DoCorrectnessChecks () at client_test.pgcc:231
#14 0x0804aa08 in main (argc=9, argv=0xbffffa74) at client_test.pgcc:526

Again, it is showing a bad malloc in what appears to be some code using
kerberos. But there's nothing in my setup that I can think of right now
that should induce a connection to be set up using kerberos.
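One caveat when reading this abort: a zero-byte request is not by itself invalid C. The standard lets malloc(0) and calloc() with a zero size return either NULL or a unique pointer, and the Electric Fence man page documents an EF_ALLOW_MALLOC_0 environment variable that makes it tolerate such calls. So this trap may be unrelated to the original heap corruption. Code that wants to avoid depending on the implementation's choice sometimes rounds the request up, as in the sketch below; `calloc_at_least_one` is a hypothetical helper and is not what libkrb5 does.

```c
#include <assert.h>
#include <stdlib.h>

/* Like calloc(), but never issues a zero-byte request: a request for
 * zero bytes is rounded up to one byte, so a successful call always
 * returns a non-NULL pointer that can be passed to free(). */
void *calloc_at_least_one(size_t count, size_t size)
{
    void *p;

    if (count == 0 || size == 0) {
        count = 1;
        size = 1;
    }
    p = calloc(count, size);
    assert(p == NULL || count * size > 0); /* never a zero-byte chunk */
    return p;
}
```

Running the test program again with EF_ALLOW_MALLOC_0 set to a non-zero value should let Electric Fence proceed past this point and possibly reach the real overrun, if there is one.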

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#19Stephen Frost
sfrost@snowman.net
In reply to: Andrew Klosterman (#18)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

* Andrew Klosterman (andrew5@ece.cmu.edu) wrote:

Alright, I have built a system with the symbols left in the binaries.

[...]

Again, it is showing a bad malloc in what appears to be some code using
kerberos. But there's nothing in my setup that I can think of right now
that should induce a connection to be set up using kerberos.

The Kerberos libraries are still called when support for them has been
compiled in. They generally don't cause any problems though. For some
reason the line numbers in the backtrace line up but the function names
don't quite (perhaps due to inlining). Anyhow, the error is being reported
down in 'krb5_init_context()' so either something strange is happening
or it's actually a Kerberos bug after all. The reason the Kerberos
libraries are called is to get the 'username' to use, which is
determined prior to actually connecting to the backend (and finding
out what authentication mechanism the backend thinks we should be
trying).

It's kind of a chicken-and-egg here because the backend decides what
authentication mechanism to ask for based off the username (at least in
part) through pg_hba.conf, so you can't find out the authentication
method until you know the username so all methods to find the username
have to be exhausted. You could avoid this by explicitly passing
'user=' into the connection parameters though... Would be interesting
to know what happens then...
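At the libpq level, 'user=' is a keyword in the connection string accepted by PQconnectdb(). A rough sketch of assembling such a string follows. The helper `build_conninfo` is hypothetical, the host, database, and user values in the usage note are the ones visible in the backtrace above, and `sslmode=require` is an assumption added to match the SSL setup being discussed. The sketch does no quoting, so it rejects values that would need it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A value is "plain" if it is non-empty and needs no conninfo quoting. */
static int is_plain(const char *v)
{
    return v != NULL && v[0] != '\0' && strpbrk(v, " \t\n'\\") == NULL;
}

/* Write "host=... dbname=... user=... sslmode=require" into buf.
 * Returns 0 on success, -1 if a value needs quoting or buf is too
 * small. The result is meant to be handed to PQconnectdb(). */
int build_conninfo(char *buf, size_t size,
                   const char *host, const char *dbname, const char *user)
{
    int n;

    assert(buf != NULL && size > 0);
    if (!is_plain(host) || !is_plain(dbname) || !is_plain(user))
        return -1;

    n = snprintf(buf, size, "host=%s dbname=%s user=%s sslmode=require",
                 host, dbname, user);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Calling `build_conninfo(buf, sizeof buf, "nc3", "andrew5", "andrew5")` and passing `buf` to PQconnectdb() would be one way to test the explicit-user path from plain libpq, outside ecpg.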

Might also be interesting to look into the Kerberos libraries to see why
they're attempting to malloc(0), perhaps there's a bug there when
Kerberos isn't set up on the machine?

Thanks,

Stephen

#20Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Stephen Frost (#19)
Re: BUG #2246: Bad malloc interactions: ecpg, openssl

On Tue, 14 Feb 2006, Stephen Frost wrote:
<snip>

It's kind of a chicken-and-egg here because the backend decides what
authentication mechanism to ask for based off the username (at least in
part) through pg_hba.conf, so you can't find out the authentication
method until you know the username so all methods to find the username
have to be exhausted. You could avoid this by explicitly passing
'user=' into the connection parameters though... Would be interesting
to know what happens then...

When asking about "explicitly passing 'user=' in to the connection
parameters" do you mean that the EXEC SQL CONNECT line that ecpg parses
should specify a username?

My code is using the following statement when making a remote connection
that uses SSL.

EXEC SQL CONNECT TO :l_target AS :l_connection
USER :l_user IDENTIFIED BY :l_passwd;

The target machine (hosting the database) has "ssl=on" in postgresql.conf.
Its pg_hba.conf (snippet below) has a line for the client machine from
which I am making the connection, and that line specifies that an SSL
connection should be made.

# TYPE   DATABASE  USER     CIDR-ADDRESS      METHOD
hostssl  andrew5   andrew5  172.19.130.4/32   pam passwd

--Andrew J. Klosterman
andrew5@ece.cmu.edu

#21Stephen Frost
sfrost@snowman.net
In reply to: Andrew Klosterman (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#21)
#23Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Tom Lane (#22)
#24Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#22)
#25Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Andrew Klosterman (#23)
#26Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Andrew Klosterman (#18)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Klosterman (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Klosterman (#26)
#29Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#22)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#29)
#31Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#30)
#32Andrew Klosterman
andrew5@ece.cmu.edu
In reply to: Tom Lane (#28)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#31)