Lyris looking to help fix PostgreSQL crashing problems

Started by John Buckman over 23 years ago, 5 messages
#1 John Buckman
john@lyris.com

Hello -- I'm the lead programmer of Lyris ListManager, an email list server that runs on PostgreSQL, Oracle, and MS/SQL.

About 20% of our client base of 4000 runs on PostgreSQL -- it's very popular with our clients -- much more than Oracle is (about 3%).

Unfortunately, we have about a dozen clients who have stability problems with PostgreSQL. This week a major television network cancelled their order with us due to their PostgreSQL stability issues, which is what prompted me to write this email and get involved with the PostgreSQL community.

It seems that with larger database sizes (500,000 rows and larger) and high stress, the server daemon has a tendency to core. We've also had cases where a single connection doing a million inserts into a table will cause the daemon to core. We've seen problems with both 7.1 and 7.2.x, with built-on-the-machine and with RPMs. We've also had big stability problems with Solaris 8/Sparc, and don't ship on that platform because of that.

What I'd like to do is help solve these problems in the core distribution, so that PostgreSQL can indeed handle the large databases and high transaction loads that Microsoft SQL can.

My company has hired open source people before to help fix bugs or add features to open source projects, most notably from the Tcl community, as we use Tcl quite a bit (we have two programmers from the Tcl Core team working here). This works out well for the Tcl community, as we fund the development of the project, as well as pay someone to work on something they want to work on anyhow.

So... what I'm looking for are recommendations on a PostgreSQL guru who could help nail the stability/load issues, and make sure that the fixes make their way back into the PostgreSQL core. What I'd prefer is to get a regular contributor to this list, so that this person could investigate our problems, and then get the community's help in solving them.

Thanks!

-john

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Buckman (#1)
Re: Lyris looking to help fix PostgreSQL crashing problems

John Buckman <john@lyris.com> writes:

> It seems that with larger database sizes (500,000 rows and larger) and
> high stress, the server daemon has a tendency to core.

We'd love to see some stack traces ...

regards, tom lane

#3 John Buckman
john@lyris.com
In reply to: Tom Lane (#2)
Re: Lyris looking to help fix PostgreSQL crashing problems

> John Buckman <john@lyris.com> writes:
>
>> It seems that with larger database sizes (500,000 rows and larger) and
>> high stress, the server daemon has a tendency to core.
>
> We'd love to see some stack traces ...

Yeah, I just didn't know what form this list prefers to work on things, which is why I'd prefer to hire a regular participant of this list. If gcc 'where' stack traces are what you want, we can do that.

I suspect that the problems may be platform-or-build related, because we've often had trouble replicating customer problems on our own systems. For example, we had many reports of problems with 7.2.x, and saw it crash often on a customer's redhat machine that we had ssh access to, but couldn't make it crash in our own lab. :( That's why we need help. If we could make a simple C test case that crashed pgsql, I'm sure you guys could fix the problem in a jiffy.

-john

#4 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: John Buckman (#3)
Re: Lyris looking to help fix PostgreSQL crashing problems

John Buckman wrote:

>> John Buckman <john@lyris.com> writes:
>>
>>> It seems that with larger database sizes (500,000 rows and larger) and
>>> high stress, the server daemon has a tendency to core.
>>
>> We'd love to see some stack traces ...
>
> Yeah, I just didn't know what form this list prefers to work on
> things, which is why I'd prefer to hire a regular participant
> of this list. If gcc 'where' stack traces are what you want,
> we can do that.

Yep, in most cases, the crash creates a core file in the database
directory. A backtrace of that core file is usually a good start. You
should make sure there are debugging symbols in the binary (gcc -g).

The server log files also often contain valuable information.
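
As a rough illustration of the backtrace step described above, the following Python sketch runs gdb in batch mode against a core file and captures the output. It assumes gdb is installed and that the server binary was built with debugging symbols. The helper names (`build_gdb_command`, `backtrace`) are invented for this example, and the binary and core file paths must be supplied by the user. gdb's `bt` command is an alias for `where`.

```python
import shutil
import subprocess
import sys


def build_gdb_command(postgres_binary, core_file):
    """Build a gdb command line that prints a backtrace and then exits."""
    return [
        "gdb",
        "--batch",    # run the -ex commands non-interactively, then exit
        "-ex", "bt",  # "bt" prints the stack, same as "where"
        postgres_binary,
        core_file,
    ]


def backtrace(postgres_binary, core_file):
    """Return gdb's combined output, or None if gdb is not on the PATH."""
    if shutil.which("gdb") is None:
        return None
    result = subprocess.run(
        build_gdb_command(postgres_binary, core_file),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero on a damaged core; keep the output
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Expects two arguments: the server binary and the core file it produced.
    if len(sys.argv) == 3:
        print(backtrace(sys.argv[1], sys.argv[2]))
    else:
        print("usage: backtrace.py POSTGRES_BINARY CORE_FILE", file=sys.stderr)
```

The text this prints, together with the relevant portion of the server log, is the kind of material that can be pasted into a reply on the list.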

> I suspect that the problems may be platform-or-build related,
> because we've often had trouble replicating customer problems
> on our own systems. For example, we had many reports of problems
> with 7.2.x, and saw it crash often on a customer's redhat machine
> that we had ssh access to, but couldn't make it crash in our
> own lab. :( That's why we need help. If we could make a simple
> C test case that crashed pgsql, I'm sure you guys could fix the
> problem in a jiffy.

Yes, that does make it harder, but a backtrace usually gets us started.
It may also be tickling some OS bug or a hardware failure, or a simple
exhaustion of some resource.
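
One inexpensive check for the resource-exhaustion possibility is to look at the per-process limits in the environment that starts the server. The Python sketch below is only an illustration: the function names are invented here, the choice of limits is an assumption about what might matter, and it reports the limits of the process it runs in, so it would need to be run as the same user and from the same shell that launches the server. A soft core-size limit of 0 also means no core file is written at all.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return (soft, hard) limits for a few resources relevant to crashes."""
    names = {
        "core file size (bytes)": resource.RLIMIT_CORE,
        "open files": resource.RLIMIT_NOFILE,
        "data segment (bytes)": resource.RLIMIT_DATA,
    }
    report = {}
    for label, rlimit in names.items():
        soft, hard = resource.getrlimit(rlimit)
        report[label] = (describe_limit(soft), describe_limit(hard))
    return report


if __name__ == "__main__":
    for label, (soft, hard) in report_limits().items():
        print(f"{label}: soft={soft} hard={hard}")
```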

#5 John Buckman
john@lyris.com
In reply to: Bruce Momjian (#4)
Re: Lyris looking to help fix PostgreSQL crashing problems

> John Buckman <john@lyris.com> writes:
>
>> It seems that with larger database sizes (500,000 rows and larger) and
>> high stress, the server daemon has a tendency to core.
>
> We'd love to see some stack traces ...

Yeah, I just didn't know what form this list prefers in terms of info to be able to work on things, which is why I'd prefer to hire a regular participant of this list. If gcc 'where' stack traces from core files are what you want, we can do that.

I suspect that the problems may be platform-or-build related, because we've often had trouble replicating customer problems on our own systems. For example, we had many reports of problems with 7.2.x, and saw it crash often on a customer's redhat machine that we had ssh access to, but couldn't make it crash in our own lab. :( That's why we need help. If we could make a simple C test case that crashed pgsql, I'm sure you guys could fix the problem in a jiffy, but localizing and recreating a problem is always 80% of it.

-john
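
As a starting point for recreating the "single connection doing a million inserts" scenario from the first message, a load script can be generated rather than written by hand. The Python sketch below is not the C test case mentioned in the thread. It only emits a plain SQL file, and the table name `stress_test`, its two columns, and the function name are all hypothetical. Nothing here is known to trigger the crash; it merely gives a repeatable workload to run against a suspect machine.

```python
import sys


def write_insert_script(out, rows, table="stress_test"):
    """Write SQL that creates a table and inserts `rows` rows, one per statement."""
    # Hypothetical schema, chosen only to give the inserts something to fill.
    out.write(f"CREATE TABLE {table} (id integer, payload text);\n")
    for i in range(rows):
        out.write(
            f"INSERT INTO {table} (id, payload) VALUES ({i}, 'row {i}');\n"
        )
    return rows


if __name__ == "__main__":
    # Expects one argument: the number of rows to insert.
    if len(sys.argv) == 2 and sys.argv[1].isdigit():
        write_insert_script(sys.stdout, int(sys.argv[1]))
    else:
        print("usage: gen_inserts.py ROW_COUNT > inserts.sql", file=sys.stderr)
```

The generated file can then be replayed over one connection with `psql -f inserts.sql dbname`, where the file and database names are placeholders.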