to pickle or not to pickle

Started by Marc Tardif · 5 messages · general
#1 Marc Tardif
intmktg@CAM.ORG

I'm writing a search engine using python and postgresql which requires
storing a temporary list of results in an sql table for each request. This
list will contain at least 50 records and could grow to about 300. My
options are either to pickle the list and store a single entry or use the
postgresql COPY command (as opposed to INSERT which would be too slow) to
store each of the temporary records.

Question is, how can I make an educated decision on which option to
select? What kind of questions should I be asking myself? Should I
actually go through the trouble of implementing both alternatives and
profiling each separately? If so, how can I predict what will happen under
a heavy load, which is hard to simulate when benchmarking each option?

Thanks

#2 Jurgen Defurne
defurnj@glo.be
In reply to: Marc Tardif (#1)
Re: to pickle or not to pickle

Marc Tardif wrote:

> I'm writing a search engine using python and postgresql which requires
> storing a temporary list of results in an sql table for each request. This
> list will contain at least 50 records and could grow to about 300. My
> options are either to pickle the list and store a single entry or use the
> postgresql COPY command (as opposed to INSERT which would be too slow) to
> store each of the temporary records.

You are writing a search engine: does that mean that you need to search
the web and that you want to store your temporary results in a table, OR
does that mean that you are writing a QUERY screen, from which you
generate a SELECT statement to query your POSTGRES database?

Also, what size are your tuples?

Do you need these temporary results within the same program, or do you
need to pass them somewhere to another program?

> Question is, how can I make an educated decision on which option to
> select? What kind of questions should I be asking myself? Should I
> actually go through the trouble of implementing both alternatives and
> profiling each separately? If so, how can I predict what will happen under
> a heavy load, which is hard to simulate when benchmarking each option?

Always go for a simple solution. This may (paradoxically) need some more
study. One of the first questions you should ask yourself: is it really
necessary to store this temporary result? If so, then why take the pickle
option? Pickling is meant for persistent data, which is really more a
mechanism to store data between sessions. Maybe you should consider the
option which is used in traditional IT: just store your data in a
sequential file. Much less overhead, because your OS handles it directly.
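For comparison, a minimal sketch of the pickle route (the result tuples here are hypothetical stand-ins for real search hits): the whole list becomes one blob, written and read in a single operation, whether the destination is a table column or a plain sequential file.

```python
import pickle

# Hypothetical result rows (~100 bytes each, up to ~300 per request)
results = [(i, "http://example.com/page%d" % i, 0.5) for i in range(300)]

# Pickle the whole list into one blob: a single write (to one table row
# or one sequential file) instead of one COPY/INSERT per tuple.
blob = pickle.dumps(results)

# Reading it back is a single load; no per-row parsing needed.
restored = pickle.loads(blob)
assert restored == results
```

The trade-off is that the database can no longer query inside the blob; it is only worthwhile if the list is always read back whole.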

Concerning the benchmarking, it seems as if the only way to do this is to
automatically start scripts which do what needs to be done and then
measure what happens: number of processes, CPU and I/O load.

Jurgen Defurne
defurnj@glo.be

#3 Jurgen Defurne
defurnj@glo.be
In reply to: Jurgen Defurne (#2)
Re: to pickle or not to pickle

Marc Tardif wrote:

>>> I'm writing a search engine using python and postgresql which requires
>>> storing a temporary list of results in an sql table for each request. This
>>> list will contain at least 50 records and could grow to about 300. My
>>> options are either to pickle the list and store a single entry or use the
>>> postgresql COPY command (as opposed to INSERT which would be too slow) to
>>> store each of the temporary records.

>> You are writing a search engine: does that mean that you need to search
>> the web and that you want to store your temporary results in a table, OR
>> does that mean that you are writing a QUERY screen, from which you
>> generate a SELECT statement to query your POSTGRES database?
>>
>> Also, what size are your tuples?
>>
>> Do you need these temporary results within the same program, or do you
>> need to pass them somewhere to another program?

> The former, search the web and store temporary results in a table. As for
> the tuples, I can expect each to be <100 bytes. Finally, the temporary
> results will only be used by the same program.

If your temporary results ARE really to be used by the same program, then
I suggest that you use a solution whereby you keep your temp results
in a data structure in memory, and not write them to any table or
temporary file. Python has enough basic and extended data structures to do
that.

If your tuple size is 100 bytes and you are sure that you have a maximum
of 300 tuples, then you will use approximately 30 KB of memory (not
counting run-time overhead). Using a simple list to store your data
will greatly simplify your life, and you don't need to worry about memory
management.
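The in-memory approach above can be sketched as follows (the field names are made up for illustration): a plain list holds the tuples, and sorting plus slicing replaces any temporary-table work.

```python
# Hold the per-request results in a plain list; ~300 tuples of ~100
# bytes is roughly 30 KB, a trivial amount of memory.
results = [("http://example.com/%d" % i, i % 7) for i in range(300)]  # (url, score)

# Sorting and slicing in memory replace any temporary table:
results.sort(key=lambda row: row[1], reverse=True)
first_page = results[:20]  # first screen of hits
```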

Good luck.

Jurgen Defurne
defurnj@glo.be

#4 Lincoln Yeoh
lylyeoh@mecomb.com
In reply to: Marc Tardif (#1)
Re: to pickle or not to pickle

At 11:56 AM 31-05-2000 -0400, Marc Tardif wrote:

> I'm writing a search engine using python and postgresql which requires
> storing a temporary list of results in an sql table for each request. This
> list will contain at least 50 records and could grow to about 300. My
> options are either to pickle the list and store a single entry or use the
> postgresql COPY command (as opposed to INSERT which would be too slow) to
> store each of the temporary records.

Are you trying to do:

"showing 20 results", click next/previous for the next/previous 20?

Whatever it is, I don't think you should use COPY.

The way I did it was to just do the query again, and only display the
relevant results, using offset and window values.

Not as efficient, but:
1) I wanted to know how many rows there were, so if I used SELECT .. LIMIT,
I'd have to do a SELECT count first, and AFAIK Postgresql has no special
optimizations for SELECT count (not even sure if other databases would be
faster for _my_ SELECT count).

2) I didn't want to deal with cleaning up the cache/pickles... My app was
web based, so I don't know when the users have left. Say I expire the
cache/pickles after 15 minutes. If I have 100 searches per minute, I'd end
up having 1500 pickles at a time 8*). Not really a big problem nowadays,
but I didn't think it was worth dealing with.

3) It wasn't really a search engine- different results for different users,
different ways of sorting stuff etc.

But if your search engine returns the same result given the same query no
matter who the user is, the cache thing could be good. It may mean a
redesign: have a cache table storing queries and results (and expiry). You
will probably require regular vacuuming, since the cache table will be
changing quite often.

e.g. each row:
query string, result1, result2, sequence, total results, expiry time.

By storing the total results you can use Postgresql's LIMIT feature more
intelligently. You can probably afford to waste the 4 bytes per row, and
keep everything in one table for speed.
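The expiry side of such a cache table can be sketched as pure bookkeeping (all names here are hypothetical): each cached row carries its expiry time, and lookups discard rows past it.

```python
import time

CACHE_TTL = 15 * 60  # expire cached result sets after 15 minutes


def make_cache_row(query, result1, result2, sequence, total):
    # Mirrors the row layout suggested above: query string, result
    # columns, sequence number, total results, expiry time.
    return (query, result1, result2, sequence, total, time.time() + CACHE_TTL)


def is_fresh(row, now=None):
    # A row is usable until its stored expiry time passes.
    return (now if now is not None else time.time()) < row[5]


row = make_cache_row("python pickle", "url1", "url2", 1, 300)
```

A periodic DELETE of stale rows (followed by a VACUUM) would keep the table from growing without bound.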

Cheerio,

Link.

#5 Richard Moon
richard@dcs.co.uk
In reply to: Lincoln Yeoh (#4)
Re: to pickle or not to pickle

At 16:28 05/06/00 +0800, you wrote:

> At 11:56 AM 31-05-2000 -0400, Marc Tardif wrote:
>
>> I'm writing a search engine using python and postgresql which requires
>> storing a temporary list of results in an sql table for each request. This
>> list will contain at least 50 records and could grow to about 300. My
>> options are either to pickle the list and store a single entry or use the
>> postgresql COPY command (as opposed to INSERT which would be too slow) to
>> store each of the temporary records.
>
> Are you trying to do:
>
> "showing 20 results" click next/previous for next/previous 20.

If you _are_ trying to do this, and if it's a web-based development, you
might like to look at Zope (www.zope.org). It works well with PostgreSQL,
and it will do this for you automatically (take a look at the SQL Methods
docs on that site). It's written in and uses python, so you should feel
really at home. Let me know if you need any more help on Zope.

Richard

Richard Moon
richard@dcs.co.uk