Suggestions for the best strategy to emulate returning multiple sets of results

Started by Seref Arikan · over 13 years ago · 11 messages · general
#1Seref Arikan
serefarikan@kurumsalteknoloji.com

Greetings,
I have a binary blob which is passed to a plpython function by a plpgsql
function. plpython is used to create two different transformations of this
binary blob into sets of postgresql type instances.
The flow is: blob -> plpython -> canonical python based data model ->
(set of db_type_As + set of db_type_Bs)
The problem is that transforming the binary blob to postgresql types is
expensive, and a single binary blob is the source of both transformations. I
have not found a way of returning two sets of data from the plpython function.
At the moment, I have two options:
1) calling two functions in plpython that use the same blob and return
different sets of postgresql types (heavyweight transformation will happen
twice: bad)
2) creating two temp tables and calling the plpython function which in turn
writes to these temp tables, and then using the temp tables from plpgsql.
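
(Sketching option 2 with made-up names — parse_blob, the column layouts, and
the model accessors are all illustrative, not real code from my system — it
would look something like this: one plpython call does the heavyweight
transformation once and fills both ON COMMIT DROP temp tables.)

```sql
-- Temp tables the plpgsql caller creates before invoking the plpython function.
CREATE TEMP TABLE tmp_type_a (field1 text, field2 int) ON COMMIT DROP;
CREATE TEMP TABLE tmp_type_b (field1 text) ON COMMIT DROP;

CREATE OR REPLACE FUNCTION fill_both(blob bytea) RETURNS void AS $$
    model = parse_blob(blob)  # hypothetical heavyweight transformation, run once
    ins_a = plpy.prepare("INSERT INTO tmp_type_a VALUES ($1, $2)",
                         ["text", "int"])
    ins_b = plpy.prepare("INSERT INTO tmp_type_b VALUES ($1)", ["text"])
    for f1, f2 in model["as_a"]:      # first transformation's rows
        plpy.execute(ins_a, [f1, f2])
    for (f1,) in model["as_b"]:      # second transformation's rows
        plpy.execute(ins_b, [f1])
$$ LANGUAGE plpythonu;

-- The plpgsql caller then does:
--   PERFORM fill_both(blob);
--   ... read from tmp_type_a and tmp_type_b ...
```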

Do you think there are any other options that I might be missing? What
would be the most efficient way of passing temp tables to a plpython function?

Kind regards
Seref

#2Merlin Moncure
mmoncure@gmail.com
In reply to: Seref Arikan (#1)
Re: Suggestions for the best strategy to emulate returning multiple sets of results

Are the two sets the same size? If so, you probably want to do a
vanilla SRF. If not, consider a composite containing arrays:

create type foo as(a int[], b int[]);

CREATE FUNCTION get_stuff()
RETURNS foo
AS $$
return [(1, 2, 3, 4, 5), (1,2,3)];
$$ LANGUAGE plpythonu;

postgres=# select * from get_stuff();
      a      |    b
-------------+---------
 {1,2,3,4,5} | {1,2,3}

merlin

#3Seref Arikan
serefarikan@kurumsalteknoloji.com
In reply to: Merlin Moncure (#2)
Re: Suggestions for the best strategy to emulate returning multiple sets of results

Thanks Merlin,
I've tried arrays, but plpython does not support returning arrays of custom
db types (which is what I'd need to do).


#4Merlin Moncure
mmoncure@gmail.com
In reply to: Seref Arikan (#3)
Re: Suggestions for the best strategy to emulate returning multiple sets of results

hm -- yeah. can your custom types be broken down into plain SQL types
(that is, composite types)? maybe stash the results in a global
variable and return them in two calls, or insert into a temp
table that drops on commit?
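
(A rough sketch of the global-variable idea — parse_blob and the model's
as_type_a() method are invented for illustration. GD is plpython's
session-global dictionary, shared by all plpython functions, so a second
function serving db_type_b rows could reuse the same cached parse:)

```sql
CREATE OR REPLACE FUNCTION get_type_a(blob bytea) RETURNS SETOF db_type_a AS $$
    import hashlib
    # key the cache on a digest of the blob so repeated calls hit the cache
    key = hashlib.md5(blob).hexdigest()
    if key not in GD:
        GD[key] = parse_blob(blob)   # hypothetical expensive step, done once
    return GD[key].as_type_a()       # a sibling get_type_b() reads GD[key] too
$$ LANGUAGE plpythonu;
```

One caveat: GD entries live for the whole session, so the cache needs
explicit eviction (e.g. a cleanup function that deletes the key) or it will
grow without bound.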

merlin

#5Seref Arikan
serefarikan@kurumsalteknoloji.com
In reply to: Merlin Moncure (#4)
Re: Suggestions for the best strategy to emulate returning multiple sets of results

Hi Merlin,
Thanks for the response. At the moment, the main function creates two
temp tables that drop on commit, and the python function fills these. Not too
bad, but I'd like to push these temp tables to ram, which is a bit tricky
since postgresql has no direct method of doing this. (a topic
that has been discussed in the past in this mail group)

The global variable idea is interesting though. I have not encountered this
before; is it the global dictionary SD/GD mentioned here:
http://www.postgresql.org/docs/9.0/static/plpython-sharing.html ?
It may help perform the expensive transformations once and reuse the
results.

Kind regards
Seref


#6Merlin Moncure
mmoncure@gmail.com
In reply to: Seref Arikan (#5)
Re: Suggestions for the best strategy to emulate returning multiple sets of results


yeah. though you might find that the overhead of temp tables is
already pretty good -- they are mostly ram based in typical usage, as
they aren't synced. I find the greatest overhead in using them is
actually creation and dropping -- so for very low latency
transactions I use an unlogged permanent table with the value returned by
txid_current() as the leading field in the key.
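
(A sketch of that scheme, with invented table and function names: an
unlogged table skips WAL, so writes are fast, but its contents are truncated
after a crash and are not replicated. txid_current() gives each transaction a
key that isolates its rows from concurrent sessions:)

```sql
CREATE UNLOGGED TABLE scratch_a (
    txid bigint,        -- txid_current() of the writing transaction
    val  db_type_a      -- payload; type name taken from the thread
);
CREATE INDEX scratch_a_txid_idx ON scratch_a (txid);

-- writer, inside the transaction (produce_type_a() is hypothetical):
INSERT INTO scratch_a
SELECT txid_current(), v FROM produce_type_a() v;

-- reader, in the same transaction, sees only its own rows:
SELECT val FROM scratch_a WHERE txid = txid_current();

-- clean up before commit (or in a periodic job):
DELETE FROM scratch_a WHERE txid = txid_current();
```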

merlin

#7Seref Arikan
serefarikan@kurumsalteknoloji.com
In reply to: Merlin Moncure (#6)
Re: Suggestions for the best strategy to emulate returning multiple sets of results

Comments inline (sorry, did not cc the group in the other mail)

On Wed, Oct 10, 2012 at 2:55 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> yeah. maybe though you might find that the overhead of temp tables is
> already pretty good -- they are mostly ram based in typical usage as
> they aren't synced. I find actually the greatest overhead in terms of
> using them is creation and dropping -- so for very low latency
> transactions I use a unlogged permanent table with value returned by
> txid_current() as the leading field in the key.

This is very interesting. The reason I've tried to avoid a shared temp
table is that I'd have to have a session id for calls, which led to severe
performance issues with the entity-attribute-value approach I'm using in
the temp table.
Your approach sounds as if it was designed to overcome exactly my problem,
but I have no idea what an unlogged table does, or how you use txid_current.
Could you explain a bit?

Regards
Seref


#8Bret Stern
bret_stern@machinemanagement.com
In reply to: Seref Arikan (#5)
Re: Re: Suggestions for the best strategy to emulate returning multiple sets of results

create a ramdrive


#9Seref Arikan
serefarikan@kurumsalteknoloji.com
In reply to: Bret Stern (#8)
Re: Re: Suggestions for the best strategy to emulate returning multiple sets of results

Thanks Bret,
I'm concerned about what happens when, under high load, my functions fill
the ramdrive with the temporary tables I'm using. Telling postgres to use
ram with an option to fall back to disk would be significantly better in
terms of uptime.
However, I was thinking about some mechanism in the middle tier that
watches the free space in the ram drive and redirects queries to functions
that create temp tables on disk if the ram drive is close to full. That may
help me accomplish what I'm trying to do.

Regards
Seref


#10Jasen Betts
jasen@xnet.co.nz
In reply to: Seref Arikan (#1)
Re: Suggestions for the best strategy to emulate returning multiple sets of results


That's what operating systems are for: a ramdisk is only ever a hint.
If ram is short it will wind up in swap; if ram is plentiful, a disk
table will be fully buffered in ram.

--
⚂⚃ 100% natural

#11Seref Arikan
serefarikan@gmail.com
In reply to: Jasen Betts (#10)
Re: Re: Suggestions for the best strategy to emulate returning multiple sets of results

Jasen,
Thanks for this. My last use of a ramdisk was ages ago, and I've always had
the idea that it was just a disk in ram with no capability to spill over to
disk.
It appears the mind refuses to acknowledge that this stopped being the case
many years ago :)

Some google searches returned others asking the same question; surely
someone must have properly established this under *nix. I'll keep
searching and post my solution for feedback.

Kind regards
Seref


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general