atoi-like function: is there a better way to do this?

Started by Chris Angelicoabout 14 years ago8 messagesgeneral
Jump to latest
#1Chris Angelico
rosuav@gmail.com

One of our tables has a few columns that may be interpreted as strings
or may be numbers (data type is varchar, numbers are stored as
decimal). Generally, operations are performed on the string, but
sometimes we need to parse out a number - without it failing on error.
I wrote the following function to approximate to the semantics of
atoi:

create or replace function str2int(val varchar) returns bigint immutable as $$
begin
val=substring(val from '[0-9]*');
if length(val) between 1 and 19 then return val::bigint; end if;
return 0;
end;
$$ language plpgsql;

It uses a regular expression to chop the string down to just the
leading digits (leading only - '123.45' should parse as 123). Is there
a more efficient way to achieve this?

Thanks!

Chris Angelico

#2Chris Angelico
rosuav@gmail.com
In reply to: Chris Angelico (#1)
Re: atoi-like function: is there a better way to do this?

(Hoping you meant for that reply to be on-list as I'm here responding on-list.)

On Mon, Mar 5, 2012 at 2:16 PM, A.M. <agentm@themactionfaction.com> wrote:

On Mar 4, 2012, at 9:13 PM, Chris Angelico wrote:

One of our tables has a few columns that may be interpreted as strings
or may be numbers (data type is varchar, numbers are stored as
decimal). Generally, operations are performed on the string, but
sometimes we need to parse out a number - without it failing on error.
I wrote the following function to approximate to the semantics of
atoi:

I would recommend against such a schema since different data types should warrant their own columns, but if you are left with no choice...

The values have to be strings for other reasons (eg '' is valid
everywhere, and this is subsequently processed by a script that
expects all strings). So yeah, no choice there. But I agree that
normally you DO want integers stored in integer columns, and we're
paying a performance penalty for this.

create or replace function str2int(val varchar) returns bigint immutable as $$
begin
      val=substring(val from '[0-9]*');
      if length(val) between 1 and 19 then return val::bigint; end if;
      return 0;
end;
$$ language plpgsql;

This can be written as:
select substring('35345345.45645654' from '\d{1,19}')::bigint;

Be aware that this does not account for:
3dogs
3 dogs
3,dogs
3.5.6.7

all of which will return 3::bigint, but I assume that the column is not completely free-form or maybe this is what you want. (?)

Cheers,
M

That is in fact the behaviour I want. Trouble is, that simpler version
returns NULL if given 'asdf' as its input - I want it to return 0.
It's also about the same performance (slightly slower in a quick test)
than the original; it's still doing a regular expression parse. I was
hoping very much to avoid the regex altogether.

ChrisA

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Chris Angelico (#2)
Re: atoi-like function: is there a better way to do this?

On Mar 4, 2012, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:

(Hoping you meant for that reply to be on-list as I'm here responding on-list.)

On Mon, Mar 5, 2012 at 2:16 PM, A.M. <agentm@themactionfaction.com> wrote:

On Mar 4, 2012, at 9:13 PM, Chris Angelico wrote:

One of our tables has a few columns that may be interpreted as strings
or may be numbers (data type is varchar, numbers are stored as
decimal). Generally, operations are performed on the string, but
sometimes we need to parse out a number - without it failing on error.
I wrote the following function to approximate to the semantics of
atoi:

I would recommend against such a schema since different data types should warrant their own columns, but if you are left with no choice...

The values have to be strings for other reasons (eg '' is valid
everywhere, and this is subsequently processed by a script that
expects all strings). So yeah, no choice there. But I agree that
normally you DO want integers stored in integer columns, and we're
paying a performance penalty for this.

create or replace function str2int(val varchar) returns bigint immutable as $$
begin
val=substring(val from '[0-9]*');
if length(val) between 1 and 19 then return val::bigint; end if;
return 0;
end;
$$ language plpgsql;

This can be written as:
select substring('35345345.45645654' from '\d{1,19}')::bigint;

Be aware that this does not account for:
3dogs
3 dogs
3,dogs
3.5.6.7

all of which will return 3::bigint, but I assume that the column is not completely free-form or maybe this is what you want. (?)

Cheers,
M

That is in fact the behaviour I want. Trouble is, that simpler version
returns NULL if given 'asdf' as its input - I want it to return 0.

COALESCE is your friend

It's also about the same performance (slightly slower in a quick test)
than the original; it's still doing a regular expression parse. I was
hoping very much to avoid the regex altogether.

ChrisA

Any efficient, non-RegEx, alternative would require more context to evaluate than you provide. Mainly, would it be faster to have a separate field to store the parsed (at input) number and then query that field directly (even if it is a text field as well)? Basically cache the parse.

David J..

#4Chris Angelico
rosuav@gmail.com
In reply to: David G. Johnston (#3)
Re: atoi-like function: is there a better way to do this?

On Mon, Mar 5, 2012 at 2:50 PM, David Johnston <polobo@yahoo.com> wrote:

Any efficient, non-RegEx, alternative would require more context to evaluate than you provide.  Mainly, would it be faster to have a separate field to store the parsed (at input) number and then query that field directly (even if it is a text field as well)?  Basically cache the parse.

Caching's looking tempting, but I don't know if it'll be worth it
(these fields won't be searched-as-int very often compared to
search-as-string, and there's potentially a lot of such fields). All I
need out of it is the leading digits - I can strip them with trim(),
but I can't keep _only_ those digits.

The other possibility that may be of value is to write the function in
C instead of pl/pgsql, which will then actually call atoi() itself. Is
that going to be a better option?

ChrisA

#5Tom Molesworth
tom@audioboundary.com
In reply to: Chris Angelico (#4)
Re: atoi-like function: is there a better way to do this?

On 05/03/12 04:06, Chris Angelico wrote:

On Mon, Mar 5, 2012 at 2:50 PM, David Johnston<polobo@yahoo.com> wrote:

Any efficient, non-RegEx, alternative would require more context to evaluate than you provide. Mainly, would it be faster to have a separate field to store the parsed (at input) number and then query that field directly (even if it is a text field as well)? Basically cache the parse.

Caching's looking tempting, but I don't know if it'll be worth it
(these fields won't be searched-as-int very often compared to
search-as-string, and there's potentially a lot of such fields). All I
need out of it is the leading digits - I can strip them with trim(),
but I can't keep _only_ those digits.

The other possibility that may be of value is to write the function in
C instead of pl/pgsql, which will then actually call atoi() itself. Is
that going to be a better option?

Can you use to_number() here? It sounds like something along the lines
of cast(to_number('0' || field::varchar, '999999999.') as int) might
give the behaviour you're after, and a quick test seems to indicate that
it's about 4x faster than the original function:

postgres=# explain analyze select cast(to_number('0' ||
generate_series::varchar, '999999999.') as int) from
generate_series(1,1000000);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..22.50 rows=1000 width=4)
(actual time=137.720..1065.752 rows=1000000 loops=1)
Total runtime: 1144.993 ms
(2 rows)

postgres=# explain analyze select str2int(generate_series::varchar) from
generate_series(1,1000000);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..265.00 rows=1000
width=4) (actual time=135.180..4022.408 rows=1000000 loops=1)
Total runtime: 4121.233 ms
(2 rows)

Hopefully there's a cleaner way of writing that without a long list of
9s in the format string, and if the field is nullable I'd guess you
probably need a coalesce(..., 0) around that as well.

cheers,

Tom

#6Chris Angelico
rosuav@gmail.com
In reply to: Tom Molesworth (#5)
Re: atoi-like function: is there a better way to do this?

On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth <tom@audioboundary.com> wrote:

Can you use to_number() here? It sounds like something along the lines of
cast(to_number('0' || field::varchar, '999999999.') as int) might give the
behaviour you're after, and a quick test seems to indicate that it's about
4x faster than the original function:

I looked at to_number but it seems to find digits anywhere inside the
field - to_number('12.34','99999') returns 1234, but I want it to stop
at the decimal.

Nice trick with prepending a zero though - I didn't think of that.
That may save some hassle!

ChrisA

#7David G. Johnston
david.g.johnston@gmail.com
In reply to: Chris Angelico (#6)
Re: atoi-like function: is there a better way to do this?

On Mar 5, 2012, at 0:08, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth <tom@audioboundary.com> wrote:

Can you use to_number() here? It sounds like something along the lines of
cast(to_number('0' || field::varchar, '999999999.') as int) might give the
behaviour you're after, and a quick test seems to indicate that it's about
4x faster than the original function:

I looked at to_number but it seems to find digits anywhere inside the
field - to_number('12.34','99999') returns 1234, but I want it to stop
at the decimal.

Nice trick with prepending a zero though - I didn't think of that.
That may save some hassle!

ChrisA

0alpha999 -> 0
alpha999 -> 999

#8Tom Molesworth
tom@audioboundary.com
In reply to: Chris Angelico (#6)
Re: atoi-like function: is there a better way to do this?

On 05/03/12 05:08, Chris Angelico wrote:

On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth<tom@audioboundary.com> wrote:

Can you use to_number() here? It sounds like something along the lines of
cast(to_number('0' || field::varchar, '999999999.') as int) might give the
behaviour you're after, and a quick test seems to indicate that it's about
4x faster than the original function:

I looked at to_number but it seems to find digits anywhere inside the
field - to_number('12.34','99999') returns 1234, but I want it to stop
at the decimal.

Right - that's why I included the . in the format string. Haven't done
exhaustive testing but it seemed to do what you were after.

cheers,

Tom