TODO item: Implement Boyer-Moore searching (First time hacker)

Started by David Rowleyover 17 years ago21 messages

dgrowley@gmail.com

over 17 years ago

1 attachment(s)

Reference: Bruce Momjian writes: ->
http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php

Other references: Boyer Moore?? ->
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-exa
mple.html

As a first time hacker of postgresql I've decided to attempt something that
does not go too deep in to the heart of the software.

Before I make an attempt at writing a patch, I've decided to start
benchmarking outside of Postgres. On reading the link underneath the Boyer
Moore item in the in the new TODO list wiki, I learned that someone else had
a shot at speeding up strpos around this time last year using KMP algotithm.
On looking a little deeper into Boyer Moore's algorithm it's quite obvious
that there is some overhead involved with preprocessing the search key
(needle). My target has been to not have any trade offs with this possibly
to be patch. I didn't want smaller searches to suffer.

I'd like to receive some feedback on the various functions in the attached
before I go ahead and work on this any more. I've written 8 different
versions of the function. Version 8 seems to be the fastest out of them,
however I've not tested with multi byte chars.

Please note that these are just some initial ideas. For example startpos is
not yet implemented, but there is very little overhead to that anyway.

For comparision sake I mocked up the current postgresql 8.3's strpos() from
varlena.c. I've no idea how accuratly this will be to the real version.
Probably find out once I write the patch. I replaced strncmp with my own
version as I was having problems with my compilers optimizer, it optimized
too much and I was unable to benchmark. I felt running without and
optimization was unfair as strncmp is. Perhaps if someone else wants to
benchmark with -O2 on gcc they should first replace pg_strncmp with strncmp.

Currently I have a single file which can be compiled itself that contains
benchmarks for the functions I've tested so far. All of these implement a
stripped out version of the Boyer Moore algotithm. I've done away with one
of the character maps in a bid to reduce the overhead involved creating
them.

The position calculation will likely need changed for multi byte characters.
I'm suspecting I'm not meant to be doing length calculations on pointers.
Can someone confirm?

To keep this email short I've put most of the information required in as
comments into the source of the program.

I've uploaded some benchmark results using the attached program. The
benchmarking was done on windows.

Please see http://www.unixbeast.com/~fat/test1.html there are 10 tests in
total, in each I attempt to exploit weaknesses, some with my functions and
some with the existing function.

My method for version 8 of the function probably looks quite unusual as it
does some simple cost based optimisation

I look forward to receiving feedback on this.

David.

Peter Eisentraut

peter_e@gmx.net

over 17 years ago

In reply to: David Rowley (#1)

Re: TODO item: Implement Boyer-Moore searching (First time hacker)

David Rowley wrote:

Reference: Bruce Momjian writes: ->
http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php

Other references: Boyer Moore?? ->
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html

I look forward to receiving feedback on this.

Please send a patch in diff -c format. And don't put a single patch
file in a tar file.

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Peter Eisentraut (#2)

1 attachment(s)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

I wrote:

I look forward to receiving feedback on this.

Peter Eisentraut wrote:

Please send a patch in diff -c format. And don't put a single patch
file in a tar file.

Thanks for the pointer. I'm quite new to this.

I've done some more revisions to the patch. This has mostly just involved
tuning the skip table size based on the size of the search. This has
basically involved lots of benchmarks with different strings and calculating
the best size of table to use. The reason for this is to maintain fast
searches for smaller strings. The overhead of initialising a 256 element
array would probably out weigh the cost of the search if this were not done.
The size of the skip table increases with longer strings, or rather the size
that is utilised.

Performance:
For smaller searches performance of the patch and existing version are very
similar. The patched version starts to out perform the existing version when
the needle and haystack become larger.

The patch wins hands down with searches that leads the existing function in
to dead ends, for example:

SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA');

When searching for very small strings, say just a single character in a
sentence the existing function marginally beats the patched version.

Outside of Postgres I've done benchmarks where I've searched for every
combination of the search string in the search string. Like:

test | t |
test | te |
test | tes |
test | test |
test | e |
test | es |
test | est |
test | s |
test | st |
test | t |

I felt this was fair for both versions. The patched version beat the
unpatched version. The test I carried out was a string of 934 characters.

I can upload more benchmarks if required.

I'm quite happy with the patch now so I'm going to submit it (in diff -c
format this time :)

David.

Attachments:

boyer-moore_strpos_v1.1.patchapplication/octet-stream; name=boyer-moore_strpos_v1.1.patchDownload

*** varlena.c	2008-09-02 23:45:48.000000000 +0100
--- varlena.c	2008-09-02 23:38:15.000000000 +0100
***************
*** 840,901 ****
  	}
  }
  
! static int
  text_position_next(int start_pos, TextPositionState *state)
  {
! 	int			pos = 0,
! 				p,
! 				px;
  
  	Assert(start_pos > 0);		/* else caller error */
  
! 	if (state->len2 <= 0)
! 		return start_pos;		/* result for empty pattern */
  
- 	if (!state->use_wchar)
- 	{
- 		/* simple case - single byte encoding */
- 		char	   *p1 = state->str1;
- 		char	   *p2 = state->str2;
  
! 		/* no use in searching str past point where search_str will fit */
! 		px = (state->len1 - state->len2);
  
- 		p1 += start_pos - 1;
  
! 		for (p = start_pos - 1; p <= px; p++)
! 		{
! 			if ((*p1 == *p2) && (strncmp(p1, p2, state->len2) == 0))
! 			{
! 				pos = p + 1;
! 				break;
! 			}
! 			p1++;
! 		}
! 	}
! 	else
! 	{
! 		/* not as simple - multibyte encoding */
! 		pg_wchar   *p1 = state->wstr1;
! 		pg_wchar   *p2 = state->wstr2;
  
- 		/* no use in searching str past point where search_str will fit */
- 		px = (state->len1 - state->len2);
  
! 		p1 += start_pos - 1;
  
! 		for (p = start_pos - 1; p <= px; p++)
! 		{
! 			if ((*p1 == *p2) && (pg_wchar_strncmp(p1, p2, state->len2) == 0))
! 			{
! 				pos = p + 1;
! 				break;
! 			}
! 			p1++;
! 		}
! 	}
  
! 	return pos;
  }
  
  static void
--- 840,1027 ----
  	}
  }
  
! /* text_position_next
!  * David Rowley 2008-09-01
!  * Uses Boyer-Moore searching
!  */
! int
  text_position_next(int start_pos, TextPositionState *state)
  {
!   
!   int skiptable[256];
!   int ai;
!   int hashsize;
! 
!   /*
!    * state->len2 = needle length in chars 
!    * state->len1 = haystack length in chars 
!    */
  
  	Assert(start_pos > 0);		/* else caller error */
  
!   if (state->len2 <= 0)
! 		return start_pos;		/* Empty needle, found it! */
  
  
!   /* Changed for zero based arrays */
!   start_pos--;
  
  
!   /* Eliminate the impossible first, the needle is
!    * too big for the haystack.
!    */
!   if (state->len1 + start_pos < state->len2)
!     return 0;
! 
!   /* Here we must determine how much of the skip table we should use.
!    * With haystack lengths of only a few characters we don't really want
!    * to have to initalize the full 256 elements of the table. Such an
!    * initialization would take too long in comparison to the rest of the
!    * search time.
!    * To get around this problem when searching for smaller strings we only
!    * use part of the skiptable, the larger the search (or at least the longer
!    * we estimate that it will take) the more we use of the skip table.
!    *
!    * After quite a few benchmarks a rule of thumb seems to show that the
!    * best skip table size is around 10 times less than the length of the
!    * haystack. Of course this is not an exact science. The needle length
!    * counts and also the mix of characters in both the needle and the haystack.
!    * It might look like a good idea to make the skip table around the same size
!    * as the needle, however that likely means more collisions or possibly no 
!    * empty positions in the table. That would mean we'd never be able to
!    * skip the length of the needle.
!    * 
!    * We skiptable[<char value> & hashsize] to tell us how many characters
!    * we should skip ahead in the haystack when we don't find a match.
!    * It does not matter that it's a lossy hash check as we're re-checking the
!    * actual character value again once the loop reiterates.
!    *
!    * The above method works both for char and pg_wchar. Perhaps with UTF8
!    * the character mix will be greater.
!    *
!    * The following code calculates the hashsize. Notice that for small 
!    * searches we're bearly using any of the skiptable. We also allocate
!    * the hash size to smaller searches first to allow them to get on 
!    * searching...
!    *
!    */
! 
!   if ((state->len2 - start_pos) < 16)
!     hashsize = 3;
!   else if ((state->len2 - start_pos) < 64)
!     hashsize = 7;
!   else if ((state->len2 - start_pos) < 128)
!     hashsize = 15;
!   else if ((state->len2 - start_pos) < 512)
!     hashsize = 31;
!   else if ((state->len2 - start_pos) < 2048)
!     hashsize = 63;
!   else if ((state->len2 - start_pos) < 4096)
!     hashsize = 127;
!   else
!     hashsize = 255;
! 
!  
!   /* Initalize the skip table. We set all elements to the needle length */
!   for (ai = 0; ai <= hashsize; ai++)
!     skiptable[ai] = state->len2;
  
  
! 	if (!state->use_wchar)
!   {
!     char *nptr;
!     char *hptr;
!     char *p;
!     char *needle = state->str2;
!     char *haystack = state->str1;
! 
!     /* Here we process the needle marking the last occurence
!      * of each character (ignoring the very last character)
!      */
!     for (ai = 0; ai < state->len2 - 1; ai++)
!       skiptable[(unsigned char) needle[ai] & hashsize] = state->len2 - ai - 1;
! 
!   
!     /* Start at startpos plus the length of the needle */
!     hptr = &haystack[state->len2 - 1 + start_pos];
! 
!     while (hptr < &haystack[state->len1]) 
!     {
!       nptr = &needle[state->len2 - 1]; /* Point to the end of the needle */
!       p = hptr;
! 
!       while (*nptr == *p) 
!       {
!         /* Do we have it? Return 1 based array pos */
!         if (nptr-- == needle) 
!           return p - haystack + 1; 
! 
!         p--;
!       }
! 
!       /* Ask the skiptable where to look next. If it
!        * finds a match then we align the two ~matching~
!        * characters and start another search there.
!        * Else we skip a whole needle length.
!        * Of course the ~matching~ characters only have the
!        * same hash value, the character value may be different.
!        */
! 
!       hptr += skiptable[(unsigned char) *hptr & hashsize];
!     }
! 
!     return 0; /* Not found */
! 
!   } 
!   
!   else
!   { 
!     /* The multibyte char version. This works exactly the same way.
!      */
!     pg_wchar *nptr;
!     pg_wchar *hptr;
!     pg_wchar *p;
!     pg_wchar *needle = state->wstr2;
!     pg_wchar *haystack = state->wstr1;
! 
!     /* Here we process the needle marking the last occurence
!      * of each character (ignoring the very last character)
!      */
!     for (ai = 0; ai < state->len2 - 1; ai++)
!       skiptable[needle[ai] & hashsize] = state->len2 - ai - 1;
! 
! 
!     /* Start at start_pos plus the length of the needle */
!     hptr = &haystack[state->len2 - 1 + start_pos];
! 
!     while (hptr < &haystack[state->len1]) 
!     {
!       nptr = &needle[state->len2 - 1]; /* Point to the end of the needle */
!       p = hptr;
! 
!       while (*nptr == *p) 
!       {
!   
!         /* Do we have it? Return 1 based array pos */
!         if (nptr-- == needle) 
!           return p - haystack + 1; 
! 
!         p--;
!       }
!  
!       /* Ask the skiptable where to look next. If it
!        * finds a match then we align the two ~matching~
!        * characters and start another search there.
!        * Else we skip a whole needle length.
!        * Of course the ~matching~ characters only have the
!        * same hash value, the character value may be different.
!        */
  
!        hptr += skiptable[*hptr & hashsize];
!     }
  
!     return 0; /* Not found */
!   }
  }
  
  static void

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 17 years ago

In reply to: David Rowley (#3)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

After reading the wikipedia article on Boyer-Moore search algorithm, it
looks to me like this patch actually implements the simpler
Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
probably fine, as it ought to be faster on small needles and haystacks
because it requires less effort to build the lookup tables, even though
the worst-case performance is worse. It should still be faster than what
we have now.

The skip table really should be constructed only once in
text_position_start and stored in TextPositionState. That would make a
big difference to the performance of those functions that call
text_position_next repeatedly: replace_text, split_text and text_to_array.

David Rowley wrote:

I've done some more revisions to the patch. This has mostly just involved
tuning the skip table size based on the size of the search. This has
basically involved lots of benchmarks with different strings and calculating
the best size of table to use. The reason for this is to maintain fast
searches for smaller strings. The overhead of initialising a 256 element
array would probably out weigh the cost of the search if this were not done.
The size of the skip table increases with longer strings, or rather the size
that is utilised.

Performance:
For smaller searches performance of the patch and existing version are very
similar. The patched version starts to out perform the existing version when
the needle and haystack become larger.

The patch wins hands down with searches that leads the existing function in
to dead ends, for example:

SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA');

When searching for very small strings, say just a single character in a
sentence the existing function marginally beats the patched version.

Outside of Postgres I've done benchmarks where I've searched for every
combination of the search string in the search string. Like:

test | t |
test | te |
test | tes |
test | test |
test | e |
test | es |
test | est |
test | s |
test | st |
test | t |

I felt this was fair for both versions. The patched version beat the
unpatched version. The test I carried out was a string of 934 characters.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Heikki Linnakangas (#4)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

After reading the wikipedia article on Boyer-Moore search algorithm, it
looks to me like this patch actually implements the simpler
Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
probably fine, as it ought to be faster on small needles and haystacks
because it requires less effort to build the lookup tables, even though
the worst-case performance is worse. It should still be faster than what
we have now.

Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N).
Maybe we should build the second table when M is large? Although
realistically that is probably gilding the lily, since frankly there
haven't been many real-world complaints about the speed of these
functions anyway ...

The skip table really should be constructed only once in
text_position_start and stored in TextPositionState. That would make a
big difference to the performance of those functions that call
text_position_next repeatedly: replace_text, split_text and text_to_array.

+1. The heuristic about big a skip table to make may need some
adjustment as well, since it seems to be considering only a single
search.

regards, tom lane

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Heikki Linnakangas (#4)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Heikki Linnakangas wrote:

After reading the wikipedia article on Boyer-Moore search algorithm, it
looks to me like this patch actually implements the simpler
Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
probably fine, as it ought to be faster on small needles and haystacks
because it requires less effort to build the lookup tables, even though
the worst-case performance is worse. It should still be faster than what
we have now.

Yes, correct, I really didn't want to slow down smaller searches. This
method seemed to fit the bill better. What I didn't realise is that this
method also had a name.

Tom Lane wrote:

Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N). Maybe we
should build the second table when M is large?

I'll look into this. If it pays off for longer searches I'll submit a patch.
I won't have the time until after the 15th, so perhaps that's in November's
commit fest?

The skip table really should be constructed only once in
text_position_start and stored in TextPositionState. That would make a
big difference to the performance of those functions that call
text_position_next repeatedly: replace_text, split_text and text_to_array.

Of course you are right. That will help for replace and the like. I'll
update the patch tonight.

Thanks both for the feedback.

David.

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David Rowley (#6)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

"David Rowley" <dgrowley@gmail.com> writes:

Tom Lane wrote:

Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N). Maybe we
should build the second table when M is large?

I'll look into this. If it pays off for longer searches I'll submit a patch.
I won't have the time until after the 15th, so perhaps that's in November's
commit fest?

Yes. Let's get B-M-H in during this fest and then you can look into
whether a follow-on patch is worthwhile.

regards, tom lane

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 17 years ago

In reply to: Tom Lane (#7)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Tom Lane wrote:

"David Rowley" <dgrowley@gmail.com> writes:

Tom Lane wrote:

Hmm. B-M-H has worst case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N). Maybe we
should build the second table when M is large?

I'll look into this. If it pays off for longer searches I'll submit a patch.
I won't have the time until after the 15th, so perhaps that's in November's
commit fest?

Yes. Let's get B-M-H in during this fest and then you can look into
whether a follow-on patch is worthwhile.

Agreed.

Also, it would be nice to use B-M(-H) for LIKE as well. That's a
different codepath, but probably more widely used than textpos and
frients. I haven't looked how feasible it would be to do, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Heikki Linnakangas (#8)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Also, it would be nice to use B-M(-H) for LIKE as well.

Right offhand, that seems impossible, at least in patterns with %.
Or were you thinking of trying to separate out the fixed substrings
of a pattern and search for them with BMH?

Anyway, it's not material for this patch, since it'd involve pretty
fundamental redesign of the LIKE code.

regards, tom lane

#10

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Tom Lane (#5)

1 attachment(s)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Heikki Linnakangas wrote:

The skip table really should be constructed only once in
text_position_start and stored in TextPositionState. That would make a
big difference to the performance of those functions that call
text_position_next repeatedly: replace_text, split_text and text_to_array.

I Wrote:

Of course you are right. That will help for replace and the like. I'll
update the patch tonight.

I've made and attached the changes Heikki recommended.
Also updated benchmark spreadsheet. Here ->
http://www.unixbeast.com/~fat/8.3_test_v1.2.xls

Previously there was an error with "Test 6", the test that benchmarked
replace(). I kept this one so not to affect the summary result in the new
sheet. I then added the sheet "Replace Test" to show more accurate results.
I had failed to notice that the optimizer was helping me out more than I
wanted it to.

My tested replace() script runs in 91% of the time than the 8.3 version.
I've not tested with the CVS head.

Now that the skip table is a member of TextPositionState, I was not quite
sure if I should #define a size for it. It would certainly look neater, only
the code that defines the skip table size in text_position_start assumes 256
elements.

Any thoughts on this?

David.

Attachments:

boyer-moore-strpos_v1.2.patchapplication/octet-stream; name=boyer-moore-strpos_v1.2.patchDownload

*** varlena.c	2008-08-31 20:31:56.000000000 +0100
--- varlena.c	2008-09-05 19:41:04.000000000 +0100
***************
*** 40,45 ****
--- 40,48 ----
  	pg_wchar   *wstr2;			/* note: these are palloc'd */
  	int			len1;			/* string lengths in logical characters */
  	int			len2;
+ 	int	skiptable[256];			/* For boyer-moore searching */
+ 	int	skiptablesize;			/* How much of the skiptable do we utilize */
+ 
  } TextPositionState;
  
  #define DatumGetUnknownP(X)			((unknown *) PG_DETOAST_DATUM(X))
***************
*** 804,816 ****
   * called multiple times with increasing values of start_pos, which is
   * the 1-based character position to start the search from.  The "state"
   * variable is normally just a local variable in the caller.
   */
- 
  static void
  text_position_setup(text *t1, text *t2, TextPositionState *state)
  {
  	int			len1 = VARSIZE_ANY_EXHDR(t1);
  	int			len2 = VARSIZE_ANY_EXHDR(t2);
  
  	if (pg_database_encoding_max_length() == 1)
  	{
--- 807,821 ----
   * called multiple times with increasing values of start_pos, which is
   * the 1-based character position to start the search from.  The "state"
   * variable is normally just a local variable in the caller.
+  *
+  * David Rowley: 2008-09-04 -- Updated for Boyer-Moore searching
   */
  static void
  text_position_setup(text *t1, text *t2, TextPositionState *state)
  {
  	int			len1 = VARSIZE_ANY_EXHDR(t1);
  	int			len2 = VARSIZE_ANY_EXHDR(t2);
+ 	int			ai;
  
  	if (pg_database_encoding_max_length() == 1)
  	{
***************
*** 820,825 ****
--- 825,880 ----
  		state->str2 = VARDATA_ANY(t2);
  		state->len1 = len1;
  		state->len2 = len2;
+ 
+ 		/* Here we must determine how much of the skip table we should use.
+ 		 * With haystack lengths of only a few characters we don't really want
+ 		 * to have to initalize the full 256 elements of the table. Such an
+ 		 * initialization would take too long in comparison to the rest of the
+ 		 * search time.
+ 		 * To get around this problem when searching for smaller strings we only
+ 		 * use part of the skiptable, the larger the search (or at least the longer
+ 		 * we estimate that it will take) the more we use of the skip table.
+ 		 *
+ 		 * After quite a few benchmarks a rule of thumb seems to show that the
+ 		 * best skip table size is around 10 times less than the length of the
+ 		 * haystack. Of course this is not an exact science. The needle length
+ 		 * counts and also the mix of characters in both the needle and the haystack.
+ 		 * It might look like a good idea to make the skip table around the same size
+ 		 * as the needle, however that likely means more collisions or possibly no 
+ 		 * empty positions in the table. That would mean we'd never be able to
+ 		 * skip the length of the needle.
+ 		 * 
+ 		 * The following code calculates the skip table's size. Notice that for small 
+ 		 * searches we're bearly using any of the skiptable. 
+ 		 *
+ 		 */
+ 
+ 		if (state->len2 < 16)
+ 			state->skiptablesize = 3;
+ 		else if (state->len2 < 64)
+ 			state->skiptablesize = 7;
+ 		else if (state->len2 < 128)
+ 			state->skiptablesize = 15;
+ 		else if (state->len2 < 512)
+ 			state->skiptablesize = 31;
+ 		else if (state->len2 < 2048)
+ 			state->skiptablesize = 63;
+ 		else if (state->len2 < 4096)
+ 			state->skiptablesize = 127;
+ 		else
+ 			state->skiptablesize = 255;
+ 
+ 		/* Initalize the skip table. We set all elements to the needle length */
+ 		for (ai = 0; ai <= state->skiptablesize; ai++)
+ 			state->skiptable[ai] = state->len2;
+ 
+ 		/* Here we process the needle marking the last occurence
+ 		 * of each character (ignoring the very last character)
+ 		 */
+ 		for (ai = 0; ai < state->len2 - 1; ai++)
+ 			state->skiptable[(unsigned char) state->str2[ai] & state->skiptablesize] = state->len2 - ai - 1;
+ 
+ 
  	}
  	else
  	{
***************
*** 837,903 ****
  		state->wstr2 = p2;
  		state->len1 = len1;
  		state->len2 = len2;
  	}
  }
  
! static int
  text_position_next(int start_pos, TextPositionState *state)
  {
! 	int			pos = 0,
! 				p,
! 				px;
  
  	Assert(start_pos > 0);		/* else caller error */
  
- 	if (state->len2 <= 0)
- 		return start_pos;		/* result for empty pattern */
  
! 	if (!state->use_wchar)
! 	{
! 		/* simple case - single byte encoding */
! 		char	   *p1 = state->str1;
! 		char	   *p2 = state->str2;
  
- 		/* no use in searching str past point where search_str will fit */
- 		px = (state->len1 - state->len2);
  
! 		p1 += start_pos - 1;
  
- 		for (p = start_pos - 1; p <= px; p++)
- 		{
- 			if ((*p1 == *p2) && (strncmp(p1, p2, state->len2) == 0))
- 			{
- 				pos = p + 1;
- 				break;
- 			}
- 			p1++;
- 		}
- 	}
- 	else
- 	{
- 		/* not as simple - multibyte encoding */
- 		pg_wchar   *p1 = state->wstr1;
- 		pg_wchar   *p2 = state->wstr2;
  
! 		/* no use in searching str past point where search_str will fit */
! 		px = (state->len1 - state->len2);
  
- 		p1 += start_pos - 1;
  
! 		for (p = start_pos - 1; p <= px; p++)
! 		{
! 			if ((*p1 == *p2) && (pg_wchar_strncmp(p1, p2, state->len2) == 0))
! 			{
! 				pos = p + 1;
! 				break;
! 			}
! 			p1++;
! 		}
! 	}
  
! 	return pos;
  }
  
  static void
  text_position_cleanup(TextPositionState *state)
  {
--- 892,1043 ----
  		state->wstr2 = p2;
  		state->len1 = len1;
  		state->len2 = len2;
+ 
+ 
+ 		if (state->len2 < 16)
+ 			state->skiptablesize = 3;
+ 		else if (state->len2 < 64)
+ 			state->skiptablesize = 7;
+ 		else if (state->len2 < 128)
+ 			state->skiptablesize = 15;
+ 		else if (state->len2 < 512)
+ 			state->skiptablesize = 31;
+ 		else if (state->len2 < 2048)
+ 			state->skiptablesize = 63;
+ 		else if (state->len2 < 4096)
+ 			state->skiptablesize = 127;
+ 		else
+ 			state->skiptablesize = 255;
+ 
+ 		/* Initalize the skip table. We set all elements to the needle length */
+ 		for (ai = 0; ai <= state->skiptablesize; ai++)
+ 			state->skiptable[ai] = state->len2;
+ 
+ 		/* Here we process the needle marking the last occurence
+ 		 * of each character (ignoring the very last character)
+ 		 */
+ 		for (ai = 0; ai < state->len2 - 1; ai++)
+ 			state->skiptable[state->wstr2[ai] & state->skiptablesize] = state->len2 - ai - 1;
+ 
  	}
  }
  
! /* text_position_next
!  * David Rowley 2008-09-05
!  * Uses Boyer-Moore searching
!  */
! int
  text_position_next(int start_pos, TextPositionState *state)
  {
!   /*
!    * state->len2 = needle length in chars 
!    * state->len1 = haystack length in chars 
!    */
  
  	Assert(start_pos > 0);		/* else caller error */
  
  
!   if (state->len2 <= 0)
! 		return start_pos;		/* Empty needle, found it! */
  
  
!   /* Changed for zero based arrays */
!   start_pos--;
  
  
!   /* Eliminate the impossible first, the needle is
!    * too big for the haystack.
!    */
!   if (state->len1 + start_pos < state->len2)
!     return 0;
  
  
! 	if (!state->use_wchar)
!   {
!     char *nptr;
!     char *hptr;
!     char *p;
!     char *needle = state->str2;
!     char *haystack = state->str1;
! 
!  
!     /* Start at startpos plus the length of the needle */
!     hptr = &haystack[state->len2 - 1 + start_pos];
! 
!     while (hptr < &haystack[state->len1]) 
!     {
!       nptr = &needle[state->len2 - 1]; /* Point to the end of the needle */
!       p = hptr;
! 
!       while (*nptr == *p) 
!       {
!         /* Do we have it? Return 1 based array pos */
!         if (nptr-- == needle) 
!           return p - haystack + 1; 
! 
!         p--;
!       }
! 
!       /* Ask the skiptable where to look next. If it
!        * finds a match then we align the two ~matching~
!        * characters and start another search there.
!        * Else we skip a whole needle length.
!        * Of course the ~matching~ characters only have the
!        * same hash value, the character value may be different.
!        */
! 
!       hptr += state->skiptable[(unsigned char) *hptr & state->skiptablesize];
!     }
! 
!     return 0; /* Not found */
! 
!   } 
!   
!   else
!   { 
!     /* The multibyte char version. This works exactly the same way.
!      */
!     pg_wchar *nptr;
!     pg_wchar *hptr;
!     pg_wchar *p;
!     pg_wchar *needle = state->wstr2;
!     pg_wchar *haystack = state->wstr1;
! 
!   
!     /* Start at start_pos plus the length of the needle */
!     hptr = &haystack[state->len2 - 1 + start_pos];
! 
!     while (hptr < &haystack[state->len1]) 
!     {
!       nptr = &needle[state->len2 - 1]; /* Point to the end of the needle */
!       p = hptr;
! 
!       while (*nptr == *p) 
!       {
!         /* Do we have it? Return 1 based array pos */
!         if (nptr-- == needle) 
!           return p - haystack + 1; 
! 
!         p--;
!       }
!  
!       /* Ask the skiptable where to look next. If it
!        * finds a match then we align the two ~matching~
!        * characters and start another search there.
!        * Else we skip a whole needle length.
!        * Of course the ~matching~ characters only have the
!        * same hash value, the character value may be different.
!        */
! 
!        hptr += state->skiptable[*hptr & state->skiptablesize];
!     }
  
!     return 0; /* Not found */
!   }
  }
  
+ 
+ 
  static void
  text_position_cleanup(TextPositionState *state)
  {

#11

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David Rowley (#10)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

"David Rowley" <dgrowley@gmail.com> writes:

I've made and attached the changes Heikki recommended.

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle. Isn't that a bug? Was the same bug present in the tests you
made to determine the best table sizes?

Stylistic issues:

I really dislike having two copies of the heuristic size-setting code.
I think it's worth introducing a second test of use_wchar in order to
arrange text_position_setup like this:

... existing code ...

choose skiptablesize

initialize skip table (this loop doesn't depend on char width)

if (!state->use_wchar)
process char needle
else
process wide-char needle

Also it might be worth having local-variable copies of skiptablesize and
len2 for this initialization loop:

for (ai = 0; ai <= state->skiptablesize; ai++)
state->skiptable[ai] = state->len2;

I'm not at all sure whether a C compiler will think that it's allowed to
avoid fetching these values again on each iteration, seeing that the
loop is storing into other integer members of *state. Given that this
loop is exactly the overhead we're trying to control by adjusting
skiptablesize, making it as tight as possible seems worthwhile.

Now that the skip table is a member of TextPositionState, I was not quite
sure if I should #define a size for it. It would certainly look neater, only
the code that defines the skip table size in text_position_start assumes 256
elements.

I tend to agree that a #define is pointless there, since there's no easy
way to make the relevant code do anything with it. It would be
worthwhile to point out adjacent to the code what the max length is,
though. I had gotten as far as rewriting your introductory comment like
this

/*
* Prepare the skip table for Boyer-Moore-Horspool searching. First we
* must determine how big a skip table to use. The declaration of
* TextPositionState allows up to 256 elements, but with haystack lengths
* of only a few characters we don't really want to have to initialize so
* many elements --- it would take too long in comparison to the actual
* search time. So we choose a useful skip table size based on the
* haystack length.

before noticing that what I was writing wasn't what the code actually
did. Another point that doesn't seem to be included in your comments is
that the skip table size *must* be a power of 2 because we use bit
masking. (In fact state->skiptablesize might better be named skiptablemask
since it's really 1 less than the table size.)

BTW, try to avoid introducing so much random vertical space. IMHO it
does little for readability and much to make routines too long to fit in
one screenful. Generally the PG code doesn't use double blank lines
anywhere inside a function, nor blank lines before a right brace line.
pg_indent will clean this up to some extent, but it would be better if
the code looked more like the project style in the first place.

regards, tom lane

#12

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Tom Lane (#11)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

I wrote:

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle. Isn't that a bug? Was the same bug present in the tests you
made to determine the best table sizes?

BTW, to the extent that you feel like testing a different idea,
I would suggest:

* don't bother initializing the skiptable when len1 < len2

* otherwise, choose its size based on len1 - len2, not just len1 or
len2. This is (one less than) the maximum number of search loop
consultations of the skip table that can happen, so it seems like a
plausible number, and better than either length alone.

regards, tom lane

#13

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Tom Lane (#12)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Tom Lane wrote:

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle. Isn't that a bug? Was the same bug present in the tests you
made to determine the best table sizes?

Yes Bug. That's what I get for making last minute changes. It didn't affect
the benchmark, that was done before moving the code into postgresql for
testing. The function I tested with had an extra parameter to set the skip
table size. Each possible size was tested and the best time was saved along
with the best size.

BTW, to the extent that you feel like testing a different idea,
I would suggest:

* don't bother initializing the skiptable when len1 < len2

Good plan.

* otherwise, choose its size based on len1 - len2, not just len1 or
len2. This is (one less than) the maximum number of search loop
consultations of the skip table that can happen, so it seems like a
plausible number, and better than either length alone.

That seems like a better idea. I had considered len1 * len2, giving that's
the worst case for BMH. Of course the lengths would need to be raised quite
a bit. I'll go with len1 - len2

I'll make the above changes and then run my benchmark again to see if the
skip table size logic should be updated.

I'll also benchmark and update the benchmark spreadsheet to see what's
changed.

David.

#14

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Tom Lane (#12)

1 attachment(s)

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

<<<Moved from pgsql-patches>>>

Tom Lane wrote:

I wrote:

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle. Isn't that a bug? Was the same bug present in the tests you
made to determine the best table sizes?

BTW, to the extent that you feel like testing a different idea,
I would suggest:

* don't bother initializing the skiptable when len1 < len2

* otherwise, choose its size based on len1 - len2, not just len1 or
len2. This is (one less than) the maximum number of search loop
consultations of the skip table that can happen, so it seems like a
plausible number, and better than either length alone.

I've made the discussed changes. Also updated the benchmark results.
http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

On re-benchmarking to determine the best skip table size, the changes to the
logic didn't seem to affect the results enough to have to change the
thresholds. But I'm sure it will perform a little better in cases like when
both the needle and haystack are very long and close to being the same
length. Having a big skip table in this case is a little bit of a waste as
the search won't get to make much use of it.

Attachments:

boyer-moore-strpos_v1.3.patchapplication/octet-stream; name=boyer-moore-strpos_v1.3.patchDownload

*** varlena.c	2008-08-31 20:31:56.000000000 +0100
--- varlena.c	2008-09-07 02:47:52.000000000 +0100
***************
*** 40,45 ****
--- 40,47 ----
  	pg_wchar   *wstr2;			/* note: these are palloc'd */
  	int			len1;			/* string lengths in logical characters */
  	int			len2;
+ 	int			skiptable[256];	/* for Boyer-Moore-Horspool searching, must be 256 elements. */
+ 	int			skiptablemask;	/* mask for ANDing with skiptable */
  } TextPositionState;
  
  #define DatumGetUnknownP(X)			((unknown *) PG_DETOAST_DATUM(X))
***************
*** 795,800 ****
--- 797,803 ----
  	return result;
  }
  
+ 
  /*
   * text_position_setup, text_position_next, text_position_cleanup -
   *	Component steps of text_position()
***************
*** 804,811 ****
   * called multiple times with increasing values of start_pos, which is
   * the 1-based character position to start the search from.  The "state"
   * variable is normally just a local variable in the caller.
   */
- 
  static void
  text_position_setup(text *t1, text *t2, TextPositionState *state)
  {
--- 807,815 ----
   * called multiple times with increasing values of start_pos, which is
   * the 1-based character position to start the search from.  The "state"
   * variable is normally just a local variable in the caller.
+  *
+  * David Rowley: 2008-09-06 -- Updated for Boyer-Moore-Horspool searching
   */
  static void
  text_position_setup(text *t1, text *t2, TextPositionState *state)
  {
***************
*** 838,902 ****
  		state->len1 = len1;
  		state->len2 = len2;
  	}
! }
! 
! static int
  text_position_next(int start_pos, TextPositionState *state)
  {
! 	int			pos = 0,
! 				p,
! 				px;
! 
! 	Assert(start_pos > 0);		/* else caller error */
! 
! 	if (state->len2 <= 0)
! 		return start_pos;		/* result for empty pattern */
! 
! 	if (!state->use_wchar)
! 	{
! 		/* simple case - single byte encoding */
! 		char	   *p1 = state->str1;
! 		char	   *p2 = state->str2;
! 
! 		/* no use in searching str past point where search_str will fit */
! 		px = (state->len1 - state->len2);
! 
! 		p1 += start_pos - 1;
! 
! 		for (p = start_pos - 1; p <= px; p++)
! 		{
! 			if ((*p1 == *p2) && (strncmp(p1, p2, state->len2) == 0))
! 			{
! 				pos = p + 1;
! 				break;
! 			}
! 			p1++;
! 		}
! 	}
! 	else
! 	{
! 		/* not as simple - multibyte encoding */
! 		pg_wchar   *p1 = state->wstr1;
! 		pg_wchar   *p2 = state->wstr2;
! 
! 		/* no use in searching str past point where search_str will fit */
! 		px = (state->len1 - state->len2);
  
- 		p1 += start_pos - 1;
  
- 		for (p = start_pos - 1; p <= px; p++)
- 		{
- 			if ((*p1 == *p2) && (pg_wchar_strncmp(p1, p2, state->len2) == 0))
- 			{
- 				pos = p + 1;
- 				break;
- 			}
- 			p1++;
- 		}
- 	}
  
- 	return pos;
- }
  
  static void
  text_position_cleanup(TextPositionState *state)
--- 842,1005 ----
  		state->len1 = len1;
  		state->len2 = len2;
  	}
!   /*
!    * If the needle is bigger than the haystack then there
!    * is no point in wasting cycles initalizing anything else.
!    */
!   if (state->len2 < state->len1)
!   {
!     int     ai;
!     int     tablemask;
!     int     lneedle; 
! 
!     /* Prepare the skip table for Boyer-Moore-Horspool searching. First we
!      * must determine how much of the skip table to use.  The declaration of
!      * TextPositionState allows up to 256 elements, but with haystack lengths
!      * of only a few characters we don't really want to have to initialize so
!      * many elements --- it would take too long in comparison to the actual
!      * search time.  So we choose a useful skip table size based on the
!      * haystack length minus the needle length. The closer the needle length
!      * is to the haystack length the less useful skipping becomes.
!      * 
!      * Note: tablemask must be N^2-1 and no more than 255. 
!      */
!     if ((state->len1 - state->len2) < 16)
!       tablemask = 3;
!     else if ((state->len1 - state->len2) < 64)
!       tablemask = 7;
!     else if ((state->len1 - state->len2) < 128)
!       tablemask = 15;
!     else if ((state->len1 - state->len2) < 512)
!       tablemask = 31;
!     else if ((state->len1 - state->len2) < 2048)
!       tablemask = 63;
!     else if ((state->len1 - state->len2) < 4096)
!       tablemask = 127;
!     else
!       tablemask = 255;
! 
!     lneedle = state->len2;
!     state->skiptablemask = tablemask;
! 
!     /* Initalize the skip table. We set all elements to the needle length */
!     for (ai = 0; ai <= tablemask; ai++)
!       state->skiptable[ai] = lneedle;
! 
!     /* We do not process the last character in the needle */
!     lneedle--;
! 
!     if (!state->use_wchar) 
!     {
!       for (ai = 0; ai < lneedle; ai++)
!         state->skiptable[(unsigned char) state->str2[ai] & tablemask] = lneedle - ai;
!     }
!     else 
!     {
!       for (ai = 0; ai < lneedle; ai++)
!         state->skiptable[state->wstr2[ai] & tablemask] = lneedle - ai;
!     }
!   }
! }
! 
! /* text_position_next
!  * David Rowley 2008-09-06
!  * Uses Boyer-Moore-Horspool searching
!  */
! int
  text_position_next(int start_pos, TextPositionState *state)
  {
!   int haystack_len = state->len1;
!   int needle_len = state->len2;
!   int skipmask = state->skiptablemask;
! 
!   Assert(start_pos > 0);		/* else caller error */
! 
!   if (needle_len <= 0)
! 		return start_pos;		/* Empty needle */
! 
!   start_pos--; /* Change for zero based arrays */
!   /*
!    * Eliminate the impossible first. The needle is
!    * too big for the haystack. We've likely not setup
!    * the skip table when this happens anyway.
!    */
!   if (haystack_len + start_pos < needle_len)
!     return 0;
! 
!   Assert((skipmask & 255) == skipmask && skipmask >= 3); /* Must be N^2-1, min size 3 */
! 
!   if (!state->use_wchar)
!   {
!     char *nptr;
!     char *hptr;
!     char *p;
!     char *needle = state->str2;
!     char *haystack = state->str1;
!  
!     /* Start at startpos plus the length of the needle */
!     hptr = &haystack[needle_len - 1 + start_pos];
!     while (hptr < &haystack[haystack_len]) 
!     {
!       nptr = &needle[needle_len - 1]; /* Point to the end of the needle */
!       p = hptr;
! 
!       while (*nptr == *p) 
!       {
!         /* Do we have it? Return 1 based array pos */
!         if (nptr-- == needle) 
!           return p - haystack + 1; 
!         p--;
!       }
!       /* Ask the skiptable where to look next. If it
!        * finds a match then we align the two ~matching~
!        * characters and start another search there.
!        * Else we skip a whole needle length.
!        * Of course the ~matching~ characters only have the
!        * same hash value, the character value may be different.
!        */
!       hptr += state->skiptable[(unsigned char) *hptr & skipmask];
!     }
!     return 0; /* Not found */
!   } 
!   else
!   { 
!     /* The multibyte char version. This works exactly the same way. */
!     pg_wchar *nptr;
!     pg_wchar *hptr;
!     pg_wchar *p;
!     pg_wchar *needle = state->wstr2;
!     pg_wchar *haystack = state->wstr1;
!   
!     /* Start at start_pos plus the length of the needle */
!     hptr = &haystack[needle_len - 1 + start_pos];
!     while (hptr < &haystack[haystack_len]) 
!     {
!       nptr = &needle[needle_len - 1]; /* Point to the end of the needle */
!       p = hptr;
! 
!       while (*nptr == *p) 
!       {
!         /* Do we have it? Return 1 based array pos */
!         if (nptr-- == needle) 
!           return p - haystack + 1; 
!         p--;
!       }
!       /* Ask the skiptable where to look next. If it
!        * finds a match then we align the two ~matching~
!        * characters and start another search there.
!        * Else we skip a whole needle length.
!        * Of course the ~matching~ characters only have the
!        * same hash value, the character value may be different.
!        */
!        hptr += state->skiptable[*hptr & skipmask];
!     }
!     return 0; /* Not found */
!   }
! }
!  
  
  
  
  
  static void
  text_position_cleanup(TextPositionState *state)

#15

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: David Rowley (#14)

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

"David Rowley" <dgrowley@gmail.com> writes:

I've made the discussed changes. Also updated the benchmark results.
http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

Applied with revisions; mostly cosmetic except for one point. I
realized after studying the code a bit more that B-M cannot possibly win
for a single-character pattern (needle), since the skip distance must
always be 1 in that case. The fact that it seemed to keep up at that
length has to be because the original coding included a strncmp call
inside the innermost loop, which likely prevents the compiler from
optimizing that loop really tightly. But the strncmp wasn't doing
anything anyway for the case of pattern length = 1. So what I committed
special-cases pattern length 1 to be a naive search with a *very* tight
inner loop. I think it's worth troubling over this case because a
common usage is split_to_array and suchlike with single-character
delimiters.

regards, tom lane

#16

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Tom Lane (#15)

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

Tom Lane Wrote:

"David Rowley" <dgrowley@gmail.com> writes:

I've made the discussed changes. Also updated the benchmark results.
http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

Applied with revisions; mostly cosmetic except for one point. I
realized after studying the code a bit more that B-M cannot possibly win
for a single-character pattern (needle), since the skip distance must
always be 1 in that case. The fact that it seemed to keep up at that
length has to be because the original coding included a strncmp call
inside the innermost loop, which likely prevents the compiler from
optimizing that loop really tightly. But the strncmp wasn't doing
anything anyway for the case of pattern length = 1. So what I committed
special-cases pattern length 1 to be a naive search with a *very* tight
inner loop. I think it's worth troubling over this case because a
common usage is split_to_array and suchlike with single-character
delimiters.

I had thought about this, but failed to think about string_to_array.
Probably worth the extra code for that.

Well simple enough patch. I've learned quite a bit about the postgresql
source even just for working in varlena.c. I've got a very long way to go
though.

Thanks for all the reviews and suggestions.

David.

#17

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 17 years ago

In reply to: Tom Lane (#9)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Also, it would be nice to use B-M(-H) for LIKE as well.

Right offhand, that seems impossible, at least in patterns with %.
Or were you thinking of trying to separate out the fixed substrings
of a pattern and search for them with BMH?

Yep, something like that. Even if it only handled the special case of
'%foobar%', that would be nice, because that's a pretty common special case.

Anyway, it's not material for this patch, since it'd involve pretty
fundamental redesign of the LIKE code.

Yep.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 17 years ago

In reply to: David Rowley (#16)

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

David Rowley wrote:

Thanks for all the reviews and suggestions.

David, could you re-run the performance tests you ran earlier? I'm
curious to know what impact switching to the simpler loop for 1-byte
pattern had.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Heikki Linnakangas (#18)

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

Heikki Linnakangas wrote:

David Rowley wrote:
Thanks for all the reviews and suggestions.

David, could you re-run the performance tests you ran earlier? I'm
curious to know what impact switching to the simpler loop for 1-byte
pattern had.

Sure: http://www.unixbeast.com/~fat/8.3_test_v1.4.xls

I tested with 8.3.3, and applied the patch updated by tom.

The thing that surprised me was that string_to_array didn't perform as well.
I expected single character searches to perform a little better. I can't
think why it would be slower now.

The strpos() test for the single char needle is taking 1.2% longer. I guess
that makes a little sense as we're now doing a new more length tests in this
case, but then we've eliminated the final single strncmp() for once it finds
that match. The strncmp would have check for nulls, check it's not exceeded
the compare length and check the chars actually match. I figured this would
have made up for that 1.2%. Seems not.

David.

-----Original Message-----
From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: 08 September 2008 08:50
To: David Rowley
Cc: 'Tom Lane'; 'Peter Eisentraut'; pgsql-hackers@postgresql.org
Subject: Re: [PATCHES] [HACKERS] TODO item: Implement Boyer-Moore searching
(First time hacker)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20

David Rowley

dgrowley@gmail.com

over 17 years ago

In reply to: Heikki Linnakangas (#17)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Heikki Linnakangas Wrote:

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Also, it would be nice to use B-M(-H) for LIKE as well.

Right offhand, that seems impossible, at least in patterns with %.
Or were you thinking of trying to separate out the fixed substrings
of a pattern and search for them with BMH?

Yep, something like that. Even if it only handled the special case of
'%foobar%', that would be nice, because that's a pretty common special
case.

It would be a quick test to check for % at either end. But we'd also need to
ensure there were none in the middle. That might slow it down. I guess while
looping over the inner chars checking for more %'s we could be building a
skiptable. We'd have to abort it if we found any %'s of course

I think with out giving it much thought _'s could be handled by BMH, these
could just be given a skip distance of 2. Only having a lossy skip table
throws that out the window without having a special check for _

David.

#21

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 17 years ago

In reply to: David Rowley (#19)

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

David Rowley wrote:

The thing that surprised me was that string_to_array didn't perform as well.
I expected single character searches to perform a little better. I can't
think why it would be slower now.

Yes, that's strange. I tried to reproduce that here, with a CVS snapshot
before the patch, and after. With quick testing with psql and \timing
and the same query you had in that spreadsheet, I couldn't see that kind
of performance degradation. Oprofile suggests that, on the contrary,
slightly less time is spent in text_position_next() after the patch, and
slightly more in text_position_setup(). Together they account for ~10%
of CPU time in both tests, so a small difference there would be
insignificant anyway in that test case.

I think we're fine.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com