tsearch2 dictionary for statute cites

Started by Kevin Grittner about 17 years ago | 16 messages | general
#1 Kevin Grittner
Kevin.Grittner@wicourts.gov

I broached this topic last year[1], but the project got tabled until
now; so I raise it again. We want to be able to search text
(extracted from character-based PDF files) which will contain legal
terms and statute cites, and we want to be able to do tsearch2
searches (under 8.3.recent). It's clear enough how to create a
dictionary to gracefully handle the legal terms, but I'm less sure
about the statute cites.

I got one response[2], which mentioned a prefix search in the 8.4
release, and provided a link to a Perl regular expression based
dictionary. I'm wondering if anyone has feedback on either of these
techniques, and whether they might work for our needs. I'm not sure I
adequately described our needs, so I'll fill that out a little more.

People are likely to search for statute cites, which tend to have a
hierarchical form. I'm not sure the prefix approach will work for
this. For example, there is a section 939.64 in the state statutes
dealing with commission of a crime while wearing a bulletproof
garment. If someone searches for that, they should find subsections
like 939.64(1) or 939.64(2) but not different sections which start
with the same characters like 939.641 (the section on concealing
identity) or 939.645 (the section on hate crimes). A search for
chapter 939 should return any of the above.

Of course, we want someone to be able to search on 939.64, 939.641,
and 939.645 and get documents which reference all of the above (i.e.,
to look for a document referring to a hate crime committed while
concealing identity and wearing a bulletproof garment).

Suggestions welcome on how to handle this user requirement.

-Kevin

[1]: http://archives.postgresql.org/pgsql-admin/2008-06/msg00033.php
[2]: http://archives.postgresql.org/pgsql-admin/2008-06/msg00034.php

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#1)
Re: tsearch2 dictionary for statute cites

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

People are likely to search for statute cites, which tend to have a
hierarchical form. I'm not sure the prefix approach will work for
this. For example, there is a section 939.64 in the state statutes
dealing with commission of a crime while wearing a bulletproof
garment. If someone searches for that, they should find subsections
like 939.64(1) or 939.64(2) but not different sections which start
with the same characters like 939.641 (the section on concealing
identity) or 939.645 (the section on hate crimes). A search for
chapter 939 should return any of the above.

I think what you need is a custom parser that treats these similarly to
hyphenated words. If I pretend that the dot is a hyphen I get matching
behavior that seems to meet all those requirements.

Unfortunately we don't seem to have any really easy way to plug in a
custom parser, other than copy-paste-modify the existing one which would
be a PITA from a maintenance standpoint. Perhaps you could pass the
texts and the queries through a regexp substitution that converts
digit-dot-digit to digit-dash-digit?

regards, tom lane
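
A minimal sketch of the substitution suggested above, written in Python purely for illustration (the thread contains no code for it, and inside PostgreSQL one would more likely reach for regexp_replace). The helper name dots_to_dashes is hypothetical, and the sketch only rewrites a dot that sits directly between two digits.

```python
import re

# Hypothetical helper, not from the thread: replace a dot that sits between
# two digits with a dash. Lookarounds leave the digits untouched, so every
# dot in a run such as "9.125.07" is rewritten.
_DIGIT_DOT_DIGIT = re.compile(r"(?<=\d)\.(?=\d)")

def dots_to_dashes(text: str) -> str:
    return _DIGIT_DOT_DIGIT.sub("-", text)

print(dots_to_dashes("See 939.64(1) and 9.125.07(4A)(3)."))
```

The same function would have to be applied to both the indexed text and the search string, since the idea depends on both sides being rewritten identically.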

#3 Oleg Bartunov
oleg@sai.msu.su
In reply to: Tom Lane (#2)
Re: tsearch2 dictionary for statute cites

On Tue, 10 Mar 2009, Tom Lane wrote:

I think what you need is a custom parser that treats these similarly to
hyphenated words. If I pretend that the dot is a hyphen I get matching
behavior that seems to meet all those requirements.

Unfortunately we don't seem to have any really easy way to plug in a
custom parser, other than copy-paste-modify the existing one which would
be a PITA from a maintenance standpoint. Perhaps you could pass the
texts and the queries through a regexp substitution that converts
digit-dot-digit to digit-dash-digit?

Perhaps, for 8.4, it's better to use prefix search; for example,
to_tsquery('939.645:*') will find what Kevin needs. The problem is with
the parser, so I'd preprocess the text before indexing to convert all
digit.digit(digit) to digit.digit.digit, which the parser recognizes as
a single lexeme of type 'version'. Here is just an illustration:

qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
tokid | token
-------+----------
8 | 939.64.1
12 |

btw, having 'version' it's possible to use dict_regex for 8.3.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
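
For readers who want to experiment with the preprocessing step outside the database, here is a rough Python mirror of the translate(..., '()', '. ') call in the illustration above. It is only a sketch: flatten_parens is a hypothetical name, and it assumes the same character-for-character mapping ("(" to ".", ")" to a space).

```python
# Mirror of the SQL translate(text, '()', '. ') call from the illustration:
# "(" becomes "." and ")" becomes a space. The function name is hypothetical.
_PAREN_TABLE = str.maketrans("()", ". ")

def flatten_parens(cite: str) -> str:
    return cite.translate(_PAREN_TABLE)

print(repr(flatten_parens("939.64(1)")))
print(repr(flatten_parens("947.013(1m)(a)")))
```

With a cite that has two parenthesized parts, the same mapping leaves a space in the middle of the result, so a character-for-character translation alone would probably not be enough for cites of that shape.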

#4 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Oleg Bartunov (#3)
Re: tsearch2 dictionary for statute cites

Oleg Bartunov <oleg@sai.msu.su> wrote:

On Tue, 10 Mar 2009, Tom Lane wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

People are likely to search for statute cites, which tend to have a
hierarchical form. I'm not sure the prefix approach will work for
this. For example, there is a section 939.64 in the state statutes
dealing with commission of a crime while wearing a bulletproof
garment. If someone searches for that, they should find subsections
like 939.64(1) or 939.64(2) but not different sections which start
with the same characters like 939.641 (the section on concealing
identity) or 939.645 (the section on hate crimes). A search for
chapter 939 should return any of the above.

Perhaps you could pass the texts and the queries through a regexp
substitution that converts digit-dot-digit to digit-dash-digit?

Perhaps, for 8.4, it's better to use prefix search; for example,
to_tsquery('939.645:*') will find what Kevin needs. The problem is with
the parser, so I'd preprocess the text before indexing to convert all
digit.digit(digit) to digit.digit.digit, which the parser recognizes as
a single lexeme of type 'version'. Here is just an illustration:

qq=# select * from ts_parse('default',translate('939.64(1)','()','. '));
tokid | token
-------+----------
8 | 939.64.1
12 |

btw, having 'version' it's possible to use dict_regex for 8.3.

Tom, Oleg: Thanks for the suggestions. Looks promising.

-Kevin

#5 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#2)
Re: tsearch2 dictionary for statute cites

Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

People are likely to search for statute cites, which tend to have a
hierarchical form.

I think what you need is a custom parser

I've just returned to this and after review have become convinced that
this is absolutely necessary; once the default parser has done its
work, figuring out the bounds of a statute cite would be next to
impossible. Examples of the kind of fun you can have labeling
statutes, ordinances, and rules should you ever get elected to public
office:

10-3-350.10(1)(k)
10.1(40)(d)1
10.40.040(c)(2)
100.525(2)(a)3
105-10.G(3)(a)
11.04C.3.R.(1)
8.961.41(cm)
9.125.07(4A)(3)
947.013(1m)(a)

In any of these, a search string which exactly matches something up to
(but not including) a dash, dot, or left paren should find that thing.

Unfortunately we don't seem to have any really easy way to plug in a
custom parser, other than copy-paste-modify the existing one which
would be a PITA from a maintenance standpoint.

I'm afraid I'm going to have to bite the bullet and do this anyway.
Any guidance on how to go about it may save me some time. Also, if
there is any way to do this which may be useful to others or integrate
into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.

-Kevin
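
One literal reading of the matching rule in this message (a match must end just before a dash, dot, or left paren, or at the end of the cite) can be sketched as below. This is illustrative Python, not code from the thread; leading_prefixes is a hypothetical name, and the choice of cut characters is taken directly from the sentence above.

```python
def leading_prefixes(cite: str) -> list[str]:
    """Return every leading portion of `cite` that ends just before a
    dash, dot, or left paren, followed by the full cite itself."""
    # Assumption: only "-", ".", and "(" act as cut points.
    cuts = [i for i, ch in enumerate(cite) if ch in "-.(" and i > 0]
    prefixes = [cite[:i] for i in cuts]
    prefixes.append(cite)
    return prefixes

for example in ("947.013(1m)(a)", "10-3-350.10(1)(k)"):
    print(example, "->", leading_prefixes(example))
```

Applied as written, a cite such as 11.04C.3.R.(1) also yields the prefix "11.04C.3.R." with a trailing dot, which a real implementation might want to drop or normalize.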

#6 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#2)
Re: tsearch2 dictionary for statute cites

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Perhaps you could pass the texts and the queries through a regexp
substitution that converts digit-dot-digit to digit-dash-digit?

This doesn't seem to get me anywhere. For cite '9.125.07(4A)(3)'
I got this:

select ts_debug('9-125-07-4A-3');
ts_debug
----------------------------------------------------------------
(uint,"Unsigned integer",9,{simple},simple,{9})
(int,"Signed integer",-125,{simple},simple,{-125})
(int,"Signed integer",-07,{simple},simple,{-07})
(int,"Signed integer",-4,{simple},simple,{-4})
(asciiword,"Word, all ASCII",A,{english_stem},english_stem,{})
(int,"Signed integer",-3,{simple},simple,{-3})
(6 rows)

Would there be a reasonable generalized way to pick something like
this out of a body of text using dictionaries and treat it as a
statute cite?

-Kevin

#7 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#2)
Re: tsearch2 dictionary for statute cites

Tom Lane <tgl@sss.pgh.pa.us> wrote:

regexp substitution

I found a way to at least keep the cite in one piece. Perhaps I can
do the rest in custom dictionaries, which are more pluggable.

select ts_debug
('State Statute <cite value="SS9.125.07(4A)(3)"> pertaining to');
ts_debug
--------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",State,{english_stem},english_stem,{state})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",Statute,{english_stem},english_stem,{statut})
(blank,"Space symbols"," ",{},,)
(tag,"XML tag","<cite value=""SS9.125.07(4A)(3)"">",{},,)
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",pertaining,{english_stem},english_stem,{pertain})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",to,{english_stem},english_stem,{})
(9 rows)

-Kevin

#8 Oleg Bartunov
oleg@sai.msu.su
In reply to: Kevin Grittner (#5)
Re: tsearch2 dictionary for statute cites

Kevin,

contrib/test_parser - an example parser code.

On Mon, 6 Apr 2009, Kevin Grittner wrote:

I'm afraid I'm going to have to bite the bullet and do this anyway.
Any guidance on how to go about it may save me some time. Also, if
there is any way to do this which may be useful to others or integrate
into PostgreSQL to reduce the long-term PITA aspect, I'm all ears.

-Kevin

Regards,
Oleg

#9 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Oleg Bartunov (#8)
Re: tsearch2 dictionary for statute cites

Oleg Bartunov <oleg@sai.msu.su> wrote:

contrib/test_parser - an example parser code.

Thanks! Sorry I missed that.

-Kevin

#10 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Oleg Bartunov (#8)
Re: tsearch2 dictionary for statute cites

Oleg Bartunov <oleg@sai.msu.su> wrote:

contrib/test_parser - an example parser code.

Using that as a template, I seem to be on track to use the regexp.c
code to pick out statute cites from the text in my start function, and
recognize when I'm positioned on one in my getlexeme (GETTOKEN)
function, delegating everything before, between, and after statute
cites to the default parser. (I really didn't want to copy/paste and
modify the whole default parser.)

That leaves one question I'm still pretty fuzzy on -- how do I go
about having a statute cite in a tsquery match the entire statute cite
from a tsvector, or delimited leading portions of it, without having
it match shorter portions?

For example:

If the document text contains '341.15(3)' I want to find it with a
search string of '341', '341.15', '341.15(3)' but not '341.15(3)(b)',
'341.1', or '15'. How do I handle that? Do I have to build my
tsquery values myself as text and cast to tsquery, or is there
something more graceful that I'm missing?

-Kevin
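
The splitting step described in this message (pick out the cites with a regular expression, hand everything before, between, and after them to the default parser) might look roughly like the Python sketch below. It is not the C parser from the thread. The pattern CITE_RE is an assumption fitted only to the nine sample cites posted earlier, and the "cite" / "other" labels are hypothetical stand-ins for token types.

```python
import re
from typing import Iterator, Tuple

# Assumed pattern, not from the thread: digits, then one or more parts that
# are either a dash/dot plus word characters, or a parenthesized group that
# may be preceded by a dash/dot and followed by word characters.
CITE_RE = re.compile(r"(?<!\w)\d+\w*(?:[-.]\w+|[-.]?\(\w+\)\w*)+")

def segments(text: str) -> Iterator[Tuple[str, str]]:
    """Yield ("cite", s) for each match and ("other", s) for the text
    around it; "other" pieces would be delegated to the default parser."""
    pos = 0
    for m in CITE_RE.finditer(text):
        if m.start() > pos:
            yield ("other", text[pos:m.start()])
        yield ("cite", m.group())
        pos = m.end()
    if pos < len(text):
        yield ("other", text[pos:])

for kind, piece in segments("State Statute 9.125.07(4A)(3) pertaining to"):
    print(kind, repr(piece))
```

A pattern this loose would also pick up decimals, dates, and version numbers, so a real implementation would presumably need a tighter expression or some context check.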

#11 Oleg Bartunov
oleg@sai.msu.su
In reply to: Kevin Grittner (#10)
Re: tsearch2 dictionary for statute cites

On Tue, 7 Apr 2009, Kevin Grittner wrote:

If the document text contains '341.15(3)' I want to find it with a
search string of '341', '341.15', '341.15(3)' but not '341.15(3)(b)',
'341.1', or '15'. How do I handle that? Do I have to build my
tsquery values myself as text and cast to tsquery, or is there
something more graceful that I'm missing?

Of course, you can build the tsquery yourself, but once your parser can
recognize your very own token 'xxx', it'd be much better to have a
mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.
For example, we have our dict_regex:
http://vo.astronet.ru/arxiv/dict_regex.html

Regards,
Oleg

#12 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Oleg Bartunov (#11)
Re: tsearch2 dictionary for statute cites

Oleg Bartunov <oleg@sai.msu.su> wrote:

Of course, you can build the tsquery yourself, but once your parser can
recognize your very own token 'xxx', it'd be much better to have a
mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.

I probably just need to have that "Aha!" moment, slap my forehead, and
move on; but I'm not quite understanding something. The answer to
this question could be it: Can I use a different set of dictionaries
for creating the tsquery than I did for the tsvector?

If so, I can have the dictionaries which generate the tsvector include
the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
dictionaries for the tsquery can only generate the token based on
exactly what the user typed. That would give me exactly what I want,
but somehow I have gotten the impression that the tsvector and tsquery
need to be generated using the same dictionary set.

I hope that's a mistaken impression?

-Kevin

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#12)
Re: tsearch2 dictionary for statute cites

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Can I use a different set of dictionaries
for creating the tsquery than I did for the tsvector?

Sure, as long as the tokens (normalized words) that they produce match
up for words that you want to have match. Once the tokens come out,
they're just strings as far as the rest of the text search machinery
is concerned.

regards, tom lane
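
The asymmetry discussed here (the index side emits the leading portions of a cite, the query side emits only what the user typed, and matching is plain string equality) can be tried out with the toy Python model below. It is a sketch under stated assumptions, not PostgreSQL code: each "document" is reduced to a single cite, and index_tokens, build_index, and search are hypothetical names.

```python
from collections import defaultdict

def index_tokens(cite: str) -> set[str]:
    # Index (tsvector) side: the full cite plus each leading portion that
    # ends just before a dash, dot, or left paren (assumed cut points).
    cuts = [i for i, ch in enumerate(cite) if ch in "-.(" and i > 0]
    return {cite[:i] for i in cuts} | {cite}

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, cite in docs.items():
        for token in index_tokens(cite):
            index[token].add(doc_id)
    return index

def search(index: dict[str, set[int]], user_input: str) -> set[int]:
    # Query (tsquery) side: exactly the token the user typed, compared as
    # a plain string against the indexed tokens.
    return set(index.get(user_input.strip(), set()))

docs = {1: "341.15(3)", 2: "341.15(3)(b)", 3: "341.1"}
index = build_index(docs)
for q in ("341", "341.15", "341.15(3)", "341.15(3)(b)", "341.1", "15"):
    print(q, "->", sorted(search(index, q)))
```

In this model document 1, which contains 341.15(3), is found by '341', '341.15', and '341.15(3)', and is not found by '341.15(3)(b)', '341.1', or '15', which is the behavior asked for earlier in the thread.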

#14 Oleg Bartunov
oleg@sai.msu.su
In reply to: Kevin Grittner (#12)
Re: tsearch2 dictionary for statute cites

On Tue, 7 Apr 2009, Kevin Grittner wrote:

Oleg Bartunov <oleg@sai.msu.su> wrote:

Of course, you can build the tsquery yourself, but once your parser can
recognize your very own token 'xxx', it'd be much better to have a
mapping xxx -> dict_xxx, where dict_xxx knows all the semantics.

I probably just need to have that "Aha!" moment, slap my forehead, and
move on; but I'm not quite understanding something. The answer to
this question could be it: Can I use a different set of dictionaries
for creating the tsquery than I did for the tsvector?

Sure! For example, you may want to index all words, so your dictionaries
for indexing have no stop-word lists, but you forbid people to search for
common words. Or, if you want to search for 'to be or not to be', you have
to use dictionaries without stop words.

If so, I can have the dictionaries which generate the tsvector include
the appropriate leading tokens ('341', '341.15', '341.15(3)') and the
dictionaries for the tsquery can only generate the token based on
exactly what the user typed. That would give me exactly what I want,
but somehow I have gotten the impression that the tsvector and tsquery
need to be generated using the same dictionary set.

I hope that's a mistaken impression?

yes.

Regards,
Oleg

#15 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#13)
Re: tsearch2 dictionary for statute cites

Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Can I use a different set of dictionaries
for creating the tsquery than I did for the tsvector?

Sure, as long as the tokens (normalized words) that they produce
match up for words that you want to have match. Once the tokens
come out, they're just strings as far as the rest of the text search
machinery is concerned.

Fantastic! Don't know how I got confused about that, but the way now
looks clear.

Thanks!

-Kevin

#16 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Oleg Bartunov (#14)
Re: SOLVED: tsearch2 dictionary for statute cites

Oleg Bartunov <oleg@sai.msu.su> wrote:

I probably just need to have that "Aha!" moment, slap my forehead, and
move on; but I'm not quite understanding something. The answer to
this question could be it: Can I use a different set of dictionaries
for creating the tsquery than I did for the tsvector?

Sure! For example, you may want to index all words, so your dictionaries
for indexing have no stop-word lists, but you forbid people to search for
common words. Or, if you want to search for 'to be or not to be', you have
to use dictionaries without stop words.

I found a creative solution which I think meets my needs. I'm posting
both to help out anyone with similar issues who finds the thread, and
in case someone sees an obvious defect. By creating one function to
generate the "legal" tsvector (which recognizes statute cites) and
another function to generate the search values, with casts from text
to the ts objects, I can get more targeted results than the parser and
dictionary changes alone could give me.

I'm still working on the dictionaries and the query function, but the
vector function currently looks like the attached.

Thanks to Oleg and Tom for assistance; while neither suggested quite
this solution, their comments moved me along to where I found it.

-Kevin

Attachments:

to_legal_tsvector.sql (application/octet-stream)