WIP: shared ispell dictionary
Hello
The attached patch adds the ability to share an ispell dictionary between
processes. The motivation is the slowness of the first tsearch query
and the amount of memory allocated per process. When I tested loading
an ispell dictionary (for the Czech language), it took about 500 ms and
48 MB. With a simple allocator it uses only 25 MB. If we remove some
checks and the tolower string transformation from the loading stage, it
needs only 200 ms, but then a broken dict or affix file can produce
wrong results. This patch significantly reduces the load on servers
that use ispell dictionaries.
I know that Tom worries about the use of shared memory. I think the
worry is unnecessary here: after loading, the dictionary data are only
read, never modified. A second idea: this dictionary template could be
distributed as a separate project (it needs a few changes in core, and
the simple allocator).
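The message does not show the simple allocator itself. As a rough illustration of why one can save memory for data that is loaded once and never freed piece by piece, here is a minimal sketch of a bump allocator that carves chunks sequentially out of one block and keeps no per-chunk header. This is not the code from the patch; `SimpleArena`, `arena_init`, `arena_alloc`, and `arena_free` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical arena: one big block, allocations carved off sequentially. */
typedef struct SimpleArena
{
    char   *base;   /* start of the block */
    size_t  size;   /* total bytes in the block */
    size_t  used;   /* bytes handed out so far */
} SimpleArena;

/* Reserve the block; returns 0 on success, -1 if malloc fails. */
static int
arena_init(SimpleArena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL ? 0 : -1;
}

/*
 * Hand out nbytes with no per-chunk header. Only the start offset is
 * rounded up to pointer alignment. Returns NULL when the block is full.
 */
static void *
arena_alloc(SimpleArena *a, size_t nbytes)
{
    size_t align = sizeof(void *);
    size_t start = (a->used + align - 1) & ~(align - 1);

    assert(a->base != NULL);
    if (start > a->size || nbytes > a->size - start)
        return NULL;
    a->used = start + nbytes;
    return a->base + start;
}

/* Individual chunks cannot be freed; the whole arena is released at once. */
static void
arena_free(SimpleArena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

The trade-off is that nothing can be returned to the arena individually, which fits a dictionary that is only read after loading.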
Usage:
a) set shared_data = 26MB (postgresql.conf)
b) restart the server
c) register the dictionary with the option "share = yes":
CREATE TEXT SEARCH DICTIONARY cspell
(template=ispell, dictfile = czech, afffile=czech, stopwords=czech,
share = yes);
[pavel@nemesis src]$ psql-dev3 postgres
Timing is on.
psql-dev3 (9.0devel)
Type "help" for help.
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
   alias   |    description    |   token   |  dictionaries   | dictionary |   lexemes
-----------+-------------------+-----------+-----------------+------------+-------------
 word      | Word, all letters | Příliš    | {cspell,simple} | cspell     | {příliš}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluťoučký | {cspell,simple} | cspell     | {žluťoučký}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | kůň       | {cspell,simple} | cspell     | {kůň}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | se        | {cspell,simple} | cspell     | {}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | napil     | {cspell,simple} | cspell     | {napít}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluté     | {cspell,simple} | cspell     | {žlutý}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | vody      | {cspell,simple} | cspell     | {voda}
(13 rows)
Time: 8,178 ms <<-- without patch 500ms
Limits and ToDo:
a) it supports only simple regular expressions
b) it doesn't handle cache reset and shared memory deallocation
Regards
Pavel Stehule
Attachments:
shared_dictionary_02.diff (application/octet-stream, +747 -291)
Pavel Stehule wrote:
> The attached patch adds the ability to share an ispell dictionary
> between processes. The motivation is the slowness of the first tsearch
> query and the amount of memory allocated per process. When I tested
> loading an ispell dictionary (for the Czech language), it took about
> 500 ms and 48 MB. With a simple allocator it uses only 25 MB. If we
> remove some checks and the tolower string transformation from the
> loading stage, it needs only 200 ms, but then a broken dict or affix
> file can produce wrong results. This patch significantly reduces the
> load on servers that use ispell dictionaries.
>
> I know that Tom worries about the use of shared memory. I think the
> worry is unnecessary here: after loading, the dictionary data are only
> read, never modified. A second idea: this dictionary template could be
> distributed as a separate project (it needs a few changes in core, and
> the simple allocator).
Fixed-size shared memory blocks are always problematic. Would it be
possible to do the preloading with shared_preload_libraries somehow?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
2010/3/18 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Fixed-size shared memory blocks are always problematic. Would it be
> possible to do the preloading with shared_preload_libraries somehow?
Maybe. But there are some disadvantages: a) you have to copy the
dictionary info into the config, b) on some systems the large amount of
memory per process can be a problem (probably not on Linux). You still
have to build some bridge between the tsearch cache and the preloaded
data.
Pavel
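For context on point b): shared_preload_libraries loads a library into the postmaster before backends are started. On platforms where backends are created with fork(), data loaded that way is inherited by every child and is typically shared copy-on-write as long as nobody writes to it. On platforms without fork() (Windows builds, for example) each server process re-loads the preload libraries, so nothing is saved there. The following is a minimal POSIX sketch of the inheritance, not PostgreSQL code; `preloaded`, `preload_dictionary`, and `child_finds` are hypothetical names, and the string is a stand-in for real dictionary data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the preloaded dictionary (hypothetical). */
static char *preloaded = NULL;

/* Parent-side load, done once before any child is forked. */
static int
preload_dictionary(void)
{
    const char *data = "voda vody kun";

    preloaded = malloc(strlen(data) + 1);
    if (preloaded == NULL)
        return -1;
    strcpy(preloaded, data);
    return 0;
}

/*
 * Fork a child that only reads the data loaded by the parent.
 * Returns 1 if the child found the word, 0 if not, -1 on error.
 */
static int
child_finds(const char *word)
{
    pid_t pid;
    int   status;

    assert(preloaded != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* Child: no reload needed, the parent's memory is visible here. */
        _exit(strstr(preloaded, word) != NULL ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```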
Pavel Stehule <pavel.stehule@gmail.com> writes:
> I know that Tom worries about the use of shared memory.
You're right, and if I have any say in the matter no patch like this
will ever go in.
What I would suggest looking into is some way of preprocessing the raw
text dictionary file into a format that can be slurped into memory
quickly. The main problem compared to the way things are done now
is that the current internal format relies heavily on pointers.
Maybe you could replace those by offsets?
regards, tom lane
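The pointer-to-offset idea can be sketched as follows. When nodes refer to each other by array offset rather than by address, the whole structure is position-independent: a byte-for-byte copy (or, by extension, a file read into memory) is usable immediately with no per-node fixup. This is only an illustration under assumed names (`FlatNode`, `FlatTrie`, `trie_insert`, `trie_contains`, `trie_clone_bytes`); it is not the ispell internal format, and a real on-disk format would also have to deal with endianness and versioning.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NO_NODE   0u    /* offset 0 is the root, so it doubles as "none" */
#define MAX_NODES 256

/* Hypothetical node layout: links are array offsets, not pointers. */
typedef struct FlatNode
{
    uint32_t child;     /* offset of first child, or NO_NODE */
    uint32_t sibling;   /* offset of next sibling, or NO_NODE */
    char     ch;        /* character on the edge into this node */
    char     is_word;   /* 1 if a word ends at this node */
} FlatNode;

typedef struct FlatTrie
{
    uint32_t nnodes;
    FlatNode nodes[MAX_NODES];
} FlatTrie;

static void
trie_init(FlatTrie *t)
{
    memset(t, 0, sizeof(*t));
    t->nnodes = 1;      /* node 0 is the root */
}

/* Find the child of cur reached by character c, or NO_NODE. */
static uint32_t
trie_step(const FlatTrie *t, uint32_t cur, char c)
{
    uint32_t n = t->nodes[cur].child;

    while (n != NO_NODE && t->nodes[n].ch != c)
        n = t->nodes[n].sibling;
    return n;
}

static void
trie_insert(FlatTrie *t, const char *word)
{
    uint32_t cur = 0;

    for (; *word; word++)
    {
        uint32_t n = trie_step(t, cur, *word);

        if (n == NO_NODE)
        {
            assert(t->nnodes < MAX_NODES);
            n = t->nnodes++;
            t->nodes[n].ch = *word;
            t->nodes[n].sibling = t->nodes[cur].child;
            t->nodes[cur].child = n;
        }
        cur = n;
    }
    t->nodes[cur].is_word = 1;
}

static int
trie_contains(const FlatTrie *t, const char *word)
{
    uint32_t cur = 0;

    for (; *word; word++)
    {
        cur = trie_step(t, cur, *word);
        if (cur == NO_NODE)
            return 0;
    }
    return t->nodes[cur].is_word;
}

/* "Slurp" step: a raw byte copy works at any address, no fixups needed. */
static FlatTrie *
trie_clone_bytes(const FlatTrie *src)
{
    FlatTrie *copy = malloc(sizeof(*copy));

    if (copy != NULL)
        memcpy(copy, src, sizeof(*copy));
    return copy;
}
```

With pointers, the same copy would leave every link pointing back into the original structure, which is why the current format cannot simply be dumped and reloaded.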
2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
> You're right, and if I have any say in the matter no patch like this
> will ever go in.
>
> What I would suggest looking into is some way of preprocessing the raw
> text dictionary file into a format that can be slurped into memory
> quickly. The main problem compared to the way things are done now
> is that the current internal format relies heavily on pointers.
> Maybe you could replace those by offsets?
Then you have to maintain a new application :( and it can introduce a
new kind of bug.
I am playing with the preload solution now, and I found a new issue.
I don't know why, but when I preload a library with a large memory
footprint like ispell, all subsequent operations are ten times slower :(
[pavel@nemesis tsearch]$ psql-dev3 postgres
Timing is on.
psql-dev3 (9.0devel)
Type "help" for help.
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 0,611 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 0,277 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 0,266 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 0,348 ms
postgres=# select * from ts_debug('cs','Jmenuji se Pavel Stěhule a bydlím ve Skalici');
   alias   |    description    |  token  |       dictionaries        |    dictionary    |    lexemes
-----------+-------------------+---------+---------------------------+------------------+----------------
 asciiword | Word, all ASCII   | Jmenuji | {preloaded_cspell,simple} | preloaded_cspell | {jmenovat}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | se      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Pavel   | {preloaded_cspell,simple} | preloaded_cspell | {pavel,pavla}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | Stěhule | {preloaded_cspell,simple} | preloaded_cspell | {stěhule}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | a       | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | bydlím  | {preloaded_cspell,simple} | preloaded_cspell | {bydlet,bydlit}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | ve      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Skalici | {preloaded_cspell,simple} | preloaded_cspell | {skalice}
(15 rows)
Time: 24,495 ms
postgres=# select * from ts_debug('cs','Jmenuji se Pavel Stěhule a bydlím ve Skalici');
   alias   |    description    |  token  |       dictionaries        |    dictionary    |    lexemes
-----------+-------------------+---------+---------------------------+------------------+----------------
 asciiword | Word, all ASCII   | Jmenuji | {preloaded_cspell,simple} | preloaded_cspell | {jmenovat}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | se      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Pavel   | {preloaded_cspell,simple} | preloaded_cspell | {pavel,pavla}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | Stěhule | {preloaded_cspell,simple} | preloaded_cspell | {stěhule}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | a       | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 word      | Word, all letters | bydlím  | {preloaded_cspell,simple} | preloaded_cspell | {bydlet,bydlit}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | ve      | {preloaded_cspell,simple} | preloaded_cspell | {}
 blank     | Space symbols     |         | {}                        |                  |
 asciiword | Word, all ASCII   | Skalici | {preloaded_cspell,simple} | preloaded_cspell | {skalice}
(15 rows)
Time: 18,426 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 12,700 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 12,465 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 12,603 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 12,901 ms
postgres=# select 10;
?column?
----------
10
(1 row)
Time: 12,642 ms
When I reduce the memory usage with the simple allocator, the issue
goes away, but it is strange.
Pavel
2010/3/18 Pavel Stehule <pavel.stehule@gmail.com>:
> I am playing with the preload solution now, and I found a new issue.
> I don't know why, but when I preload a library with a large memory
> footprint like ispell, all subsequent operations are ten times slower :(
This strange issue comes from the very large memory context. When I
don't attach the tsearch cached context to the working context, the
issue doesn't exist.
Datum
dpreloaddict_init(PG_FUNCTION_ARGS)
{
    if (prepd == NULL)
        return dispell_init(fcinfo);    /* use without preloading */
    else
    {
        /* return PointerGetDatum(prepd); */

        /*
         * Add the preload context to the current context -- when
         * this code is active, I have an issue.
         */
        preload_ctx->parent = CurrentMemoryContext;
        preload_ctx->nextchild = CurrentMemoryContext->firstchild;
        CurrentMemoryContext->firstchild = preload_ctx;

        return PointerGetDatum(prepd);
    }
}
Pavel
2010/3/18 Tom Lane <tgl@sss.pgh.pa.us>:
> You're right, and if I have any say in the matter no patch like this
> will ever go in.
I wrote a second patch based on preloading. For real use it still needs
a design for parametrisation. It works well on Linux. It is simple and
fast (with the simple allocator). I am not sure about other systems.
At a minimum, it could exist as a contrib module.
Pavel