kirara-2.2b/QUICK_GUIDE

sh-4.2$ curl -O http://plan9.aichi-u.ac.jp/netlib/kirara/kirara-2.2b.tgz
 % Total    % Received % Xferd  Average Speed   Time    Time     Time Current
                                Dload  Upload   Total   Spent    Left Speed
100 86796 100 86796    0     0  34497      0  0:00:02  0:00:02 --:--:-- 34497

sh-4.2$ tar -xf kirara-2.2b.tgz
sh-4.2$ cat kirara-2.2b/QUICK_GUIDE
-------------

Kirara1 (ver 1.5)
Kirara2 (ver 2.2)

look
http://plan9.aichi-u.ac.jp/netlib/
for new version or bug fix version.

-------------

Kirara is a desktop search engine for Plan 9.

Indexing and search target:
- alphanumeric words in Latin text including English, German, and etc.
- alphabetic words in Greek, Coptic, Cyrillic, Armenian
- CJK Kanji and Japanese Katakana

Personal use: index/retrieve local files.

Kirara is based on the idea similar to Glimpse.

(1) indexing + grep
(2) multi-level indexing

(a) small space for indexing
(b) small update time
(c) quick search

Note that:
small indexing   ->|<- quick search # comflicting
Kirara makes more index -> quick search
Glimpse is single-level indexing.

-------------

Query

Kirara2 is similar to Kirara1 but stands on different concept.
Kirara1 supports a sort of phrase search but slow.
Kirara2 improves this problem at the cost of indexing space and updating time.

phrase search of Kirara: search text lines with multiple query words.

Kirara1: The database is index of words that points to directory.
Kirara2: The database is index of words that points to file.

Both support:
QE mode (query expression mode)
'&', '|', '*', '?', ' '
The example:
'snoopy&html'
'sn?o*y htm*'

Look MAN_KFIND for detail.

-------------

Words

Two or more aphanumeric letters including '_' for Latin text.
All words are converted to lower case.
The minimum number of letters is configurable.

Assumption:
Text is composed of words that are separated by spaces and symbols.
This is popular in English and many European Languages, but not in Japanese.
In extracting words from Japanese text, Hiragana is discarded. This is reasonable
strategy in handling Japanese.

-------------

The user's interface

Best match with Rio. 
term% kfind1 snoopy
G1 'snoopy' /lib/dict/
G1 'snoopy' /sys/lib/linux/var/lib/apt/lists/
G1 'snoopy' /sys/lib/man/lookman/
G1 'snoopy' /sys/lib/sysconfig/auth/
G1 'snoopy' /sys/lib/sysconfig/proto/
G1 'snoopy' /sys/man/
G1 'snoopy' /sys/man/3/
G1 'snoopy' /sys/man/8/
G1 'snoopy' /sys/src/cmd/gs/doc/
G1 'snoopy' /sys/src/cmd/ip/
...
This is an example for Kirara1.
In Kirara2, replace: kfind1 -> kfind2 and G1 -> G2
and also, directories -> files

Note that: two steps

Kirara1:
1. find directories
2. find files and the contents
Step 2 is actually 'grep'.

Kirara2:
1. find files
2. find the contents
Step 2 is 'grep' for small files and new tools (look MAN_MKDICT) for large files.

Two-steps search is not a weekness, but a desirable feature.
Because we have so many files that are hit by the query.

-------------

The indexing organization

My example

Kirara1: /n/other/kirara1/sysdb
Kirara2: /n/other/kirara2/sysdb
target=(/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)

Kirara1: /n/other/kirara1/usrdb
Kirara2: /n/other/kirara2/usrdb
target=$home/^(bin/rc lib netlib doc adm issues srclib src)

Now, you may have as many usr*db as you like.

Indexing target is fully configurable.
Excluding target is also supported.
It is wise not to include dictionary such as "/lib/dict" which is
annoying in retrieval.

-------------

Multi-Level Indexing (Kirara1)

(1) Indexing (top level)
word to directory mapping
sysdb/main # main index
sysdb/mtoc # meta index (table of contents for main)
sysdb/qdir/* # index of each directory (only for large direcories)
sysdb/qtd # map table (qid, mtime, path_to_dir)
sysdb/qd # map table (qid, path_to_dir)

main # word to dir qid
aa 0000000000014f0a
aa 000000000001a1e0
aa 000000000001a26e

mtoc # word to range in main
aa 0 126669
ab 126669 491569
ac 491569 1258566
ad 1258566 1852467
...

qdir/*/
where '*' is qid of directories
in qdir/*/ we have
qtn # map table of files (qid, mtime, name)
and optionary, word to file mapping:
ind.gz # fine index of the directory (gzipped)

usrdb is same as sysdb.

-------------

Multi-Level Indexing (Kirara2)

(1) Indexing (top level)
word to file mapping
sysdb/main # main index
sysdb/mtoc # meta index (table of contents for main)
sysdb/qdir/* # index of each file (only for large files)
sysdb/qtf # map table (qid, mtime, path_to_file)
sysdb/qf # map table (qid, path_to_file)

main # word to file qid
aa 000000000001a1e8
aa 000000000001a26f
aa 000000000001a277

mtoc # word to range in main
aa 5393489 5824028
ab 5824028 6963148
ac 6963148 9918376
ad 9918376 11926357
...
qdir/*/
where '*' is qid of files
in qdir/*/ we have files:
ind.gz # fine index of the file (gzipped)
dict:X # dictionary of the file
list:X # inverted list of the file
indx:X # map table of dict:X

usrdb is same as sysdb.

-------------

Experiment

(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 8GB (only 4GB is used in 386 kernel)

(b) software
9front
cwfs64x

-------------

NB: The performance data is of older version.


The performance (compression ratio)

(a) Kirara1

target target num_of_dirs    index
sysdb:   380 MB   1790 dirs 100 MB
usrdb: 1900 MB   9000 dirs 230 MB

target: the size is total ammount of text file size.
index: includes "main" + (all other files for indexing).
ratio: 100/380 (sysdb)
usrdb includes many large file which improved the ratio.
current version requies much indexing space than previous version.
this is because current version of word extractor supports non-ascii runes.

* tiny small medium large
sysdb 31227 313 12 4
usrdb 85246 1509 220 17

where
tiny <100K
small <1000K
medium <10000K
large


(b) Kirara2

target target  num_of_files    index
sysdb:   380 MB  36000 files 170 MB
usrdb: 1900 MB  92000 files 580 MB


-------------

The performance (retrieval time)
system dependent and cache state dependent

QE search # kfind foo

latency:
term% kfind snoopy
0.2 seconds for Kirara1.
0.2 seconds for Kirara2.

It is not important to make these times smaller.
(sufficiently small)

Normal QE search does not make much difference between Kirara1 and Kirara2.
The difference is in phrase search. The followings is an example.
(kfind1 and kfind2 are seach command for Kirara1 and Kirara2 respectively.)

term% date -n; kfind2 'snoopy proto' |rc; date -n
1381328072
/sys/lib/man/lookman/index:48401: proto /sys/man/8/snoopy
/sys/lib/sysconfig/proto/standalone:61: snoopy
...
1381328077
term% date -n; kfind1 'snoopy proto' |rc; date -n
1381328119
/sys/lib/man/lookman/index:48401: proto /sys/man/8/snoopy
/sys/lib/sysconfig/proto/standalone:61: snoopy
...
1381328137
term%

and another example:

term% date -n; kfind2 'include regexp' | rc |wc ; date -n
1381329707
    247    3406   37076
1381329767
term% date -n; kfind1 'include regexp' | rc |wc ; date -n
1381329805
    263    3454   38337
1381329983
term% 

Kfind2 is three times faster than kfind1.

NOTE: both Kirara1 and Kirara2 show excessive "matched" lines such as
/sys/lib/sysconfig/proto/standalone:61: snoopy
("proto" is matched to pathname)
This tendancy is stronger in Kirara1 than in Kirara2.

-------------

The performance (construction/update)

(a) Construction time
system dependent

Initial construction

* Kirara1 Kirara2
sysdb 15m 30m
usrdb 50m   100m

m: minutes

(b) Updating time
two commands for update

mkdb -r # scans all dirs in the target to detect update

* Kirara1 Kirara2
usrdb 2m 6m

these times depend much on state of the cache (dirs in the target and the file "main")

NB: the use of "-r" option for Kirara1 may miss the change because of
mtime bug in cwfs. (Fossil is free from this bug)

mkdb -L # consults event log to detect update (only for usrdb)

* Kirara1 Kirara2
usrdb 2m 4m

the "-L" option save the time for scanining dirs in the target.
these times depend much on state of the cache (the file "main").
the above result is the case that is poorly cached.
if cached well, the 2m of Kirara1 reduces to 14sec.
on the other hand, the 4m of Kirara2 does not reduce to such a small value,
I suspect that is because the size of "main" is too large to be fully cached.

NB: needs event log

-------------

Scalability

Main factors

(a) retrieval time
QE search: proportional to number of dirs that include the query

(b) initial construction time
proportional to total data

(c) update time
mkdb1 -r
  proportional to number of dirs and the changes.
mkdb1 -L
proportional to changes and size of index (file named "main").

-------------

Used Tools

(1) rc

(2) grep, sed, awk, sort, diff, gzip, ...

(3) some new tools written in C

-------------

What Kirara means?

Kirara is name of a girl that appeared in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in real world.
From the name we Japanese imagine something that is shining, twincling
or gleaming beautifully and gracefully.
I like the name.

-------------

References

[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
http://webglimpse.net/gdocs/glimpsehelp.html
sh-4.2$ 

No comments:

Post a Comment