Tokyo Dystopia is a full-text search system. It searches a large collection of records for those that include specified patterns. The characteristics of Tokyo Dystopia are the following.
Tokyo Dystopia is available on platforms whose APIs conform to C99 and POSIX. Tokyo Dystopia is free software licensed under the GNU Lesser General Public License.
Install the latest version of Tokyo Cabinet beforehand and get the package of Tokyo Dystopia.
After extracting the archive file of Tokyo Dystopia, change the current working directory to the generated directory and perform the installation.
Run the configuration script.
./configure
Build programs.
make
Perform self-diagnostic test.
make check
Install programs. This operation must be carried out by the root user.
make install
When the installation finishes, the following files are installed.
/usr/local/include/tcqdb.h
/usr/local/include/dystopia.h
/usr/local/lib/libtokyodystopia.a
/usr/local/lib/libtokyodystopia.so.1.3.0
/usr/local/lib/libtokyodystopia.so.1
/usr/local/lib/libtokyodystopia.so
/usr/local/lib/pkgconfig/tokyodystopia.pc
/usr/local/bin/tcqtest
/usr/local/bin/tcqmgr
/usr/local/bin/dysttest
/usr/local/bin/dystmgr
/usr/local/libexec/dystsearch.cgi
/usr/local/share/tokyodystopia/...
/usr/local/man/man1/...
/usr/local/man/man3/...
The C API is available to programs conforming to the C89 (ANSI C) standard or the C99 standard. As the header files of Tokyo Dystopia are provided as `dystopia.h' and `tcqdb.h', applications should include them to use the API. As the library is provided as `libtokyodystopia.a' and `libtokyodystopia.so' and they depend on `libtokyocabinet.so', `libz.so', `libpthread.so', `libm.so', and `libc.so', the linker options `-ltokyodystopia', `-ltokyocabinet', `-lz', `-lpthread', `-lm', and `-lc' are required in the build command. A typical build command is the following.
gcc -I/usr/local/include tc_example.c -o tc_example \
  -L/usr/local/lib -ltokyodystopia -ltokyocabinet -lz -lpthread -lm -lc
You can also use Tokyo Dystopia in programs written in C++. Because each header is wrapped in C linkage (`extern "C"' block), you can simply include them in your C++ programs.
An indexed database is a directory containing a hash database file and its index files. The key of each record is a positive number. The value of each record is arbitrary text encoded in UTF-8. See `dystopia.h' for the entire specification.
To use the core API, include `dystopia.h' and related standard header files. Usually, write the following lines near the top of a source file.
#include <dystopia.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>
Objects whose type is pointer to `TCIDB' are used to handle indexed databases. An indexed database object is created with the function `tcidbnew' and is deleted with the function `tcidbdel'. To avoid memory leaks, it is important to delete every object when it is no longer in use.
Before operations to store or retrieve records, it is necessary to open a database directory and connect the indexed database object to it. The function `tcidbopen' is used to open a database directory and the function `tcidbclose' is used to close the database directory. To avoid data loss or corruption, it is important to close every database directory when it is no longer in use.
The function `tcidberrmsg' is used in order to get the message string corresponding to an error code.
const char *tcidberrmsg(int ecode);
The function `tcidbnew' is used in order to create an indexed database object.
TCIDB *tcidbnew(void);
The function `tcidbdel' is used in order to delete an indexed database object.
void tcidbdel(TCIDB *idb);
The function `tcidbecode' is used in order to get the last happened error code of an indexed database object.
int tcidbecode(TCIDB *idb);
The function `tcidbtune' is used in order to set the tuning parameters of an indexed database object.
bool tcidbtune(TCIDB *idb, int64_t ernum, int64_t etnum, int64_t iusiz, uint8_t opts);
The function `tcidbsetcache' is used in order to set the caching parameters of an indexed database object.
bool tcidbsetcache(TCIDB *idb, int64_t icsiz, int32_t lcnum);
The function `tcidbsetfwmmax' is used in order to set the maximum number of forward matching expansion of an indexed database object.
bool tcidbsetfwmmax(TCIDB *idb, uint32_t fwmmax);
The function `tcidbopen' is used in order to open an indexed database object.
bool tcidbopen(TCIDB *idb, const char *path, int omode);
The function `tcidbclose' is used in order to close an indexed database object.
bool tcidbclose(TCIDB *idb);
The function `tcidbput' is used in order to store a record into an indexed database object.
bool tcidbput(TCIDB *idb, int64_t id, const char *text);
The function `tcidbout' is used in order to remove a record of an indexed database object.
bool tcidbout(TCIDB *idb, int64_t id);
The function `tcidbget' is used in order to retrieve a record of an indexed database object.
char *tcidbget(TCIDB *idb, int64_t id);
The function `tcidbsearch' is used in order to search an indexed database.
uint64_t *tcidbsearch(TCIDB *idb, const char *word, int smode, int *np);
The function `tcidbsearch2' is used in order to search an indexed database with a compound expression.
uint64_t *tcidbsearch2(TCIDB *idb, const char *expr, int *np);
The function `tcidbiterinit' is used in order to initialize the iterator of an indexed database object.
bool tcidbiterinit(TCIDB *idb);
The function `tcidbiternext' is used in order to get the next ID number of the iterator of an indexed database object.
uint64_t tcidbiternext(TCIDB *idb);
The function `tcidbsync' is used in order to synchronize updated contents of an indexed database object with the files and the device.
bool tcidbsync(TCIDB *idb);
The function `tcidboptimize' is used in order to optimize the files of an indexed database object.
bool tcidboptimize(TCIDB *idb);
The function `tcidbvanish' is used in order to remove all records of an indexed database object.
bool tcidbvanish(TCIDB *idb);
The function `tcidbcopy' is used in order to copy the database directory of an indexed database object.
bool tcidbcopy(TCIDB *idb, const char *path);
The function `tcidbpath' is used in order to get the directory path of an indexed database object.
const char *tcidbpath(TCIDB *idb);
The function `tcidbrnum' is used in order to get the number of records of an indexed database object.
uint64_t tcidbrnum(TCIDB *idb);
The function `tcidbfsiz' is used in order to get the total size of the database files of an indexed database object.
uint64_t tcidbfsiz(TCIDB *idb);
The function `tcidbsearch2' searches with a compound expression. In the compound expression, tokens are separated by one or more white space characters. If one token is specified, records including the specified pattern are searched for. Upper or lower case is not distinguished. Accent marks and diacritical marks are ignored. If two or more tokens are specified, records including all of the patterns are searched for. The compound expression includes the following sub expressions.
A B : searches for records including the two tokens.
A && B : searches for records including the two tokens.
A || B : searches for records including one or both of the two tokens.
"A B..." : searches for records including the phrase.
[[A]] : searches for records including words exactly matching the token.
[[A*]] : searches for records including words beginning with the token.
[[*A]] : searches for records including words ending with the token.
[[[[A : searches for records beginning with the token.
A]]]] : searches for records ending with the token.
Note that the priority of "||" is higher than that of "&&".
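The bracketed sub expressions above can be told apart by their delimiters alone. The following is a minimal sketch, not the library's own parser, of how one token of a compound expression might be classified. The type `token_mode', its constants, and the function `classify_token' are hypothetical names introduced for this illustration; phrases in double quotation marks and the operators "&&" and "||" are assumed to be handled elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mode names for this sketch; not the library's constants. */
typedef enum {
  MODE_SUBSTR,       /* A      : pattern anywhere in the record */
  MODE_WORD_EXACT,   /* [[A]]  : a word exactly matching the token */
  MODE_WORD_PREFIX,  /* [[A*]] : a word beginning with the token */
  MODE_WORD_SUFFIX,  /* [[*A]] : a word ending with the token */
  MODE_REC_PREFIX,   /* [[[[A  : the record begins with the token */
  MODE_REC_SUFFIX    /* A]]]]  : the record ends with the token */
} token_mode;

/* Classify one token and copy its bare pattern into `pat' (capacity `cap').
   The pattern is truncated if it does not fit. */
token_mode classify_token(const char *tok, char *pat, size_t cap){
  size_t len = strlen(tok);
  const char *beg = tok;
  token_mode mode = MODE_SUBSTR;
  assert(cap > 0);
  if(len > 4 && strncmp(tok, "[[[[", 4) == 0){
    mode = MODE_REC_PREFIX;
    beg = tok + 4;
    len -= 4;
  } else if(len > 4 && strcmp(tok + len - 4, "]]]]") == 0){
    mode = MODE_REC_SUFFIX;
    len -= 4;
  } else if(len > 4 && strncmp(tok, "[[", 2) == 0 &&
            strcmp(tok + len - 2, "]]") == 0){
    beg = tok + 2;
    len -= 4;                       /* drop "[[" and "]]" */
    if(len > 1 && beg[len-1] == '*'){
      mode = MODE_WORD_PREFIX;
      len--;                        /* drop the trailing wild card */
    } else if(len > 1 && beg[0] == '*'){
      mode = MODE_WORD_SUFFIX;
      beg++;                        /* drop the leading wild card */
      len--;
    } else {
      mode = MODE_WORD_EXACT;
    }
  }
  if(len >= cap) len = cap - 1;
  memcpy(pat, beg, len);
  pat[len] = '\0';
  return mode;
}
```

For example, `classify_token("[[john*]]", pat, sizeof(pat))' returns `MODE_WORD_PREFIX' and stores "john" in `pat'.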
The following code is an example of using an indexed database.
#include <dystopia.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv){
  TCIDB *idb;
  int ecode, rnum, i;
  uint64_t *result;
  char *text;

  /* create the object */
  idb = tcidbnew();

  /* open the database */
  if(!tcidbopen(idb, "casket", IDBOWRITER | IDBOCREAT)){
    ecode = tcidbecode(idb);
    fprintf(stderr, "open error: %s\n", tcidberrmsg(ecode));
  }

  /* store records */
  if(!tcidbput(idb, 1, "George Washington") ||
     !tcidbput(idb, 2, "John Adams") ||
     !tcidbput(idb, 3, "Thomas Jefferson")){
    ecode = tcidbecode(idb);
    fprintf(stderr, "put error: %s\n", tcidberrmsg(ecode));
  }

  /* search records */
  result = tcidbsearch2(idb, "john || thomas", &rnum);
  if(result){
    for(i = 0; i < rnum; i++){
      text = tcidbget(idb, result[i]);
      if(text){
        printf("%d\t%s\n", (int)result[i], text);
        free(text);
      }
    }
    free(result);
  } else {
    ecode = tcidbecode(idb);
    fprintf(stderr, "search error: %s\n", tcidberrmsg(ecode));
  }

  /* close the database */
  if(!tcidbclose(idb)){
    ecode = tcidbecode(idb);
    fprintf(stderr, "close error: %s\n", tcidberrmsg(ecode));
  }

  /* delete the object */
  tcidbdel(idb);

  return 0;
}
A q-gram database is a file containing an index of text. The key of each record is a positive number. The value of each record is arbitrary text encoded in UTF-8. Note that a q-gram database is a pure index and does not contain the records themselves. See `tcqdb.h' for the entire specification.
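To illustrate what a pure index means, the following sketch splits text into overlapping q-grams and uses them as a candidate filter. It illustrates the general q-gram technique and is not the library's implementation: the gram length `QGRAM_N', the byte-wise handling of text, and the function names are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QGRAM_N 2  /* assumed gram length for this sketch */

/* Number of overlapping q-grams in `text'. */
size_t qgram_count(const char *text){
  size_t len = strlen(text);
  return len < QGRAM_N ? 0 : len - QGRAM_N + 1;
}

/* Copy the `idx'-th q-gram of `text' into `out' (at least QGRAM_N + 1
   bytes). Return 1 on success, 0 if `idx' is out of range. */
int qgram_at(const char *text, size_t idx, char *out){
  assert(text && out);
  if(idx >= qgram_count(text)) return 0;
  memcpy(out, text + idx, QGRAM_N);
  out[QGRAM_N] = '\0';
  return 1;
}

/* Return 1 if every q-gram of `word' occurs in `text'. This is only a
   candidate filter: it can report 1 for a text that does not contain the
   word itself. A word shorter than QGRAM_N has no q-grams and always passes. */
int qgram_candidate(const char *text, const char *word){
  char gram[QGRAM_N + 1];
  size_t i, n = qgram_count(word);
  for(i = 0; i < n; i++){
    qgram_at(word, i, gram);
    if(!strstr(text, gram)) return 0;
  }
  return 1;
}
```

A record can contain a word only if it contains every q-gram of the word, so an index from q-grams to ID numbers can narrow a search without storing the text. The converse does not hold: "ab bc" contains both "ab" and "bc" but not "abc", so candidates still need a further check.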
To use the q-gram database API, include `tcqdb.h' and related standard header files. Usually, write the following lines near the top of a source file.
#include <tcqdb.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>
Objects whose type is pointer to `TCQDB' are used to handle q-gram databases. A q-gram database object is created with the function `tcqdbnew' and is deleted with the function `tcqdbdel'. To avoid memory leaks, it is important to delete every object when it is no longer in use.
Before operations to store or retrieve records, it is necessary to open a database file and connect the q-gram database object to it. The function `tcqdbopen' is used to open a database file and the function `tcqdbclose' is used to close the database file. To avoid data loss or corruption, it is important to close every database file when it is no longer in use.
The constant `tdversion' is the string containing the version information.
extern const char *tdversion;
The function `tcqdberrmsg' is used in order to get the message string corresponding to an error code.
const char *tcqdberrmsg(int ecode);
The function `tcqdbnew' is used in order to create a q-gram database object.
TCQDB *tcqdbnew(void);
The function `tcqdbdel' is used in order to delete a q-gram database object.
void tcqdbdel(TCQDB *qdb);
The function `tcqdbecode' is used in order to get the last happened error code of a q-gram database object.
int tcqdbecode(TCQDB *qdb);
The function `tcqdbtune' is used in order to set the tuning parameters of a q-gram database object.
bool tcqdbtune(TCQDB *qdb, int64_t etnum, uint8_t opts);
The function `tcqdbsetcache' is used in order to set the caching parameters of a q-gram database object.
bool tcqdbsetcache(TCQDB *qdb, int64_t icsiz, int32_t lcnum);
The function `tcqdbsetfwmmax' is used in order to set the maximum number of forward matching expansion of a q-gram database object.
bool tcqdbsetfwmmax(TCQDB *qdb, uint32_t fwmmax);
The function `tcqdbopen' is used in order to open a q-gram database object.
bool tcqdbopen(TCQDB *qdb, const char *path, int omode);
The function `tcqdbclose' is used in order to close a q-gram database object.
bool tcqdbclose(TCQDB *qdb);
The function `tcqdbput' is used in order to store a record into a q-gram database object.
bool tcqdbput(TCQDB *qdb, int64_t id, const char *text);
The function `tcqdbout' is used in order to remove a record of a q-gram database object.
bool tcqdbout(TCQDB *qdb, int64_t id, const char *text);
The function `tcqdbsearch' is used in order to search a q-gram database.
uint64_t *tcqdbsearch(TCQDB *qdb, const char *word, int smode, int *np);
The function `tcqdbsync' is used in order to synchronize updated contents of a q-gram database object with the file and the device.
bool tcqdbsync(TCQDB *qdb);
The function `tcqdboptimize' is used in order to optimize the file of a q-gram database object.
bool tcqdboptimize(TCQDB *qdb);
The function `tcqdbvanish' is used in order to remove all records of a q-gram database object.
bool tcqdbvanish(TCQDB *qdb);
The function `tcqdbcopy' is used in order to copy the database file of a q-gram database object.
bool tcqdbcopy(TCQDB *qdb, const char *path);
The function `tcqdbpath' is used in order to get the file path of a q-gram database object.
const char *tcqdbpath(TCQDB *qdb);
The function `tcqdbtnum' is used in order to get the number of tokens of a q-gram database object.
uint64_t tcqdbtnum(TCQDB *qdb);
The function `tcqdbfsiz' is used in order to get the size of the database file of a q-gram database object.
uint64_t tcqdbfsiz(TCQDB *qdb);
The following code is an example of using a q-gram database.
#include <tcqdb.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv){
  TCQDB *qdb;
  int ecode, rnum, i;
  uint64_t *result;

  /* create the object */
  qdb = tcqdbnew();

  /* open the database */
  if(!tcqdbopen(qdb, "casket", QDBOWRITER | QDBOCREAT)){
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "open error: %s\n", tcqdberrmsg(ecode));
  }

  /* store records */
  if(!tcqdbput(qdb, 1, "George Washington") ||
     !tcqdbput(qdb, 2, "John Adams") ||
     !tcqdbput(qdb, 3, "Thomas Jefferson")){
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "put error: %s\n", tcqdberrmsg(ecode));
  }

  /* search records */
  result = tcqdbsearch(qdb, "John", QDBSSUBSTR, &rnum);
  if(result){
    for(i = 0; i < rnum; i++){
      printf("%d\n", (int)result[i]);
    }
    free(result);
  } else {
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "search error: %s\n", tcqdberrmsg(ecode));
  }

  /* close the database */
  if(!tcqdbclose(qdb)){
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "close error: %s\n", tcqdberrmsg(ecode));
  }

  /* delete the object */
  tcqdbdel(qdb);

  return 0;
}
To use the core API easily, the commands `dysttest' and `dystmgr' are provided. To use the q-gram database API easily, the commands `tcqtest' and `tcqmgr' are provided.
The command `dysttest' is a utility for facility test and performance test of the core API. This command is used in the following format. `path' specifies the path of a database directory. `rnum' specifies the number of iterations.
dysttest write [-tl] [-td|-tb] [-er num] [-ic num] [-xnt] [-nl|-nb] [-la num] [-en] path rnum
dysttest read [-nl|-nb] [-la num] [-lm num] [-en] [-sp|-ss|-sf|-st|-stp|-sts] path rnum
dysttest wicked [-tl] [-td|-tb] [-er num] [-ic num] [-nl|-nb] [-la num] [-en] path rnum
Options feature the following.
This command returns 0 on success and another value on failure.
The command `dystmgr' is a utility for test and debugging of the core API and its applications. `path' specifies the path of a database directory. `ernum' specifies the expected number of records. `etnum' specifies the expected number of tokens. `id' specifies the ID number of a record. `text' specifies the text of a record. `word' specifies a search word. `file' specifies the input file.
dystmgr create [-tl] [-td|-tb] path [ernum [etnum]]
dystmgr inform [-nl|-nb] path
dystmgr put [-nl|-nb] path id text
dystmgr out [-nl|-nb] path id
dystmgr get [-nl|-nb] path id
dystmgr search [-nl|-nb] [-eu|-ei|-ed] [-sp|-ss|-sf|-st|-stp|-sts] [-max num] [-ph] [-pv] path [word...]
dystmgr list [-nl|-nb] [-max num] [-pv] path
dystmgr optimize [-nl|-nb] path
dystmgr importtsv [-ic num] [-nl|-nb] path [file]
dystmgr version
Options feature the following.
This command returns 0 on success and another value on failure.
The command `tcqtest' is a utility for facility test and performance test of the q-gram database API. This command is used in the following format. `path' specifies the path of a database directory. `rnum' specifies the number of iterations.
tcqtest write [-tl] [-td|-tb] [-et num] [-ic num] [-nl|-nb] [-la num] [-en] [-rc] [-ra] [-rs] path rnum
tcqtest read [-nl|-nb] [-la num] [-lm num] [-en] [-rc] [-ra] [-rs] [-sp|-ss|-sf] path rnum
tcqtest wicked [-tl] [-td|-tb] [-et num] [-ic num] [-nl|-nb] [-la num] [-en] [-rc] [-ra] [-rs] path rnum
Options feature the following.
This command returns 0 on success and another value on failure.
The command `tcqmgr' is a utility for test and debugging of the q-gram database API and its applications. `path' specifies the path of a database directory. `etnum' specifies the expected number of tokens. `id' specifies the ID number of a record. `text' specifies the text of a record. `word' specifies a search word. `file' specifies the input file.
tcqmgr create [-tl] [-td|-tb] path [etnum]
tcqmgr inform [-nl|-nb] path
tcqmgr put [-nl|-nb] [-rc] [-ra] [-rs] path id text
tcqmgr out [-nl|-nb] [-rc] [-ra] [-rs] path id text
tcqmgr search [-nl|-nb] [-rc] [-ra] [-rs] [-eu|-ed] [-sp|-ss|-sf] [-max num] [-ph] path [word...]
tcqmgr optimize [-nl|-nb] path
tcqmgr importtsv [-ic num] [-nl|-nb] [-rc] [-ra] [-rs] path [file]
tcqmgr normalize [-rc] [-ra] [-rs] text
tcqmgr version
Options feature the following.
This command returns 0 on success and another value on failure.
In this tutorial, we use the command `dystmgr' of the core API and try to make and search an indexed database.
To begin with, make the database from a TSV (tab-separated values) file. Each line of the TSV file represents a record. The first field specifies the ID number of the record, and the second field specifies the text. The ID number can be any positive number. The encoding of the text should be UTF-8. The following is an example.
1	United States
33	France
34	Spain
44	United Kingdom
49	Germany
55	Brazil
81	Japan
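A line in this format can be split at the first tab. The following is a minimal sketch of such a parser; the function `parse_tsv_line' is a hypothetical helper and does not show how `dystmgr importtsv' itself reads the file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "ID<TAB>text" line in place.
   On success, store the ID in `*idp', point `*textp' at the text inside
   `line' (with any trailing line break removed), and return 1.
   Return 0 if there is no tab or the ID is not a positive decimal number. */
int parse_tsv_line(char *line, int64_t *idp, char **textp){
  char *tab, *end;
  long long id;
  assert(line && idp && textp);
  tab = strchr(line, '\t');
  if(!tab) return 0;
  *tab = '\0';                              /* terminate the ID field */
  id = strtoll(line, &end, 10);
  if(end == line || *end != '\0' || id < 1) return 0;
  *textp = tab + 1;
  (*textp)[strcspn(*textp, "\r\n")] = '\0'; /* strip the line break */
  *idp = (int64_t)id;
  return 1;
}
```

The resulting ID number and text have the types expected by the function `tcidbput', so each parsed line could be stored with one call.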
If the TSV file is named `calling.tsv', it can be indexed by the following command. As a result, the database `casket' is generated.
dystmgr importtsv casket calling.tsv
It is possible to add a record to an existing database. To add "China" with the ID number 83, perform the following command.
dystmgr put casket 83 "China"
It is possible to remove a record from a database. To remove the record with the ID number 55, perform the following command.
dystmgr out casket 55
To print all records in a database, perform the following command.
dystmgr list -pv casket
To search for records including "united", perform the following command.
dystmgr search -pv casket "united"
The search expression of the command `dystmgr search' (the API function `tcidbsearch2') is called a "compound expression". This section describes the typical use of the compound expression. As a first example, the following expression searches for records including "john" and "doe". That is, white space characters are treated as operators of logical intersection.
john doe
The following is equivalent to the above.
john && doe
Logical union is also supported. Use the operator "||" instead of "&&". For example, the following expression searches for records including "john" or "james".
john || james
The token "john" matches such words as "johnson", "johnny", "demijohn" and so on. However, a token expression between "[[" and "]]" matches the word exactly. To search for records including words exactly matching "john", specify the following expression.
[[john]]
The wild card "*" can be used in the token expression. To search for records including words beginning with "john", specify the following expression.
[[john*]]
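The difference between a plain token and the "[[...]]" forms can be sketched with ordinary string functions. The following example is an illustration under several assumptions and is not the library's matcher: it handles ASCII only, treats runs of alphanumeric characters as words, and does not ignore accent marks as the real search does. The function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Compare the first `n' bytes of `a' and `b', ignoring ASCII case. */
int ci_equal(const char *a, const char *b, size_t n){
  size_t i;
  for(i = 0; i < n; i++){
    if(tolower((unsigned char)a[i]) != tolower((unsigned char)b[i])) return 0;
  }
  return 1;
}

/* Plain token: the pattern may occur anywhere, even inside a longer word. */
int match_substr(const char *text, const char *pat){
  size_t tlen = strlen(text), plen = strlen(pat), i;
  assert(text && pat);
  for(i = 0; i + plen <= tlen; i++){
    if(ci_equal(text + i, pat, plen)) return 1;
  }
  return 0;
}

/* [[pat]] when `prefix' is 0, [[pat*]] when `prefix' is 1.
   The pattern must start at a word boundary; for an exact match it must
   also end at one. */
int match_word(const char *text, const char *pat, int prefix){
  size_t tlen = strlen(text), plen = strlen(pat), i;
  assert(text && pat);
  for(i = 0; i + plen <= tlen; i++){
    if(i > 0 && isalnum((unsigned char)text[i-1])) continue;
    if(!ci_equal(text + i, pat, plen)) continue;
    /* text[i+plen] is at worst the terminating NUL, which is a boundary */
    if(prefix || !isalnum((unsigned char)text[i+plen])) return 1;
  }
  return 0;
}
```

With these definitions, `match_substr("Andrew Johnson", "john")' is true while `match_word("Andrew Johnson", "john", 0)' is false, which mirrors the difference between the expressions `john' and `[[john]]'.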
Double quotation marks ("") are useful to disable the meta characters described above and to search for a phrase. So, to search for "xyz cookie" but not "xyzcookie", "cookie xyz", or "xyz abc cookie", specify the following expression.
"xyz cookie"
The above expression also matches "vwxyz cookie" and "xyz cookies". So, to search for the exact word sequence "xyz cookie", specify the following expression.
[[xyz cookie]]
The priority of "||" is higher than that of "&&". So, the following expression searches for records including "english" or "british", and also including "bread" or "roll" or "bun".
english || british bread || roll || bun
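The grouping implied by this priority can be sketched as follows. This is a simplified evaluator for a single record, assuming plain tokens only (no quotation marks or "[[...]]" forms) and ASCII case folding; the function names `contains_ci' and `eval_expr' are hypothetical and the library's own evaluation may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if `text' contains `pat', ignoring ASCII case. */
int contains_ci(const char *text, const char *pat){
  size_t tlen = strlen(text), plen = strlen(pat), i, j;
  for(i = 0; i + plen <= tlen; i++){
    for(j = 0; j < plen; j++){
      if(tolower((unsigned char)text[i+j]) != tolower((unsigned char)pat[j])) break;
    }
    if(j == plen) return 1;
  }
  return 0;
}

/* Evaluate `expr' against `text'. Tokens joined by "||" form one group;
   groups are combined by "&&" or plain white space. Because "||" binds
   tighter, the result is the AND of the OR-groups.
   An empty expression matches nothing. */
int eval_expr(const char *text, const char *expr){
  int result = 1, group = 0, have_group = 0, pending_or = 0;
  const char *p = expr;
  char tok[128];
  assert(text && expr);
  for(;;){
    size_t len;
    p += strspn(p, " \t");                  /* skip white space */
    if(*p == '\0') break;
    len = strcspn(p, " \t");                /* length of this token */
    if(len == 2 && strncmp(p, "||", 2) == 0){
      pending_or = 1;                       /* next token joins the group */
    } else if(len == 2 && strncmp(p, "&&", 2) == 0){
      pending_or = 0;                       /* same as white space */
    } else {
      int hit;
      size_t n = len < sizeof(tok) - 1 ? len : sizeof(tok) - 1;
      memcpy(tok, p, n);
      tok[n] = '\0';
      hit = contains_ci(text, tok);
      if(pending_or && have_group){
        group = group || hit;               /* extend the current OR-group */
      } else {
        if(have_group) result = result && group;  /* close the previous group */
        group = hit;                        /* start a new OR-group */
        have_group = 1;
      }
      pending_or = 0;
    }
    p += len;
  }
  return have_group ? (result && group) : 0;
}
```

With the expression above, a record "British bun recipe" matches because both groups are satisfied, while "English tea" does not because the second group has no hit.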
The CGI script `dystsearch.cgi' is provided to search a database through a Web interface. As it is installed as `/usr/local/libexec/dystsearch.cgi', copy it into a directory published by the Web server. The database to be searched should be named `casket' and placed in the same directory as the CGI script.
Then, access the URL of the CGI script with a Web browser. A search form and some options are displayed there.
Tokyo Dystopia is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License or any later version.
Tokyo Dystopia is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with Tokyo Dystopia (See the file `COPYING'); if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
Tokyo Dystopia was written by Mikio Hirabayashi. You can contact the author by e-mail to `mikio@users.sourceforge.net'.