问下stardict的字典用的是什么数据结构

liuhello · #1

最近在学习数据结构和算法想写个例子来锻炼自己的实战能力我初期的想法是写一个辞典暂时是在命令行状态的辞典。
当用户输入一个单词的时候，程序读取这个单词并且到字典里面去查询是否有这个单词的记录并且给出相应的提示。
我看了下大部分的辞典都是把字典设计成一个文件，不同字典风灾不同的文件。
现在我遇到的最大的一个问题是我的这个字典应该怎么设计才能提高查询速度并且在不同的机器件这个字典文件都可以用。

我考虑过直接把字典设计成一个二进制文件，里面保存的是很多个单词及其信息的结构体。不知道这样是否可行。

不知道大家有知道stardict中字典是怎么设计的。希望能多给点提示和指点。

kofshower · #2

自己看源码,里面写的很清楚了,至于字典格式:

Format for StarDict dictionary files
------------------------------------

StarDict homepage: http://stardict.sourceforge.net

{0}. Number and Byte-order Conventions
When you record the numbers that identify sizes, offsets, etc., you
should use 32-bit numbers, such as you might represent with a glong.

In order to make StarDict work on different platforms, these numbers
must be in network byte order. You can ensure the correct byte order
by using the g_htonl() function when creating dictionary files.
Conversely, you should use g_ntohl() when reading dictionary files.

Strings should be encoded in UTF-8.

{1}. Files
Every dictionary consists of three files:
(1). somedict.ifo
(2). somedict.idx or somedict.idx.gz
(3). somedict.dict or somedict.dict.dz

You can use gzip -9 to compress the .idx file. If the .idx file are not
compressed, the loading can be fast and save memory when using, compress it
will make the .idx file load into memory and make the quering fast when using.

You can use dictzip to compress the .dict file.
"dictzip" uses the same compression algorithm and file format as does gzip,
but provides a table that can be used to randomly access compressed blocks
in the file. The use of 50-64kB blocks for compression typically degrades
compression by less than 10%, while maintaining acceptable random access
capabilities for all data in the file. As an added benefit, files
compressed with dictzip can be decompressed with gunzip.
For more information about dictzip, refer to DICT project, please see:
http://www.dict.org

Stardict will search for the .ifo file, then open the .idx or
.idx.gz file and the .dict.dz or .dict file which is in the same directory and
has the same base name.

{2}. The ".ifo" file's format.
The .ifo file has the following format:

StarDict's dict ifo file
version=2.4.2
[options]

Note that the current "version" string must be "2.4.2". If it's not,
then StarDict will refuse to read the file.

[options]
---------
In the example above, [options] expands to any of the following lines
specifying information about the dictionary. Each option is a keyword
followed by an equal sign, then the value of that option, then a
newline. The options may be appear in any order.

Note that the dictionary must have at least a bookname, a wordcount and a
idxfilesize, or the load will fail. All other information is optional. All
strings should be encoded in UTF-8.

Available options:

bookname= // required
wordcount= // required
idxfilesize= // required
author=
email=
website=
description=
date=
sametypesequence= // very important.

wordcount is the count of word entries in .idx file, it must be right.

idxfilesize is the size(in bytes) of the .idx file, even the .idx is compressed
to a .idx.gz file, this entry must record the original .idx file's size, and it
must be right too. The .gz file don't contain its original size information,
but knowing the original size can speed up the extraction to memory, as you
don't need to call realloc() for many times.

The "sametypesequence" option is described in further detail below.

***
sametypesequence

You should first familiarize yourself with the .dict file format
described in the next section so that you can understand what effect
this option has on the .dict file.

If the sametypesequence option is set, it tells StarDict that each
word's data in the .dict file will have the same sequence of datatypes.
In this case, we expect a .dict file that's been optimized in two
ways: the type identifiers should be omitted, and the size marker for
the last data entry of each word should be omitted.

Let's consider some concrete examples of the sametypesequence option.

Suppose that a dictionary records many .wav files, and so sets:
sametypesequence=W
In this case, each word's entry in the .dict file consists solely of a
wav file. In the .dict file, you would leave out the 'W' character
before each entry, and you would also omit the 32-bit integer at the
front of each .wav entry that would normally give the entry's length.
You can do this since the length is known from the information in the
idx file.

As another example, suppose a dictionary contains phonetic information
and a meaning for each word. The sametypesequence option for this
dictionary would be:
sametypesequence=tm
Once again, you can omit the 't' and 'm' characters before each data
entry in the .dict file. In addition, you should omit the terminating
'\0' for the 'm' entry for each word in the .dict file, as the length
of the meaning string can be inferred from the length of the phonetic
string (still indicated by a terminating '\0') and the length of the
entire word entry (listed in the .idx file).

So for cases where the last data entry for each word normally requires
a terminating '\0' character, you should omit this character in the
dict file. And for cases where the last data entry for each word
normally requires an initial 32-bit number giving the length of the
field (such as WAV and PNG entries), you must omit this number in the
dictionary.

Every dictionary should try to use the sametypesequence feature to
save disk space.
***

{3}. The ".idx" file's format.
The .idx file is just a word list.

The word list is a sorted list of word entries.

Each entry in the word list contains three fields, one after the other:
word_str; // a utf-8 string terminated by '\0'.
word_data_offset; // word data's offset in .dict file
word_data_size; // word data's total size in .dict file

word_str gives the string representing this word. It's the string
that is "looked up" by the StarDict.

word_data_offset and word_data_size should both be 32-bit numbers in
network byte order.

No two entries should have the same "word_str". In other words,
(strcmp(s1, s2) != 0).

The length of "word_str" should be less than 256. In other words,
(strlen(word) < 256).

The word list must be sorted by calling stardict_strcmp() on the "word_str"
fields. If the word list order is wrong, StarDict will fail to function
correctly!

============
gint stardict_strcmp(const gchar *s1, const gchar *s2)
{
gint a;
a = g_ascii_strcasecmp(s1, s2);
if (a == 0)
return strcmp(s1, s2);
else
return a;
}
============
g_ascii_strcasecmp() is a glib function:
Unlike the BSD strcasecmp() function, this only recognizes standard
ASCII letters and ignores the locale, treating all non-ASCII characters
as if they are not letters.

stardict_strcmp() works fine with English characters, but the other
locale characters' sorting is not so good. There should be a _strcmp
function which handles the utf-8 string sorting better. If you know
one, email me ${L_SMILIES_SMILE}$

g_utf8_collate()? This is a locale-dependent funcition. So if you look
up Chinese characters while in the Chinese locale, it works fine. But
if you are in some other locale then the lookup will fail, as the
order is not the same as in the Chinese locale (which was used when
creating the dictionary).

g_utf8_to_ucs4() then do comparing? This sounds like a good solution, but..

The complete solution can be found in "Unicode Technical Standard #10: Unicode
Collation Algorithm", http://www.unicode.org/reports/tr10/

I hope glib will provide a locale-independent g_utf8_collate() soon.
http://bugzilla.gnome.org/show_bug.cgi?id=112798

{4}. The ".dict" file's format.
The .dict file is a pure data sequence, as the offset and size of each
word is recorded in the corresponding .idx file.

If the "sametypesequence" option is not used in the .ifo file, then
the .dict file has fields in the following order:
==============
word_1_data_1_type; // a single char identifying the data type
word_1_data_1_data; // the data
word_1_data_2_type;
word_1_data_2_data;
...... // the number of data entries for each word is determined by
// word_data_size in .idx file
word_2_data_1_type;
word_2_data_1_data;
......
==============
It's important to note that each field in each word indicates its
own length, as described below. The number of possible fields per
word is also not fixed, and is determined by simply reading data until
you've read word_data_size bytes for that word.

Suppose the "sametypesequence" option is used in the .idx file, and
the option is set like this:
sametypesequence=tm
Then the .dict file will look like this:
==============
word_1_data_1_data
word_1_data_2_data
word_2_data_1_data
word_2_data_2_data
......
==============
The first data entry for each word will have a terminating '\0', but
the second entry will not have a terminating '\0'. The omissions of
the type chars and of the last field's size information are the
optimizations required by the "sametypesequence" option described
above.

Type identifiers
----------------
Here are the single-character type identifiers that may be used with
the "sametypesequence" option in the .idx file, or may appear in the
dict file itself if the "sametypesequence" option is not used.

Lower-case characters signify that a field's size is determined by a
terminating '\0', while upper-case characters indicate that the data
begins with a 32-bit integer that gives the length of the data field.

'm'
Word's pure text meaning.
The data should be a utf-8 string ending with '\0'.

'l'
Word's pure text meaning.
The data is NOT a utf-8 string, but is instead a string in locale
encoding, ending with '\0'. Sometimes using this type will save disk
space, but its use is discouraged.

'g'
A utf-8 string which is marked up with the Pango text markup language.
For more information about this markup language, See the "Pango
Reference Manual."
You might have it installed locally at:
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html

't'
English phonetic string.
The data should be a utf-8 string ending with '\0'.

Here are some utf-8 phonetic characters:
Î¸ÊÅÊ§Ã°ÊÃ¦Ä±ÊÊÉÉÉÉÉÉËËËË
Ã¦ÉÉÊÓÑÅvÎ¸Ã°ÊÊËÉ¡ËËË

'y'
Chinese YinBiao.
The data should be a utf-8 string ending with '\0'.

'W'
wav file.
The data begins with a network byte-ordered glong to identify the wav
file's size, immediately followed by the file's content.

'P'
png file.
The data begins with a network byte-ordered glong to identify the png
file's size, immediately followed by the file's content.

'X'
this type identifier is reserved for experimental extensions.

{5}. Tree Dictionary
The tree dictionary support is used for information viewing, etc.

A tree dictionary contains three file: sometreedict.ifo, sometreedict.tdx.gz
and sometreedict.dict.dz.

It is better to compress the .tdx file, as it is always load into memory.

The .ifo file has the following format:

StarDict's treedict ifo file
version=2.4.2
[options]

Available options:

bookname= // required
tdxfilesize= // required
wordcount=
author=
email=
website=
description=
date=
sametypesequence=

wordcount is only used for info view in the dict manage dialog, so it is not
important in tree dictionary.

The .tdx file is just the word list.
-----------
The word list is a tree list of word entries.

Each entry in the word list contains four fields, one after the other:
word_str; // a utf-8 string terminated by '\0'.
word_data_offset; // word data's offset in .dict file
word_data_size; // word data's total size in .dict file. it can be 0.
word_subentry_count; //have many sub word this entry has, 0 means none.

Subentry is immidiately followed by its parent entry. This make the order is
just as when a tree list with all its nodes extended, then sort from top to
bottom.

The .dict file's format is the same as the normal dictionary.

{6}. More information.
You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
"src/tools/*.cpp" for more information.

If you have any questions, email me. ${L_SMILIES_SMILE}$

Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
English.

Hu Zheng <huzheng_001@163.com>
http://forlinux.yeah.net
2003.11.11

kofshower · #3

要学习的话不如加一本sqlite本质,去读sqlite的源码

xyywll · #4

我不太懂，不过 Word 或者 k-tuple 在数据库搜索中常用
Knuth 1973 年提出了允许一次错配的字符串搜索方法，可以参考 The Art of Computer Programming 第二章
这些都是基于过滤 (filtering) 的思想

呵呵，不太懂这方面，只是最近接触类似东西，权当讨论

xyywll · #5

今天遇到了类似问题，发现用后缀树(suffix tree) 来做比较适合
后缀树适合在一个很长的文本中搜索多个模式(pattern)

相关资料：
E.M.Mclreight. A Space-Economical Suffix Tree Construction Algorithm. Jrul of Algorithms,23(2) pp262-272,1976.
P.Weiner. Linear Pattern Matching Algorithms.Proc.14th IEEE Annual Symp on Switching and Automata Theory,pp1-11,1973.
E.Ukkonen. On-line Construction of Suffix Trees. Algorithmica,14(3)pp249-260,1995.
Mark Nelson.Fast String Searching with Suffix Trees.Dr. Dobb's Journal August,1996.

ergod · #6

感觉目前stardict有一个很大的软类，辞典格式有点乱。尤其是longman和oxd的字典，所谓的美化版也是一塌糊涂，

有个windows下的软件，lingoes，不知道大家用过吗，一度曾经因为它不能放弃windows。

LZ有没有兴趣做格式这方面的开发啊，lingoes的作者给出了自己辞典的一些格式，感觉可以借鉴

flyinflash · #7

楼上的，
请搜索 TualatriX （就是写 ubuntu tweak 那个家伙）
他正在使用python写一个翻译软件，有望成为 linux 下的凌格斯

我个人非常支持这类教育 & *GPL 软件的开发
我愿意出钱购买

问下stardict的字典用的是什么数据结构

问下stardict的字典用的是什么数据结构

Re: 问下stardict的字典用的是什么数据结构

Re: 问下stardict的字典用的是什么数据结构

Re: 问下stardict的字典用的是什么数据结构

Re: 问下stardict的字典用的是什么数据结构

Re: 问下stardict的字典用的是什么数据结构

Re: 问下stardict的字典用的是什么数据结构