[Share][Original] Importing third-party dictionaries into the ibus input method: phrase_converter_for_ibus


#1

Post by fracting » 2009-03-12 7:21

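Update, 2010-12-01: this post is obsolete
Because ibus has been upgraded, this script has long been incompatible with newer versions of ibus (roughly since late 2009). Meanwhile yongsun, the developer of sunpinyin, wrote convenient import tools long ago, so I stopped updating this post. I did not expect it to keep climbing to the top of Google's results. Many of you probably arrived here from Google and were misled by me, and I apologize for that!
I have also collected some useful links below. Whether you were misled by this post earlier or find it through Google later, I hope they help!
yongsun's scripts target sunpinyin, but with some changes they should also work with ibus-pinyin. If you have adapted them, or know that someone already has, please share it in a reply and I will add it to the first post. If nobody has, I will do it myself when I have time :)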

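Importing Sogou input method cell dictionaries:
http://yongsun.me/2010/07/%E5%AF%BC%E5% ... %E5%BA%93/
Importing Google and Sogou input method user dictionaries:
http://yongsun.me/2010/04/%E5%AF%BC%E5% ... %E5%85%B8/
Importing a FIT user dictionary into the SunPinyin user dictionary:
http://yongsun.me/2010/04/%E5%B0%86fit% ... %E5%85%B8/
Importing a QIM user dictionary into the SunPinyin user dictionary:
http://yongsun.me/2010/04/%E5%B0%86qim% ... %E5%85%B8/
IME Words Library Converter / 深蓝词库转换 (input methods currently supported. PC: Sogou Pinyin, QQ Pinyin, QQ Wubi (Chinese characters only), Google Pinyin, Sogou Wubi, Ziguang Pinyin, Pinyin JiaJia. Mobile: QQ Mobile Pinyin, Baidu Mobile Pinyin):
http://code.google.com/p/imewlconverter/
Parsing Sogou scel dictionaries (converting to the fcitx dictionary format):
viewtopic.php?f=8&t=250136&start=0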



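Many people are used to the Sogou and Google input methods on Windows. Sogou's dictionary is very strong, and many useful cell dictionaries can be downloaded for free. ibus is a very good input method on Linux, but its dictionary is still not ideal and it does not support importing external dictionaries. I wrote a Python script that imports Sogou cell dictionaries into ibus.
[Notes]
1. "Cell dictionary" here means the plain-text *.txt version, not the *.scel format. Sogou's cell dictionary download page offers it.
2. Cell dictionaries downloaded from the web are usually GB18030-encoded and must be converted to UTF-8 before use.
3. The script itself must be saved as UTF-8.
4. Python and sqlite3 must be installed.
5. Close ibus before running the script; otherwise writing to the database will fail.
6. Backing up your personal dictionary ~/.ibus/pinyin/user.db first is strongly recommended.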
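Note 2 can be handled with any converter. As a minimal sketch, assuming the downloaded file really is GB18030, the re-encoding could look like this in Python. The function name `convert_to_utf8` and the file paths are made up for illustration and are not part of the script in this post.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and return the number of lines written."""
    count = 0
    # newline="" keeps the original line endings untouched on both sides.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)
                count += 1
    return count
```

For example, `convert_to_utf8("cell.txt", "cell-utf8.txt")` would write a UTF-8 copy and leave the original file alone.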





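What the script does: it reads entries from a text file, one per line. If an entry is in neither the ibus public dictionary (/usr/share/ibus-pinyin/engine/py.db) nor the personal user dictionary (~/.ibus/pinyin/user.db), it is added to the user dictionary.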
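The "in neither dictionary, so add it" rule can be tried out without touching any real ibus files. The sketch below uses two in-memory SQLite databases with a deliberately simplified one-column `py_phrase` table (the real table has more columns), and the helper names `make_db`, `is_known` and `import_new` are hypothetical.

```python
import sqlite3


def make_db(phrases):
    """Build an in-memory stand-in holding a one-column py_phrase table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE py_phrase (phrase TEXT)")
    db.executemany("INSERT INTO py_phrase VALUES (?)", [(p,) for p in phrases])
    db.commit()
    return db


def is_known(db, phrase):
    """Return True if the phrase already has a row in this database."""
    row = db.execute(
        "SELECT 1 FROM py_phrase WHERE phrase = ? LIMIT 1", (phrase,)
    ).fetchone()
    return row is not None


def import_new(public_db, user_db, candidates):
    """Add candidates found in neither database to user_db; return them."""
    added = []
    for phrase in candidates:
        if is_known(public_db, phrase) or is_known(user_db, phrase):
            continue
        user_db.execute("INSERT INTO py_phrase VALUES (?)", (phrase,))
        added.append(phrase)
    user_db.commit()
    return added
```

Because the lookup runs against `user_db` on every iteration, a phrase that appears twice in the input is only inserted once.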

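Usage:
1. Close the ibus input method (otherwise writing to the database will fail)
2. Run the script
2.0 Change to the directory containing phrase_converter_for_ibus.py
2.1 sudo python phrase_converter_for_ibus.py
2.2 When prompted, enter your home directory, e.g. /home/fracting (first version only; the code below reads the HOME environment variable instead)
2.3 When prompted, enter the full path of the dictionary to import, e.g. /home/fracting/细胞词库1.txt
3. Restart ibus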
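Since the script writes straight into user.db, it is worth making the recommended backup before step 2. A minimal sketch of a timestamped copy follows; the helper name `backup_user_db` and the `.bak` naming scheme are assumptions, not something the script does.

```python
import shutil
import time


def backup_user_db(db_path):
    """Copy db_path next to itself with a timestamp suffix; return the new path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_path = "%s.%s.bak" % (db_path, stamp)
    # copy2 also preserves the modification time of the original file.
    shutil.copy2(db_path, backup_path)
    return backup_path
```

Calling `backup_user_db("/home/fracting/.ibus/pinyin/user.db")` with ibus closed would leave a copy such as `user.db.<timestamp>.bak` in the same directory.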



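Known issues:
1. If a character has more than one reading, only its first pinyin is imported. I can't think of a good solution yet and would welcome help ^_^
2. Some characters do not exist in ibus; phrases containing them cannot be imported.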
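One possible direction for issue 1, offered only as a sketch: instead of taking the first reading of each character, enumerate every combination of readings and let later filtering decide which to keep. The `readings` mapping and the function name below are hypothetical; in the script the readings would come from py.db rather than from a dict.

```python
from itertools import product


def reading_combinations(phrase, readings):
    """Return every pinyin sequence for phrase, or [] if a character is unknown."""
    per_char = []
    for char in phrase:
        options = readings.get(char)
        if not options:
            return []  # mirrors issue 2: an unknown character blocks the phrase
        per_char.append(options)
    return [list(combo) for combo in product(*per_char)]


# Hypothetical data: one character with two readings, one with a single reading.
sample = {u"重": ["zhong", "chong"], u"庆": ["qing"]}
combos = reading_combinations(u"重庆", sample)
# combos == [["zhong", "qing"], ["chong", "qing"]]
```

This trades the missing-reading problem for extra rows, some with wrong readings, so choosing the correct one would still need a word-level pronunciation source.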

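Other notes:
1. My respects to the ibus developers, and to all free software developers!!
2. I hope ibus soon gets native support for importing external dictionaries.
3. If anyone is willing to help improve this little program, please contact me: fracting@gmail.com
4. Planned improvements (my own ideas; suggestions are very welcome):
4.0 Remove the annoying step of typing the home directory by hand (I have only just started learning Python and don't know how; pointers from veterans welcome ^_^) Solved, thanks to wkt ^_^
4.1 Stop and restart ibus automatically from the script
4.2 Detect and convert the encoding automatically, so the dictionary no longer has to be converted by hand
4.3 Batch import
4.4 Support more dictionary formats, including the Google Pinyin format and dictionary files that carry user word frequencies
4.5 Support importing English words, custom phrases (including special characters), and so on
4.6 Support other languages
4.7 Support export, in several formats
4.8 Support the various input methods on every platform
4.9 Support network sync (I plan to build a free personal dictionary hosting site on Google App Engine for syncing dictionaries over the network; privacy and security have to be solved first, of course)
4.10 Support sharing of personalized dictionaries (group users by their word-frequency distribution; when a new word starts being used by some members of a group, it is added automatically to the dictionaries of the other members; privacy and security again come first)
# Personally, I think that if item 4.10 can be realized, Sogou's cell dictionary feature will become history.
5. Obviously I cannot do all of this in the short term. I very much hope others will join in and contribute their share to free software. I also hope to take part in ibus development one day; friends, wish me luck in joining the ibus team!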
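Item 4.2 could start from something as simple as trying a short list of encodings in order. The sketch below is one way to do that, not a full detector; the function name and the candidate list are assumptions. UTF-8 is tried first because it rejects most GB18030 byte sequences, whereas GB18030 will happily decode a lot of unrelated data.

```python
def decode_phrase_file(raw_bytes, candidates=("utf-8-sig", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit: %r" % (candidates,))
```

The `utf-8-sig` codec also strips a leading byte order mark if the file has one, which keeps it out of the first phrase.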


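P.S. This is my first contribution to open source in the half year I have been using Linux. I have found that only after contributing to open source do you come to love it more deeply. Goodbye, no, farewell forever, M$!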

The full code follows (copy it and save it as phrases_converter_for_ibus.py; the encoding must be UTF-8. I could not upload an attachment ^_^):



Code:

#!/usr/bin/python
# -*- encoding: utf-8 -*-
#phrases_converter_for_ibus version 0.9
#code by fracting
#Anyone who wants to help, please contact fracting@gmail.com
#Thanks to the developers of IBus!

import os
import sqlite3 as sqlite
INV_PINYIN_DICT = {
1 : "a",2 : "ai",3 : "an",4 : "ang",5 : "ao",6 : "ba",7 : "bai",8 : "ban",9 : "bang",10 : "bao",11 : "bei",12 : "ben",13 : "beng",14 : "bi",15 : "bian",16 : "biao",17 : "bie",18 : "bin",19 : "bing",20 : "bo",21 : "bu",22 : "ca",23 : "cai",24 : "can",25 : "cang",26 : "cao",27 : "ce",28 : "cen",29 : "ceng",30 : "ci",31 : "cong",32 : "cou",33 : "cu",34 : "cuan",35 : "cui",36 : "cun",37 : "cuo",38 : "cha",39 : "chai",40 : "chan",41 : "chang",42 : "chao",43 : "che",44 : "chen",45 : "cheng",46 : "chi",47 : "chong",48 : "chou",49 : "chu",50 : "chuai",51 : "chuan",52 : "chuang",53 : "chui",54 : "chun",55 : "chuo",56 : "da",57 : "dai",58 : "dan",59 : "dang",60 : "dao",61 : "de",62 : "dei",63 : "den",64 : "deng",65 : "di",66 : "dia",67 : "dian",68 : "diao",69 : "die",70 : "ding",71 : "diu",72 : "dong",73 : "dou",74 : "du",75 : "duan",76 : "dui",77 : "dun",78 : "duo",79 : "e",80 : "ei",81 : "en",82 : "er",83 : "fa",84 : "fan",85 : "fang",86 : "fei",87 : "fen",88 : "feng",89 : "fo",90 : "fou",91 : "fu",92 : "ga",93 : "gai",94 : "gan",95 : "gang",96 : "gao",97 : "ge",98 : "gei",99 : "gen",100 : "geng",101 : "gong",102 : "gou",103 : "gu",104 : "gua",105 : "guai",106 : "guan",107 : "guang",108 : "gui",109 : "gun",110 : "guo",111 : "ha",112 : "hai",113 : "han",114 : "hang",115 : "hao",116 : "he",117 : "hei",118 : "hen",119 : "heng",120 : "hong",121 : "hou",122 : "hu",123 : "hua",124 : "huai",125 : "huan",126 : "huang",127 : "hui",128 : "hun",129 : "huo",130 : "ji",131 : "jia",132 : "jian",133 : "jiang",134 : "jiao",135 : "jie",136 : "jin",137 : "jing",138 : "jiong",139 : "jiu",140 : "ju",141 : "juan",142 : "jue",143 : "jun",144 : "ka",145 : "kai",146 : "kan",147 : "kang",148 : "kao",149 : "ke",150 : "kei",151 : "ken",152 : "keng",153 : "kong",154 : "kou",155 : "ku",156 : "kua",157 : "kuai",158 : "kuan",159 : "kuang",160 : "kui",161 : "kun",162 : "kuo",163 : "la",164 : "lai",165 : "lan",166 : "lang",167 : "lao",168 : "le",169 : "lei",170 : "leng",171 : "li",172 : "lia",173 : 
"lian",174 : "liang",175 : "liao",176 : "lie",177 : "lin",178 : "ling",179 : "liu",180 : "lo",181 : "long",182 : "lou",183 : "lu",184 : "luan",185 : "lue",186 : "lun",187 : "luo",188 : "lv",189 : "lve",190 : "ma",191 : "mai",192 : "man",193 : "mang",194 : "mao",195 : "me",196 : "mei",197 : "men",198 : "meng",199 : "mi",200 : "mian",201 : "miao",202 : "mie",203 : "min",204 : "ming",205 : "miu",206 : "mo",207 : "mou",208 : "mu",209 : "na",210 : "nai",211 : "nan",212 : "nang",213 : "nao",214 : "ne",215 : "nei",216 : "nen",217 : "neng",218 : "ni",219 : "nian",220 : "niang",221 : "niao",222 : "nie",223 : "nin",224 : "ning",225 : "niu",226 : "ng",227 : "nong",228 : "nou",229 : "nu",230 : "nuan",231 : "nue",232 : "nuo",233 : "nv",234 : "nve",235 : "o",236 : "ou",237 : "pa",238 : "pai",239 : "pan",240 : "pang",241 : "pao",242 : "pei",243 : "pen",244 : "peng",245 : "pi",246 : "pian",247 : "piao",248 : "pie",249 : "pin",250 : "ping",251 : "po",252 : "pou",253 : "pu",254 : "qi",255 : "qia",256 : "qian",257 : "qiang",258 : "qiao",259 : "qie",260 : "qin",261 : "qing",262 : "qiong",263 : "qiu",264 : "qu",265 : "quan",266 : "que",267 : "qun",268 : "ran",269 : "rang",270 : "rao",271 : "re",272 : "ren",273 : "reng",274 : "ri",275 : "rong",276 : "rou",277 : "ru",278 : "ruan",279 : "rui",280 : "run",281 : "ruo",282 : "sa",283 : "sai",284 : "san",285 : "sang",286 : "sao",287 : "se",288 : "sen",289 : "seng",290 : "si",291 : "song",292 : "sou",293 : "su",294 : "suan",295 : "sui",296 : "sun",297 : "suo",298 : "sha",299 : "shai",300 : "shan",301 : "shang",302 : "shao",303 : "she",304 : "shei",305 : "shen",306 : "sheng",307 : "shi",308 : "shou",309 : "shu",310 : "shua",311 : "shuai",312 : "shuan",313 : "shuang",314 : "shui",315 : "shun",316 : "shuo",317 : "ta",318 : "tai",319 : "tan",320 : "tang",321 : "tao",322 : "te",323 : "tei",324 : "teng",325 : "ti",326 : "tian",327 : "tiao",328 : "tie",329 : "ting",330 : "tong",331 : "tou",332 : "tu",333 : "tuan",334 : "tui",335 : "tun",336 : 
"tuo",337 : "wa",338 : "wai",339 : "wan",340 : "wang",341 : "wei",342 : "wen",343 : "weng",344 : "wo",345 : "wu",346 : "xi",347 : "xia",348 : "xian",349 : "xiang",350 : "xiao",351 : "xie",352 : "xin",353 : "xing",354 : "xiong",355 : "xiu",356 : "xu",357 : "xuan",358 : "xue",359 : "xun",360 : "ya",361 : "yan",362 : "yang",363 : "yao",364 : "ye",365 : "yi",366 : "yin",367 : "ying",368 : "yo",369 : "yong",370 : "you",371 : "yu",372 : "yuan",373 : "yue",374 : "yun",375 : "za",376 : "zai",377 : "zan",378 : "zang",379 : "zao",380 : "ze",381 : "zei",382 : "zen",383 : "zeng",384 : "zi",385 : "zong",386 : "zou",387 : "zu",388 : "zuan",389 : "zui",390 : "zun",391 : "zuo",392 : "zha",393 : "zhai",394 : "zhan",395 : "zhang",396 : "zhao",397 : "zhe",398 : "zhen",399 : "zheng",400 : "zhi",401 : "zhong",402 : "zhou",403 : "zhu",404 : "zhua",405 : "zhuai",406 : "zhuan",407 : "zhuang",408 : "zhui",409 : "zhun",410 : "zhuo"
              }
    
    
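# Public dictionary shipped with ibus-pinyin (only read) and the per-user
# dictionary that receives the new phrases.
db_in = sqlite.connect('/usr/share/ibus-pinyin/engine/py.db')
cur_in = db_in.cursor()

homedir = os.environ['HOME']
dbname = os.path.join(homedir, '.ibus', 'pinyin', 'user.db')
# dbname = '/home/fracting/.ibus/pinyin/user.db'  # for test
db_out = sqlite.connect(dbname)
cur_out = db_out.cursor()

# raw_input on Python 2, input on Python 3
try:
    ask = raw_input
except NameError:
    ask = input
filename = ask('Please input the filename of the phrase table\n'
               ' such as: /home/your_account/phrase.txt \n')
# filename = '/home/fracting/1.txt'  # for test
# Open in binary mode and decode each line below, so that sqlite always
# receives Unicode text instead of 8-bit byte strings.
f = open(filename, 'rb')

record_num = 0   # phrases newly imported into user.db
py_db_num = 0    # phrases already present in py.db
user_db_num = 0  # phrases already present in user.db
error_num = 0    # phrases skipped because a character is missing from py.db

for raw_line in f:
    phrase = raw_line.decode('utf-8-sig').strip()
    if not phrase:
        continue  # skip blank lines

    # Check whether the phrase already exists in either dictionary.
    cur_in.execute('select * from py_phrase where phrase=?', [phrase])
    exist1 = cur_in.fetchone()
    cur_out.execute('select * from py_phrase where phrase=?', [phrase])
    exist2 = cur_out.fetchone()
    if exist1 is not None:
        py_db_num += 1
        continue
    if exist2 is not None:
        user_db_num += 1
        continue

    # One row of py_phrase has 13 columns. This script fills:
    # [0] character count, [1..4] column 1 of the first four characters' rows,
    # [5] pinyin of any further characters joined by "'",
    # [6..9] column 6 of the first four characters' rows,
    # [10] the phrase itself, [12] user frequency (set to 1).
    record = [''] * 13
    record[0] = len(phrase)
    record[10] = phrase
    record[12] = 1
    yx = []
    missing = False
    for i, char in enumerate(phrase):
        # Look the character up on its own; only its first row is used,
        # so a character with several readings gets the first one.
        cur_in.execute('select * from py_phrase where phrase=?', [char])
        pinyin = cur_in.fetchone()
        if pinyin is None:
            missing = True
            break
        if i < 4:
            record[i + 1] = pinyin[1]
            record[i + 6] = pinyin[6]
        else:
            yx.append(INV_PINYIN_DICT[pinyin[1]])
    if missing:
        error_num += 1
        print(phrase)  # report the phrase that could not be imported
        continue
    record[5] = "'".join(yx)
    cur_out.execute('insert into py_phrase values (?,?,?,?,?,?,?,?,?,?,?,?,?)', record)
    record_num += 1

f.close()
db_out.commit()
cur_in.close()
cur_out.close()
db_in.close()
db_out.close()

print("%d phrases are already included in py.db" % py_db_num)
print("%d phrases are already included in user.db" % user_db_num)
print("%d new phrases have been imported to user.db" % record_num)
print("%d phrases could not be imported to user.db because some of their characters do not exist in py.db" % error_num)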




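Last edited by fracting on 2010-12-01 20:32; edited 3 times in total.
Some common misconceptions about using Wine:
viewtopic.php?f=121&t=363147

Sharing Wine debugging experience, season 2: garbled Chinese text in Dr.com under Wine
viewtopic.php?f=121&t=385111

Being the sweeper monk of an open source community (part 1)
viewtopic.php?f=80&t=389615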

#2

Post by bones7456 » 2009-03-12 8:26

OP, please don't post duplicates.
PS: Enable BBCode and you can paste code.

#3

Post by wkt » 2009-03-12 9:03

To get the user's home directory: os.environ['HOME']
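Building on that suggestion, the user dictionary path can be derived without asking the user anything. This is a small sketch; the helper name `user_db_path` and the fallback to `os.path.expanduser` are additions for illustration.

```python
import os


def user_db_path(home=None):
    """Return <home>/.ibus/pinyin/user.db, defaulting to the current user's home."""
    if home is None:
        # Fall back to expanduser when HOME is unset or empty.
        home = os.environ.get("HOME") or os.path.expanduser("~")
    return os.path.join(home, ".ibus", "pinyin", "user.db")
```

One caveat: when the script runs under sudo, HOME may point at root's home rather than yours, depending on how sudo is configured, so it is worth printing the resolved path before writing to it.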

#4

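Post by fracting » 2009-03-12 10:22

Oh, I see. Thanks, and sorry for the trouble, moderator. I didn't know posts had to be approved first, or that attachments couldn't be uploaded, so I posted twice. Sorry about that.
Thanks to the poster above!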

#5

Post by pentie » 2009-03-30 15:48

Nice. I imported an idiom dictionary, which makes up for the gaps in the stock dictionary.

#6

Post by pentie » 2009-03-30 17:17

The ibus personal dictionary seems quite large. Mine is already over 10 MB, so uploading it to GAE would be rather difficult...

#7

Post by davio3g » 2009-03-30 18:08

I use Wanneng Wubi myself and it is wonderful beyond words. Installation details:
http://www.a0602.com/thread-101-1-1.html

#8

Post by Havanna » 2009-04-06 23:20

Bookmarking this; I'll try it when I get back.

#9

Post by gcell » 2009-04-06 23:54

Mark. So much good software has been coming out lately!

#10

Post by zhen0ayanamist » 2009-04-27 20:53

Beautiful. Thank you very much, OP!

#11

Post by wangdu2002 » 2009-04-27 20:57

Thanks for the hard work and the contribution, OP. I use Fcitx myself, but I still support your work. Very valuable, bump!

#13

Post by ensonmj » 2009-05-15 9:54

Bump. I hope ibus supports dictionary import soon.

#14

Post by jolc » 2009-05-16 19:55

This deserves a bump.
Thanks for sharing

#15

Post by shellex » 2009-05-19 16:30

Code:

Traceback (most recent call last):
  File "/home/shellex/scripts/sougou2ibus.py", line 44, in <module>
    cur_in.execute('select * from py_phrase where phrase=?',[phrase])
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
It throws an error~ orz
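The traceback shows sqlite3 refusing a non-ASCII byte string as a query parameter, and the message itself recommends switching to Unicode strings. A likely fix, sketched here under the assumption that the phrase file is UTF-8, is to decode each phrase before it reaches `execute`. The helper names `to_text` and `phrase_exists` are hypothetical, and the one-column table is a simplified stand-in for the real `py_phrase`.

```python
import sqlite3


def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def phrase_exists(db, phrase):
    """Look up a phrase, accepting either bytes or text from the caller."""
    row = db.execute(
        "SELECT 1 FROM py_phrase WHERE phrase = ?", (to_text(phrase),)
    ).fetchone()
    return row is not None


# Simplified stand-in database for trying the helpers out.
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE py_phrase (phrase TEXT)")
demo_db.execute("INSERT INTO py_phrase VALUES (?)", (u"词库",))
```

With this in place, `phrase_exists(demo_db, raw_line.strip())` works whether `raw_line` was read as bytes or as text.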