iiopsd / elasticsearch-analysis-ik
Commit 7d91be50
Authored on Feb 13, 2013 by weixin_43283383
Parent: a6ed160a

update ik to latest version, made mode selectable
Showing 39 changed files with 5193 additions and 3265 deletions (+5193 -3265)
README.textile +12 -3
config/ik/main.dic +31 -228
config/ik/quantifier.dic +5 -1
pom.xml +6 -1
src/main/java/org/elasticsearch/index/analysis/IkAnalyzerProvider.java +4 -25
src/main/java/org/wltea/analyzer/Context.java +0 -256
src/main/java/org/wltea/analyzer/IKSegmentation.java +0 -137
src/main/java/org/wltea/analyzer/cfg/Configuration.java +27 -17
src/main/java/org/wltea/analyzer/core/AnalyzeContext.java +386 -0
src/main/java/org/wltea/analyzer/core/CJKSegmenter.java +126 -0
src/main/java/org/wltea/analyzer/core/CN_QuantifierSegmenter.java +242 -0
src/main/java/org/wltea/analyzer/core/CharacterUtil.java +102 -0
src/main/java/org/wltea/analyzer/core/IKArbitrator.java +153 -0
src/main/java/org/wltea/analyzer/core/IKSegmenter.java +154 -0
src/main/java/org/wltea/analyzer/core/ISegmenter.java +46 -0
src/main/java/org/wltea/analyzer/core/LetterSegmenter.java +296 -0
src/main/java/org/wltea/analyzer/core/Lexeme.java +284 -0
src/main/java/org/wltea/analyzer/core/LexemePath.java +256 -0
src/main/java/org/wltea/analyzer/core/QuickSortSet.java +239 -0
src/main/java/org/wltea/analyzer/dic/DictSegment.java +138 -92
src/main/java/org/wltea/analyzer/dic/Dictionary.java +7 -17
src/main/java/org/wltea/analyzer/dic/Hit.java +45 -9
src/main/java/org/wltea/analyzer/lucene/IKAnalyzer.java +10 -4
src/main/java/org/wltea/analyzer/lucene/IKQueryParser.java +0 -420
src/main/java/org/wltea/analyzer/lucene/IKSimilarity.java +0 -19
src/main/java/org/wltea/analyzer/lucene/IKTokenizer.java +97 -49
src/main/java/org/wltea/analyzer/query/IKQueryExpressionParser.java +716 -0
src/main/java/org/wltea/analyzer/query/SWMCQueryBuilder.java +153 -0
src/main/java/org/wltea/analyzer/sample/IKAnalzyerDemo.java +85 -0
src/main/java/org/wltea/analyzer/sample/LuceneIndexAndSearchDemo.java +147 -0
src/main/java/org/wltea/analyzer/seg/CJKSegmenter.java +0 -196
src/main/java/org/wltea/analyzer/seg/ISegmenter.java +0 -16
src/main/java/org/wltea/analyzer/seg/LetterSegmenter.java +0 -236
src/main/java/org/wltea/analyzer/seg/QuantifierSegmenter.java +0 -612
src/test/java/DictionaryTester.java +481 -481
src/test/java/IKAnalyzerDemo.java +97 -97
src/test/java/IKTokenerTest.java +5 -3
src/test/java/SegmentorTester.java +345 -345
src/test/java/extended/ik_dict/ext_stopwords/ext_stopword.dic +498 -1
README.textile

@@ -19,10 +19,14 @@ In order to install the plugin, simply run:
 <pre>
 cd bin
-plugin -install medcl/elasticsearch-analysis-ik/1.1.3
 </pre>
-also download the dict files, unzip these dict files to your elasticsearch's config folder, such as: your-es-root/config/ik
+now you can download this plugin from the RTF project (https://github.com/medcl/elasticsearch-rtf)
+https://github.com/medcl/elasticsearch-rtf/tree/master/elasticsearch/plugins/analysis-ik
+https://github.com/medcl/elasticsearch-rtf/tree/master/elasticsearch/config/ik
 <pre>
 cd config
...

@@ -62,12 +66,17 @@ index:
   ik:
     alias: [ik_analyzer]
     type: org.elasticsearch.index.analysis.IkAnalyzerProvider
+  ik_smart:
+    type: ik
+    use_smart: true
 </pre>
 Or
 <pre>
 index.analysis.analyzer.ik.type : "ik"
 </pre>
+you can set your preferred segment mode; the default `use_smart` is false.
 Mapping Configuration
 -------------
...
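The ik_smart example above is what this commit makes selectable: smart (coarse) versus fine-grained segmentation. As a rough illustration of what the flag changes, here is a hypothetical toy sketch over a five-word dictionary — not IK's actual algorithm: fine-grained mode emits every dictionary word found at every position, while smart mode keeps only one longest match per position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of the two modes (hypothetical code, not IK's implementation):
// fine-grained emits every dictionary word found; "smart" keeps only the
// longest match at each position and jumps past it.
public class ModeDemo {
    static final Set<String> DICT = new HashSet<>(
            Arrays.asList("中华", "中华人民", "人民", "共和国", "人民共和国"));

    static List<String> segment(String text, boolean useSmart) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String longest = null;
            for (int j = i + 1; j <= text.length(); j++) {
                String cand = text.substring(i, j);
                if (DICT.contains(cand)) {
                    if (!useSmart) {
                        out.add(cand);      // fine-grained: emit every match
                    }
                    longest = cand;         // remember the longest match so far
                }
            }
            if (useSmart && longest != null) {
                out.add(longest);           // smart: only the longest match
            }
            // smart mode skips past the match; fine-grained advances one char
            i += (useSmart && longest != null) ? longest.length() : 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("中华人民共和国", true));   // [中华人民, 共和国]
        System.out.println(segment("中华人民共和国", false));  // [中华, 中华人民, 人民, 人民共和国, 共和国]
    }
}
```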
config/ik/main.dic
一一列举
A股
B股
AB股
H股
K线
QQ宠物
QQ飞车
U盘
Hold住
一一列举
一一对应
一一道来
一丁
...
...
@@ -5334,12 +5343,10 @@
不买
不买账
不乱
不了
不了不当
不了了之
不了情
不了而了
不了解
不予
不予承认
不予理睬
...
...
@@ -10118,7 +10125,6 @@
个别辅导
个协
个唱
个大
个头
个头儿
个子
...
...
@@ -13619,6 +13625,7 @@
乌龙
乌龙球
乌龙茶
乌龙茶工作室
乌龙院
乌龙驹
乌龟
...
...
@@ -20471,6 +20478,7 @@
仕宦
仕进
仕途
他
他乡
他乡人
他乡异县
...
...
@@ -21047,7 +21055,6 @@
以其
以其人之道
以其人之道还治其人之身
以其人之道,还治其人之身
以其昏昏
以其昏昏使人昭昭
以其真正形式付款
...
...
@@ -21261,7 +21268,7 @@
以父之名
以牙还牙
以狸至鼠
以狸致鼠、以冰致绳
以冰致绳
以狸饵鼠
以玉抵乌
以玉抵鹊
...
...
@@ -24053,7 +24060,6 @@
住宅和
住宅小区
住宅布局
住宅建筑企划委员会
住宅建设
住宅房
住宅楼
...
...
@@ -25055,6 +25061,7 @@
佞笑
佞臣
佟湘玉
你
你一言
你一言我一语
你中有我
...
...
@@ -26323,7 +26330,6 @@
保卫人员
保卫和平
保卫国家
保卫国家主权和民族资源
保卫处
保卫工作
保卫战
...
...
@@ -27709,7 +27715,6 @@
倜傥不羁
倜傥不群
借东风
借东风丧偶案犯护
借个
借个火
借书
...
...
@@ -28560,6 +28565,7 @@
偕生之疾
偕老
偕行
做的
做一个
做一天和尚撞一天钟
做一套
...
...
@@ -31887,7 +31893,6 @@
全彩屏
全心
全心全意
全心全意为人民服务
全心投入
全总
全息
...
...
@@ -32209,7 +32214,6 @@
全面推行
全面提高
全面禁止
全面禁止和彻底销毁核武器
全面继承
全面落实
全面规划
...
...
@@ -32984,7 +32988,6 @@
公测
公测版
公海
公海海底海床和平利用特别委员会
公海自由
公演
公然
...
...
@@ -40772,6 +40775,7 @@
分香卖履
分驾
分龄
切
切上
切上去
切上来
...
...
@@ -40781,6 +40785,8 @@
切不
切不可
切丝
切的
切得
切个
切中
切中时弊
...
...
@@ -43344,7 +43350,6 @@
前事
前事不忘
前事不忘后事之师
前事不忘,后事之师
前五强
前些
前些天
...
...
@@ -45840,7 +45845,6 @@
劳动厅
劳动合同
劳动和社会保障部
劳动和社会保障部部长
劳动地域分工
劳动基准
劳动基准法
...
...
@@ -46696,7 +46700,6 @@
化学农药
化学分子
化学分析
化学分析电子能电子能谱谱学
化学剂
化学剂注入组块
化学剥蚀
...
...
@@ -46963,7 +46966,6 @@
北京市
北京市区
北京市委
北京市新方世纪科技有限公司
北京市民
北京师范大学
北京房
...
...
@@ -47429,7 +47431,6 @@
区分效度
区分法
区分符
区分能力倾向测验
区划
区划图
区别
...
...
@@ -47995,22 +47996,6 @@
十员
十周
十周年
十四
十四个
十四中
十四人
十四元
十四分
十四号
十四块
十四大
十四天
十四届
十四岁
十四日
十四时
十四行
十四行诗
十回
十团
十围五攻
...
...
@@ -48121,7 +48106,6 @@
十年教训
十年树木
十年树木百年树人
十年树木,百年树人
十年浩劫
十年生聚
十年生聚十年教训
...
...
@@ -63598,7 +63582,6 @@
和暖
和曲
和服
和服务
和村
和林格尔
和林格尔县
...
...
@@ -64478,7 +64461,7 @@
哑巴吃黄
哑巴吃黄莲
哑巴吃黄连
哑巴吃黄连,有苦说不出
哑巴吃黄连有苦说不出
哑弹
哑梢公
哑火
...
...
@@ -67026,7 +67009,6 @@
四出
四出戏
四出活动
四分
四分之一
四分之一波长变换器
四分之三
...
...
@@ -67036,7 +67018,6 @@
四分五落
四分五裂
四分天下
四分开
四分法
四分钟
四分音符
...
...
@@ -67048,34 +67029,7 @@
四化建设
四匹
四区
四十
四十一
四十一中
四十七
四十七中
四十万
四十三
四十三中
四十不惑
四十中
四十九
四十九中
四十二
四十二中
四十五
四十五中
四十八
四十八中
四十六
四十六中
四十四
四十四中
四千
四千万
四千个
四千人
四千元
四千块
四叔
四叠体
四口
...
...
@@ -69139,7 +69093,6 @@
国防科
国防科学技术
国防科学技术委员会
国防科学技术工业委员
国防科学技术工业委员会
国防科工委
国防科技
...
...
@@ -70263,10 +70216,6 @@
圣驾
圣骑士
圣龙魔袍
在一定历史条件下
在一定程度上
在一定范围内
在一般情况下
在一起
在一边
在三
...
...
@@ -81739,13 +81688,11 @@
奸邪
奸险
奸雄
她
她上
她上去
她上来
她下
她下去
她下来
她不
她不会
她不是
她与
...
...
@@ -98328,7 +98275,6 @@
平可夫
平台
平台梁
平台-海岸无线电系统
平和
平和县
平喉
...
...
@@ -111772,18 +111718,11 @@
意表
意见
意见书
意见分歧
意见反馈
意见建议
意见沟通
意见箱
意见簿
意见调查
意见调查表
意识
意识到
意识形态
意识形态领域
意识流
意译
意谓
...
...
@@ -113117,6 +113056,7 @@
成鱼
成龙
成龙配套
我
我为人人
我为你
我为歌狂
...
...
@@ -114159,6 +114099,7 @@
扉用
扉画
扉页
手
手三里
手上
手下
...
...
@@ -114806,7 +114747,6 @@
打保票
打信号
打倒
打倒日本帝国主义
打假
打先锋
打光
...
...
@@ -116630,7 +116570,6 @@
承建
承建商
承建方
承建项目
承当
承德
承德县
...
...
@@ -116645,13 +116584,7 @@
承担
承担义务
承担人
承担责任
承担费用
承担违约赔偿责任
承担重任
承担风险
承接
承接国内外
承揽
承教
承星履草
...
...
@@ -124773,7 +124706,6 @@
提供
提供优良服务
提供优质服务
提供午餐的走读学生
提供商
提供情报
提供援助
...
...
@@ -124987,30 +124919,8 @@
提领
提高
提高了
提高产品质量
提高产量
提高到
提高到一个新的阶段
提高到新的阶段
提高劳动效率
提高劳动生产率
提高单位面积产量
提高工作效率
提高技术
提高效率
提高效益
提高水平
提高班
提高生产率
提高生活水平
提高素质
提高经济效益
提高经济效益为中心
提高自学
提高觉悟
提高警惕
提高认识
提高质量
插一杠子
插一脚
插上
...
...
@@ -125029,12 +124939,9 @@
插值性质
插值逼近
插入
插入序列
插入式注水泥接箍
插入损耗
插入排序
插入方式
插入方法
插入法
插入物
插入者
...
...
@@ -126280,7 +126187,6 @@
摩尔气体常数
摩尔热容
摩尔维亚
摩尔质量排除极限
摩尔达维亚
摩崖
摩弄
...
...
@@ -130873,27 +130779,10 @@
文代会
文以载道
文件
文件事件
文件传输
文件传送、存取和管理
文件名
文件名扩展
文件名称
文件大小
文件夹
文件存储器
文件属性
文件批量
文件服务器
文件柜
文件格式
文件汇编
文件类型
文件精神
文件系统
文件组织
文件维护
文件翻译
文件袋
文传
文似其人
...
...
@@ -132227,11 +132116,9 @@
新一佳
新一季
新一届
新一届中央领导集体
新一期
新一波
新一轮
新一轮军备竞赛
新一集
新丁
新三样
...
...
@@ -132241,7 +132128,6 @@
新世界论坛
新世纪
新世纪福音战士
新世纪通行证
新东
新东安
新东家
...
...
@@ -132294,7 +132180,6 @@
新仙剑奇侠传
新任
新任务
新任国务院副总理
新会
新会区
新会县
...
...
@@ -133655,7 +133540,7 @@
旁观者
旁观者效应
旁观者清
旁观者清,当事者迷
当事者迷
旁证
旁证博引
旁路
...
...
@@ -134161,7 +134046,7 @@
无可否认
无可奈何
无可奈何花落去
无可奈何花落去似曾相似燕归来
无可奉告
无可如何
无可安慰
...
...
@@ -135407,15 +135292,7 @@
日已三竿
日币
日常
日常事务
日常工作
日常支出
日常清洁卫生管理
日常生活
日常生活型
日常用品
日常用语
日常行为
日异月新
日异月更
日异月殊
...
...
@@ -135515,7 +135392,6 @@
日本化
日本史
日本国
日本国际贸易促进会
日本天皇
日本女
日本妞
...
...
@@ -140213,6 +140089,7 @@
月黑杀人
月黑风高
月龄
有
有一利必有一弊
有一得一
有一手
...
...
@@ -141295,7 +141172,6 @@
望谟县
望远
望远镜
望都
望都县
望门
望门寡
...
...
@@ -142559,7 +142435,6 @@
本省人
本真
本着
本着实事求是的原则
本社
本社讯
本神
...
...
@@ -176021,7 +175896,6 @@
独桅
独桅艇
独此一家
独此一家别无分店
独步
独步一时
独步天下
...
...
@@ -179466,7 +179340,6 @@
生产关系
生产分离器
生产力
生产力与生产关系
生产力布局
生产劳动
生产单位
...
...
@@ -181082,7 +180955,6 @@
电子器件
电子器材
电子回旋共振加热
电子回旋共振加热化学专业词汇
电子图书
电子地图
电子城
...
...
@@ -184948,6 +184820,7 @@
皂隶
皂靴
皂鞋
的
的一确二
的人
的卡
...
...
@@ -187254,7 +187127,6 @@
省直辖县级行政单位
省直辖行政单位
省省
省福发股份有限公司
省科委
省称
省立
...
...
@@ -190793,23 +190665,14 @@
确守信义
确守合同
确定
确定会
确定和随机佩特里网
确定型上下文有关语言
确定性
确定性反褶积
确定时间
确定是
确定有
确定能
确实
确实会
确实可靠
确实在
确实性
确实是
确实有
确实能
确属
确山
确山县
...
...
@@ -198198,7 +198061,6 @@
第三关
第三册
第三军
第三十
第三卷
第三只
第三台
...
...
@@ -198274,8 +198136,6 @@
第九城市
第九天
第九届
第九届人民代表大会
第九届全国人民代表大会
第九期
第九条
第九次
...
...
@@ -198459,21 +198319,6 @@
第几章
第几节
第几课
第十
第十一
第十一届
第十七
第十三
第十个
第十个五年计划
第十九
第十二
第十二届
第十五
第十五次全国代表大会
第十位
第十八
第十六
第十册
第十卷
第十名
...
...
@@ -198492,7 +198337,6 @@
第十轮
第十部
第十集
第号
第四
第四个
第四产业
...
...
@@ -227177,6 +227021,7 @@
覆雨翻云
覆鹿寻蕉
覈实
见
见一面
见上图
见上帝
...
...
@@ -227214,6 +227059,8 @@
见仁见志
见仁见智
见你
见他
见她
见信
见信好
见光
...
...
@@ -231809,6 +231656,7 @@
诳诞
诳语
诳骗
说的
说一不二
说一些
说一声
...
...
@@ -240547,7 +240395,6 @@
软件网
软件能
软件设计
软件资产管理程序
软件资源
软件超市
软件部
...
...
@@ -242095,11 +241942,6 @@
达克罗
达克罗宁
达到
达到一个新的水平
达到历史最高水平
达到目标
达到顶点
达到高潮
达力达
达卡
达县
...
...
@@ -246993,33 +246835,11 @@
通迅
通过
通过了
通过会议
通过信号机
通过决议
通过去
通过参观
通过商量
通过培养
通过培训
通过外交途径进行谈判
通过学习
通过实践
通过审查
通过批评
通过教育
通过来
通过率
通过考察
通过考核
通过考试
通过能力
通过表演
通过观察
通过讨论
通过训练
通过议案
通过调查
通过鉴定
通运
通运公司
通进
...
...
@@ -247288,11 +247108,6 @@
造恶不悛
造成
造成了
造成危害
造成堕落
造成直接经济损失
造成真空
造扣
造斜工具
造斜点
造极登峰
...
...
@@ -250565,11 +250380,6 @@
采去
采及葑菲
采取
采取不正当手段
采取不正当的手段
采取协调行动
采取多种形式
采取措施
采回
采回去
采回来
...
...
@@ -250624,8 +250434,6 @@
采珠
采用
采用到
采用秘密窃取的手段
采用秘密窃取的方法
采石
采石厂
采石场
...
...
@@ -250633,9 +250441,6 @@
采矿
采矿业
采矿工
采矿工业
采矿工程
采矿方法
采矿权
采矿点
采砂船
...
...
@@ -264178,8 +263983,7 @@
面向农村
面向基层
面向对象分析
面向对象数据库语言
面向对象的体系结构
面向对象
面向市场
面向未来
面向现代化
...
...
@@ -270421,7 +270225,6 @@
高举深藏
高举着
高举远蹈
高举邓小平理论的伟大旗帜
高义
高义薄云
高义薄云天
config/ik/quantifier.dic
...
...
@@ -39,6 +39,7 @@
刀
分
分钟
分米
划
列
则
...
...
@@ -58,6 +59,7 @@
卷
厅
厘
厘米
双
发
口
...
...
@@ -144,7 +146,6 @@
把
折
担
拉
拍
招
拨
...
...
@@ -198,6 +199,9 @@
段
毛
毫
毫升
毫米
毫克
池
洲
派
...
...
pom.xml

@@ -6,7 +6,7 @@
 <modelVersion>4.0.0</modelVersion>
 <groupId>org.elasticsearch</groupId>
 <artifactId>elasticsearch-analysis-ik</artifactId>
-<version>1.1.3</version>
+<version>1.1.4</version>
 <packaging>jar</packaging>
 <description>IK Analyzer for ElasticSearch</description>
 <inceptionYear>2009</inceptionYear>
...

@@ -72,6 +72,11 @@
 <version>1.3.RC2</version>
 <scope>test</scope>
 </dependency>
+<dependency>
+    <groupId>junit</groupId>
+    <artifactId>junit</artifactId>
+    <version>4.10</version>
+</dependency>
 </dependencies>
 <build>
...
...
src/main/java/org/elasticsearch/index/analysis/IkAnalyzerProvider.java

 package org.elasticsearch.index.analysis;

 import org.apache.lucene.analysis.Analyzer;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.logging.ESLogger;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 import org.wltea.analyzer.dic.Dictionary;
 import org.wltea.analyzer.lucene.IKAnalyzer;
-import org.elasticsearch.common.logging.ESLogger;
 import org.elasticsearch.common.logging.Loggers;

 public class IkAnalyzerProvider extends AbstractIndexAnalyzerProvider<IKAnalyzer> {

     private final IKAnalyzer analyzer;

@@ -18,37 +15,19 @@ public class IkAnalyzerProvider extends AbstractIndexAnalyzerProvider<IKAnalyzer

     @Inject
     public IkAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env,
                               @Assisted String name, @Assisted Settings settings) {
         super(index, indexSettings, name, settings);
-        // logger = Loggers.getLogger("ik-analyzer");
-        //
-        // logger.info("[Setting] {}", settings.getAsMap().toString());
-        // logger.info("[Index Setting] {}", indexSettings.getAsMap().toString());
-        // logger.info("[Env Setting] {}", env.configFile());
-        analyzer = new IKAnalyzer(indexSettings);
+        analyzer = new IKAnalyzer(indexSettings, settings);
     }

-    /* @Override
-    public String name() {
-        return "ik";
-    }
-
-    @Override
-    public AnalyzerScope scope() {
-        return AnalyzerScope.INDEX;
-    } */

     public IkAnalyzerProvider(Index index, Settings indexSettings, String name, Settings settings) {
         super(index, indexSettings, name, settings);
-        analyzer = new IKAnalyzer(indexSettings);
+        analyzer = new IKAnalyzer(indexSettings, settings);
     }

     public IkAnalyzerProvider(Index index, Settings indexSettings, String prefixSettings, String name, Settings settings) {
         super(index, indexSettings, prefixSettings, name, settings);
-        analyzer = new IKAnalyzer(indexSettings);
+        analyzer = new IKAnalyzer(indexSettings, settings);
     }
...
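The substantive change in this provider is that every constructor now forwards the per-analyzer settings (the @Assisted Settings block, where use_smart lives) into IKAnalyzer alongside the index-level settings. A minimal sketch of that pattern using plain maps (hypothetical types, not the elasticsearch API):

```java
import java.util.Map;

// Hypothetical sketch of the pattern in this diff: the provider forwards the
// per-analyzer settings (where use_smart is declared) to the analyzer, instead
// of passing only the index-level settings.
public class ProviderSketch {
    static class Analyzer {
        final boolean useSmart;

        Analyzer(Map<String, String> indexSettings, Map<String, String> analyzerSettings) {
            // per-analyzer settings take precedence; default is false
            this.useSmart = Boolean.parseBoolean(
                    analyzerSettings.getOrDefault("use_smart",
                            indexSettings.getOrDefault("use_smart", "false")));
        }
    }

    public static void main(String[] args) {
        Map<String, String> index = Map.of();
        Map<String, String> ikSmart = Map.of("type", "ik", "use_smart", "true");
        System.out.println(new Analyzer(index, ikSmart).useSmart);   // true
    }
}
```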
src/main/java/org/wltea/analyzer/Context.java (deleted, 100644 → 0)
package org.wltea.analyzer;

import java.util.HashSet;
import java.util.Set;

import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.seg.ISegmenter;

public class Context {

    private boolean isMaxWordLength = false;
    private int buffOffset;
    private int available;
    private int lastAnalyzed;
    private int cursor;
    private char[] segmentBuff;
    private Set<ISegmenter> buffLocker;
    private IKSortedLinkSet lexemeSet;

    Context(char[] segmentBuff, boolean isMaxWordLength) {
        this.isMaxWordLength = isMaxWordLength;
        this.segmentBuff = segmentBuff;
        this.buffLocker = new HashSet<ISegmenter>(4);
        this.lexemeSet = new IKSortedLinkSet();
    }

    public void resetContext() {
        buffLocker.clear();
        lexemeSet = new IKSortedLinkSet();
        buffOffset = 0;
        available = 0;
        lastAnalyzed = 0;
        cursor = 0;
    }

    public boolean isMaxWordLength() {
        return isMaxWordLength;
    }

    public void setMaxWordLength(boolean isMaxWordLength) {
        this.isMaxWordLength = isMaxWordLength;
    }

    public int getBuffOffset() {
        return buffOffset;
    }

    public void setBuffOffset(int buffOffset) {
        this.buffOffset = buffOffset;
    }

    public int getLastAnalyzed() {
        return lastAnalyzed;
    }

    public void setLastAnalyzed(int lastAnalyzed) {
        this.lastAnalyzed = lastAnalyzed;
    }

    public int getCursor() {
        return cursor;
    }

    public void setCursor(int cursor) {
        this.cursor = cursor;
    }

    public void lockBuffer(ISegmenter segmenter) {
        this.buffLocker.add(segmenter);
    }

    public void unlockBuffer(ISegmenter segmenter) {
        this.buffLocker.remove(segmenter);
    }

    public boolean isBufferLocked() {
        return this.buffLocker.size() > 0;
    }

    public int getAvailable() {
        return available;
    }

    public void setAvailable(int available) {
        this.available = available;
    }

    public Lexeme firstLexeme() {
        return this.lexemeSet.pollFirst();
    }

    public Lexeme lastLexeme() {
        return this.lexemeSet.pollLast();
    }

    public void addLexeme(Lexeme lexeme) {
        if (!Dictionary.isStopWord(segmentBuff, lexeme.getBegin(), lexeme.getLength())) {
            this.lexemeSet.addLexeme(lexeme);
        }
    }

    public int getResultSize() {
        return this.lexemeSet.size();
    }

    public void excludeOverlap() {
        this.lexemeSet.excludeOverlap();
    }

    private class IKSortedLinkSet {
        private Lexeme head;
        private Lexeme tail;
        private int size;

        private IKSortedLinkSet() {
            this.size = 0;
        }

        private void addLexeme(Lexeme lexeme) {
            if (this.size == 0) {
                this.head = lexeme;
                this.tail = lexeme;
                this.size++;
                return;
            } else {
                if (this.tail.compareTo(lexeme) == 0) {
                    return;
                } else if (this.tail.compareTo(lexeme) < 0) {
                    this.tail.setNext(lexeme);
                    lexeme.setPrev(this.tail);
                    this.tail = lexeme;
                    this.size++;
                    return;
                } else if (this.head.compareTo(lexeme) > 0) {
                    this.head.setPrev(lexeme);
                    lexeme.setNext(this.head);
                    this.head = lexeme;
                    this.size++;
                    return;
                } else {
                    Lexeme l = this.tail;
                    while (l != null && l.compareTo(lexeme) > 0) {
                        l = l.getPrev();
                    }
                    if (l.compareTo(lexeme) == 0) {
                        return;
                    } else if (l.compareTo(lexeme) < 0) {
                        lexeme.setPrev(l);
                        lexeme.setNext(l.getNext());
                        l.getNext().setPrev(lexeme);
                        l.setNext(lexeme);
                        this.size++;
                        return;
                    }
                }
            }
        }

        private Lexeme pollFirst() {
            if (this.size == 1) {
                Lexeme first = this.head;
                this.head = null;
                this.tail = null;
                this.size--;
                return first;
            } else if (this.size > 1) {
                Lexeme first = this.head;
                this.head = first.getNext();
                first.setNext(null);
                this.size--;
                return first;
            } else {
                return null;
            }
        }

        private Lexeme pollLast() {
            if (this.size == 1) {
                Lexeme last = this.head;
                this.head = null;
                this.tail = null;
                this.size--;
                return last;
            } else if (this.size > 1) {
                Lexeme last = this.tail;
                this.tail = last.getPrev();
                last.setPrev(null);
                this.size--;
                return last;
            } else {
                return null;
            }
        }

        private void excludeOverlap() {
            if (this.size > 1) {
                Lexeme one = this.head;
                Lexeme another = one.getNext();
                do {
                    if (one.isOverlap(another)
                            && Lexeme.TYPE_CJK_NORMAL == one.getLexemeType()
                            && Lexeme.TYPE_CJK_NORMAL == another.getLexemeType()) {
                        another = another.getNext();
                        one.setNext(another);
                        if (another != null) {
                            another.setPrev(one);
                        }
                        this.size--;
                    } else {
                        one = another;
                        another = another.getNext();
                    }
                } while (another != null);
            }
        }

        private int size() {
            return this.size;
        }
    }
}
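Context's inner IKSortedLinkSet above is a sorted doubly-linked list that drops duplicate lexemes on insert and is consumed from the head. A self-contained sketch of the same discipline with ints (a simplification; the real class orders Lexeme objects via compareTo):

```java
// Simplified, self-contained sketch of the sorted doubly-linked set idea used
// by Context.IKSortedLinkSet: insert keeps ascending order and drops duplicates;
// pollFirst consumes from the head.
public class SortedLinkSet {
    static class Node {
        int value;
        Node prev, next;
        Node(int value) { this.value = value; }
    }

    private Node head, tail;
    private int size;

    void add(int v) {
        Node n = new Node(v);
        if (size == 0) { head = tail = n; size++; return; }
        if (tail.value == v) return;                  // duplicate of tail: drop
        if (tail.value < v) {                         // append at tail
            tail.next = n; n.prev = tail; tail = n; size++; return;
        }
        if (head.value > v) {                         // prepend at head
            head.prev = n; n.next = head; head = n; size++; return;
        }
        Node l = tail;                                // walk back to insertion point
        while (l != null && l.value > v) l = l.prev;
        if (l.value == v) return;                     // interior duplicate: drop
        n.prev = l; n.next = l.next;
        l.next.prev = n; l.next = n; size++;
    }

    Integer pollFirst() {
        if (size == 0) return null;
        Node first = head;
        head = first.next;
        if (head == null) tail = null; else head.prev = null;
        size--;
        return first.value;
    }

    int size() { return size; }

    public static void main(String[] args) {
        SortedLinkSet s = new SortedLinkSet();
        for (int v : new int[]{5, 1, 3, 3, 9}) s.add(v);
        StringBuilder sb = new StringBuilder();
        for (Integer v = s.pollFirst(); v != null; v = s.pollFirst()) sb.append(v).append(' ');
        System.out.println(sb.toString().trim());   // 1 3 5 9
    }
}
```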
src/main/java/org/wltea/analyzer/IKSegmentation.java (deleted, 100644 → 0)
/**
 *
 */
package org.wltea.analyzer;

import java.io.IOException;
import java.io.Reader;
import java.util.List;

import org.wltea.analyzer.cfg.Configuration;
import org.wltea.analyzer.help.CharacterHelper;
import org.wltea.analyzer.seg.ISegmenter;

public final class IKSegmentation {

    private Reader input;
    private static final int BUFF_SIZE = 3072;
    private static final int BUFF_EXHAUST_CRITICAL = 48;
    private char[] segmentBuff;
    private Context context;
    private List<ISegmenter> segmenters;

    public IKSegmentation(Reader input) {
        this(input, false);
    }

    public IKSegmentation(Reader input, boolean isMaxWordLength) {
        this.input = input;
        segmentBuff = new char[BUFF_SIZE];
        context = new Context(segmentBuff, isMaxWordLength);
        segmenters = Configuration.loadSegmenter();
    }

    public synchronized Lexeme next() throws IOException {
        if (context.getResultSize() == 0) {
            /*
             * Read data from the reader into the buffer.
             * When the reader is consumed in several passes, the buffer is shifted
             * so that data read last time but not yet processed is preserved.
             */
            int available = fillBuffer(input);
            if (available <= 0) {
                context.resetContext();
                return null;
            } else {
                int buffIndex = 0;
                for (; buffIndex < available; buffIndex++) {
                    context.setCursor(buffIndex);
                    segmentBuff[buffIndex] = CharacterHelper.regularize(segmentBuff[buffIndex]);
                    for (ISegmenter segmenter : segmenters) {
                        segmenter.nextLexeme(segmentBuff, context);
                    }
                    /*
                     * Break the loop (the buffer must be shifted and refilled) when all of:
                     * 1. available == BUFF_SIZE: the buffer is full
                     * 2. buffIndex < available - 1 && buffIndex > available - BUFF_EXHAUST_CRITICAL:
                     *    the cursor is inside the critical tail region
                     * 3. !context.isBufferLocked(): no segmenter is occupying the buffer
                     */
                    if (available == BUFF_SIZE
                            && buffIndex < available - 1
                            && buffIndex > available - BUFF_EXHAUST_CRITICAL
                            && !context.isBufferLocked()) {
                        break;
                    }
                }
                for (ISegmenter segmenter : segmenters) {
                    segmenter.reset();
                }
                context.setLastAnalyzed(buffIndex);
                context.setBuffOffset(context.getBuffOffset() + buffIndex);
                if (context.isMaxWordLength()) {
                    context.excludeOverlap();
                }
                return buildLexeme(context.firstLexeme());
            }
        } else {
            return buildLexeme(context.firstLexeme());
        }
    }

    private int fillBuffer(Reader reader) throws IOException {
        int readCount = 0;
        if (context.getBuffOffset() == 0) {
            readCount = reader.read(segmentBuff);
        } else {
            int offset = context.getAvailable() - context.getLastAnalyzed();
            if (offset > 0) {
                System.arraycopy(segmentBuff, context.getLastAnalyzed(), this.segmentBuff, 0, offset);
                readCount = offset;
            }
            readCount += reader.read(segmentBuff, offset, BUFF_SIZE - offset);
        }
        context.setAvailable(readCount);
        return readCount;
    }

    private Lexeme buildLexeme(Lexeme lexeme) {
        if (lexeme != null) {
            lexeme.setLexemeText(String.valueOf(segmentBuff, lexeme.getBegin(), lexeme.getLength()));
            return lexeme;
        } else {
            return null;
        }
    }

    public synchronized void reset(Reader input) {
        this.input = input;
        context.resetContext();
        for (ISegmenter segmenter : segmenters) {
            segmenter.reset();
        }
    }
}
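fillBuffer above implements a sliding window over the Reader: the tail that was read but not yet analyzed is copied back to the head of the buffer before the next read fills the remainder. A standalone sketch of that refill step (illustrative names; unlike the original it also guards against read() returning -1 at EOF):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Standalone sketch of IKSegmentation.fillBuffer's sliding-window refill:
// the unprocessed tail is shifted to the front of the buffer with
// System.arraycopy, then the remainder is filled from the Reader.
public class SlidingBuffer {
    private final char[] buff;
    private int lastAnalyzed;   // how much of the last fill was consumed
    private int available;      // valid chars currently in the buffer
    private boolean first = true;

    SlidingBuffer(int size) { this.buff = new char[size]; }

    // Marks how far analysis got before the next refill.
    void consume(int n) { this.lastAnalyzed = n; }

    int fill(Reader reader) throws IOException {
        int readCount = 0;
        if (first) {
            readCount = reader.read(buff);
            first = false;
        } else {
            int offset = available - lastAnalyzed;
            if (offset > 0) {
                // shift the unprocessed tail to the head of the buffer
                System.arraycopy(buff, lastAnalyzed, buff, 0, offset);
                readCount = offset;
            }
            int n = reader.read(buff, offset, buff.length - offset);
            if (n > 0) readCount += n;   // guard against EOF (-1), unlike the original
        }
        available = Math.max(readCount, 0);
        return available;
    }

    String window() { return new String(buff, 0, available); }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("abcdefghij");
        SlidingBuffer sb = new SlidingBuffer(4);
        sb.fill(r);                        // buffer: "abcd"
        sb.consume(3);                     // suppose only "abc" was analyzed
        sb.fill(r);                        // "d" shifts to front, then "efg" is read
        System.out.println(sb.window());   // defg
    }
}
```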
src/main/java/org/wltea/analyzer/cfg/Configuration.java
@@ -7,10 +7,6 @@ import org.elasticsearch.common.logging.ESLogger;
 import org.elasticsearch.common.logging.Loggers;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
-import org.wltea.analyzer.seg.CJKSegmenter;
-import org.wltea.analyzer.seg.ISegmenter;
-import org.wltea.analyzer.seg.LetterSegmenter;
-import org.wltea.analyzer.seg.QuantifierSegmenter;
 import java.io.*;
 import java.util.ArrayList;
...

@@ -18,8 +14,6 @@ import java.util.InvalidPropertiesFormatException;
 import java.util.List;
 import java.util.Properties;
-import static org.wltea.analyzer.dic.Dictionary.getInstance;

 public class Configuration {

     private static String FILE_NAME = "ik/IKAnalyzer.cfg.xml";
...

@@ -27,6 +21,10 @@ public class Configuration {
     private static final String EXT_STOP = "ext_stopwords";
     private static ESLogger logger = null;
     private Properties props;
+    /*
+     * whether to segment in smart mode
+     */
+    private boolean useSmart = true;

     public Configuration(Settings settings) {
...

@@ -34,7 +32,8 @@ public class Configuration {
         props = new Properties();
         Environment environment = new Environment(settings);
         File fileConfig = new File(environment.configFile(), FILE_NAME);
-        InputStream input = null; // Configuration.class.getResourceAsStream(FILE_NAME);
+        InputStream input = null;
+        try {
+            input = new FileInputStream(fileConfig);
+        } catch (FileNotFoundException e) {
...

@@ -52,7 +51,27 @@ public class Configuration {
     }

+    /**
+     * Return the useSmart flag.
+     * useSmart = true: the segmenter uses the smart (coarse) strategy; false: fine-grained segmentation
+     * @return useSmart
+     */
+    public boolean useSmart() {
+        return useSmart;
+    }
+
+    /**
+     * Set the useSmart flag.
+     * useSmart = true: smart strategy; false: fine-grained segmentation
+     * @param useSmart
+     */
+    public void setUseSmart(boolean useSmart) {
+        this.useSmart = useSmart;
+    }
+
     public List<String> getExtDictionarys() {
         List<String> extDictFiles = new ArrayList<String>(2);
         String extDictCfg = props.getProperty(EXT_DICT);
         if (extDictCfg != null) {
...

@@ -89,13 +108,4 @@ public class Configuration {
         }
         return extStopWordDictFiles;
     }
-
-    public static List<ISegmenter> loadSegmenter() {
-        getInstance();
-        List<ISegmenter> segmenters = new ArrayList<ISegmenter>(4);
-        segmenters.add(new QuantifierSegmenter());
-        segmenters.add(new LetterSegmenter());
-        segmenters.add(new CJKSegmenter());
-        return segmenters;
-    }
 }
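Configuration now carries the useSmart flag with a getter/setter (the field defaults to true, while the plugin-level use_smart setting is documented in the README as defaulting to false). A minimal sketch of wiring such a boolean from a settings source into a config object — java.util.Properties stands in for elasticsearch Settings, and the class name is hypothetical:

```java
import java.util.Properties;

// Minimal sketch (not the real Configuration class) of carrying a use_smart
// flag from a settings source into a config object; the plugin setting is
// documented as "use_smart" with default false.
public class IkConfigSketch {
    private boolean useSmart;

    IkConfigSketch(Properties settings) {
        this.useSmart = Boolean.parseBoolean(settings.getProperty("use_smart", "false"));
    }

    public boolean useSmart() { return useSmart; }

    public void setUseSmart(boolean useSmart) { this.useSmart = useSmart; }

    public static void main(String[] args) {
        Properties p = new Properties();
        IkConfigSketch def = new IkConfigSketch(p);   // flag absent: false
        p.setProperty("use_smart", "true");
        IkConfigSketch smart = new IkConfigSketch(p);
        System.out.println(def.useSmart() + " " + smart.useSmart());   // false true
    }
}
```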
src/main/java/org/wltea/analyzer/core/AnalyzeContext.java (new file, 0 → 100644)
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package
org.wltea.analyzer.core
;
import
org.wltea.analyzer.dic.Dictionary
;
import
java.io.IOException
;
import
java.io.Reader
;
import
java.util.*
;
/**
*
* 分词器上下文状态
*
*/
class
AnalyzeContext
{
//默认缓冲区大小
private
static
final
int
BUFF_SIZE
=
4096
;
//缓冲区耗尽的临界值
private
static
final
int
BUFF_EXHAUST_CRITICAL
=
100
;
//字符窜读取缓冲
private
char
[]
segmentBuff
;
//字符类型数组
private
int
[]
charTypes
;
//记录Reader内已分析的字串总长度
//在分多段分析词元时,该变量累计当前的segmentBuff相对于reader起始位置的位移
private
int
buffOffset
;
//当前缓冲区位置指针
private
int
cursor
;
//最近一次读入的,可处理的字串长度
private
int
available
;
//子分词器锁
//该集合非空,说明有子分词器在占用segmentBuff
private
Set
<
String
>
buffLocker
;
//原始分词结果集合,未经歧义处理
private
QuickSortSet
orgLexemes
;
//LexemePath位置索引表
private
Map
<
Integer
,
LexemePath
>
pathMap
;
//最终分词结果集
private
LinkedList
<
Lexeme
>
results
;
//分词器配置项
private
boolean
useSmart
;
public
AnalyzeContext
(
boolean
useSmart
){
this
.
useSmart
=
useSmart
;
this
.
segmentBuff
=
new
char
[
BUFF_SIZE
];
this
.
charTypes
=
new
int
[
BUFF_SIZE
];
this
.
buffLocker
=
new
HashSet
<
String
>();
this
.
orgLexemes
=
new
QuickSortSet
();
this
.
pathMap
=
new
HashMap
<
Integer
,
LexemePath
>();
this
.
results
=
new
LinkedList
<
Lexeme
>();
}
int
getCursor
(){
return
this
.
cursor
;
}
//
// void setCursor(int cursor){
// this.cursor = cursor;
// }
char
[]
getSegmentBuff
(){
return
this
.
segmentBuff
;
}
char
getCurrentChar
(){
return
this
.
segmentBuff
[
this
.
cursor
];
}
int
getCurrentCharType
(){
return
this
.
charTypes
[
this
.
cursor
];
}
int
getBufferOffset
(){
return
this
.
buffOffset
;
}
/**
* 根据context的上下文情况,填充segmentBuff
* @param reader
* @return 返回待分析的(有效的)字串长度
* @throws IOException
*/
int
fillBuffer
(
Reader
reader
)
throws
IOException
{
int
readCount
=
0
;
if
(
this
.
buffOffset
==
0
){
//首次读取reader
readCount
=
reader
.
read
(
segmentBuff
);
}
else
{
int
offset
=
this
.
available
-
this
.
cursor
;
if
(
offset
>
0
){
//最近一次读取的>最近一次处理的,将未处理的字串拷贝到segmentBuff头部
System
.
arraycopy
(
this
.
segmentBuff
,
this
.
cursor
,
this
.
segmentBuff
,
0
,
offset
);
readCount
=
offset
;
}
//继续读取reader ,以onceReadIn - onceAnalyzed为起始位置,继续填充segmentBuff剩余的部分
readCount
+=
reader
.
read
(
this
.
segmentBuff
,
offset
,
BUFF_SIZE
-
offset
);
}
//记录最后一次从Reader中读入的可用字符长度
this
.
available
=
readCount
;
//重置当前指针
this
.
cursor
=
0
;
return
readCount
;
}
/**
* 初始化buff指针,处理第一个字符
*/
void
initCursor
(){
this
.
cursor
=
0
;
this
.
segmentBuff
[
this
.
cursor
]
=
CharacterUtil
.
regularize
(
this
.
segmentBuff
[
this
.
cursor
]);
this
.
charTypes
[
this
.
cursor
]
=
CharacterUtil
.
identifyCharType
(
this
.
segmentBuff
[
this
.
cursor
]);
}
/**
* 指针+1
* 成功返回 true; 指针已经到了buff尾部,不能前进,返回false
* 并处理当前字符
*/
boolean
moveCursor
(){
if
(
this
.
cursor
<
this
.
available
-
1
){
this
.
cursor
++;
this
.
segmentBuff
[
this
.
cursor
]
=
CharacterUtil
.
regularize
(
this
.
segmentBuff
[
this
.
cursor
]);
this
.
charTypes
[
this
.
cursor
]
=
CharacterUtil
.
identifyCharType
(
this
.
segmentBuff
[
this
.
cursor
]);
return
true
;
}
else
{
return
false
;
}
}
/**
* 设置当前segmentBuff为锁定状态
* 加入占用segmentBuff的子分词器名称,表示占用segmentBuff
* @param segmenterName
*/
void
lockBuffer
(
String
segmenterName
){
this
.
buffLocker
.
add
(
segmenterName
);
}
/**
* 移除指定的子分词器名,释放对segmentBuff的占用
* @param segmenterName
*/
void
unlockBuffer
(
String
segmenterName
){
this
.
buffLocker
.
remove
(
segmenterName
);
}
/**
* 只要buffLocker中存在segmenterName
* 则buffer被锁定
* @return boolean 缓冲去是否被锁定
*/
boolean
isBufferLocked
(){
return
this
.
buffLocker
.
size
()
>
0
;
}
/**
* 判断当前segmentBuff是否已经用完
* 当前执针cursor移至segmentBuff末端this.available - 1
* @return
*/
boolean
isBufferConsumed
(){
return
this
.
cursor
==
this
.
available
-
1
;
}
/**
* 判断segmentBuff是否需要读取新数据
*
* 满足一下条件时,
* 1.available == BUFF_SIZE 表示buffer满载
* 2.buffIndex < available - 1 && buffIndex > available - BUFF_EXHAUST_CRITICAL表示当前指针处于临界区内
* 3.!context.isBufferLocked()表示没有segmenter在占用buffer
* 要中断当前循环(buffer要进行移位,并再读取数据的操作)
* @return
*/
boolean
needRefillBuffer
(){
return
this
.
available
==
BUFF_SIZE
&&
this
.
cursor
<
this
.
available
-
1
&&
this
.
cursor
>
this
.
available
-
BUFF_EXHAUST_CRITICAL
&&
!
this
.
isBufferLocked
();
}
/**
 * Accumulates the offset of the current segmentBuff relative to the
 * start of the reader.
 */
void markBufferOffset(){
    this.buffOffset += this.cursor;
}

/**
 * Adds a lexeme to the raw segmentation result set.
 * @param lexeme
 */
void addLexeme(Lexeme lexeme){
    this.orgLexemes.addLexeme(lexeme);
}

/**
 * Adds a segmentation result path.
 * Mapping: path start position ---> path
 * @param path
 */
void addLexemePath(LexemePath path){
    if(path != null){
        this.pathMap.put(path.getPathBegin(), path);
    }
}

/**
 * Returns the raw segmentation results.
 * @return
 */
QuickSortSet getOrgLexemes(){
    return this.orgLexemes;
}

/**
 * Pushes segmentation results to the output set.
 * 1. Walk the buffer from its head to the processed position (this.cursor).
 * 2. Push lexemes found in the path map into results.
 * 3. Push CJK characters not covered by the map into results as single characters.
 */
void outputToResult(){
    int index = 0;
    for( ; index <= this.cursor ;){
        // Skip non-CJK characters
        if(CharacterUtil.CHAR_USELESS == this.charTypes[index]){
            index++;
            continue;
        }
        // Look up the LexemePath starting at this index
        LexemePath path = this.pathMap.get(index);
        if(path != null){
            // Emit the lexemes of this LexemePath into the results set
            Lexeme l = path.pollFirst();
            while(l != null){
                this.results.add(l);
                // Move index past the lexeme
                index = l.getBegin() + l.getLength();
                l = path.pollFirst();
                if(l != null){
                    // Emit the single characters skipped between lexemes inside the path
                    for(; index < l.getBegin(); index++){
                        this.outputSingleCJK(index);
                    }
                }
            }
        }else{
            // No LexemePath for this index:
            // emit the character as a single-character lexeme
            this.outputSingleCJK(index);
            index++;
        }
    }
    // Clear the path map
    this.pathMap.clear();
}

/**
 * Emits a CJK character as a single-character lexeme.
 * @param index
 */
private void outputSingleCJK(int index){
    if(CharacterUtil.CHAR_CHINESE == this.charTypes[index]){
        Lexeme singleCharLexeme = new Lexeme(this.buffOffset, index, 1, Lexeme.TYPE_CNCHAR);
        this.results.add(singleCharLexeme);
    }else if(CharacterUtil.CHAR_OTHER_CJK == this.charTypes[index]){
        Lexeme singleCharLexeme = new Lexeme(this.buffOffset, index, 1, Lexeme.TYPE_OTHER_CJK);
        this.results.add(singleCharLexeme);
    }
}
/**
 * Returns the next lexeme.
 *
 * Also performs compound merging.
 * @return
 */
Lexeme getNextLexeme(){
    // Take (and remove) the first lexeme from the results set
    Lexeme result = this.results.pollFirst();
    while(result != null){
        // Merge numerals and classifiers
        this.compound(result);
        if(Dictionary.isStopWord(this.segmentBuff, result.getBegin(), result.getLength())){
            // Stopword: move on to the next lexeme in the list
            result = this.results.pollFirst();
        }else{
            // Not a stopword: fill in the lexeme text and emit it
            result.setLexemeText(String.valueOf(segmentBuff, result.getBegin(), result.getLength()));
            break;
        }
    }
    return result;
}

/**
 * Resets the analysis context state.
 */
void reset(){
    this.buffLocker.clear();
    this.orgLexemes = new QuickSortSet();
    this.available = 0;
    this.buffOffset = 0;
    this.charTypes = new int[BUFF_SIZE];
    this.cursor = 0;
    this.results.clear();
    this.segmentBuff = new char[BUFF_SIZE];
    this.pathMap.clear();
}
/**
 * Compound merging of lexemes (numerals and classifiers).
 */
private void compound(Lexeme result){
    if(!this.useSmart){
        return;
    }
    // Merge numerals with classifiers
    if(!this.results.isEmpty()){
        if(Lexeme.TYPE_ARABIC == result.getLexemeType()){
            Lexeme nextLexeme = this.results.peekFirst();
            boolean appendOk = false;
            if(Lexeme.TYPE_CNUM == nextLexeme.getLexemeType()){
                // Merge an Arabic numeral with a Chinese numeral
                appendOk = result.append(nextLexeme, Lexeme.TYPE_CNUM);
            }else if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
                // Merge an Arabic numeral with a Chinese classifier
                appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
            }
            if(appendOk){
                // Remove the merged lexeme
                this.results.pollFirst();
            }
        }

        // A second round of merging may apply
        if(Lexeme.TYPE_CNUM == result.getLexemeType() && !this.results.isEmpty()){
            Lexeme nextLexeme = this.results.peekFirst();
            boolean appendOk = false;
            if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
                // Merge a Chinese numeral with a Chinese classifier
                appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
            }
            if(appendOk){
                // Remove the merged lexeme
                this.results.pollFirst();
            }
        }
    }
}
}
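The two merge rounds above can be illustrated with a standalone sketch. `Token` here is a hypothetical stand-in for `Lexeme`, and the type constants merely mirror the `TYPE_ARABIC` / `TYPE_CNUM` / `TYPE_COUNT` / `TYPE_CQUAN` roles; the real `Lexeme.append` also checks positional adjacency, which this sketch omits:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of compound(): an Arabic numeral absorbs a following
// Chinese numeral or classifier; a Chinese numeral absorbs a following classifier.
public class CompoundDemo {
    public static final int ARABIC = 1, CNUM = 2, COUNT = 3, CQUAN = 4;

    public static class Token {
        public final String text;
        public final int type;
        public Token(String text, int type) { this.text = text; this.type = type; }
    }

    public static Token compound(Token head, Deque<Token> rest) {
        Token next = rest.peekFirst();
        if (next == null) return head;
        if (head.type == ARABIC && (next.type == CNUM || next.type == COUNT)) {
            rest.pollFirst();
            int newType = next.type == CNUM ? CNUM : CQUAN;
            // Recurse once: a merged Chinese numeral may still absorb a classifier
            return compound(new Token(head.text + next.text, newType), rest);
        }
        if (head.type == CNUM && next.type == COUNT) {
            rest.pollFirst();
            return new Token(head.text + next.text, CQUAN);
        }
        return head;
    }

    public static void main(String[] args) {
        Deque<Token> rest = new ArrayDeque<Token>();
        rest.add(new Token("万", CNUM));  // Chinese numeral
        rest.add(new Token("人", COUNT)); // Chinese classifier
        Token merged = compound(new Token("3", ARABIC), rest);
        System.out.println(merged.text + " type=" + merged.type); // "3万人" as one token
    }
}
```

With smart mode enabled, "3万人" thus surfaces as a single numeral-plus-classifier token instead of three fragments.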
src/main/java/org/wltea/analyzer/core/CJKSegmenter.java
0 → 100644
/**
 * IK Chinese Analyzer, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 */
package org.wltea.analyzer.core;

import java.util.LinkedList;
import java.util.List;

import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.dic.Hit;

/**
 * Sub-segmenter for Chinese and other CJK (Japanese/Korean) text.
 */
class CJKSegmenter implements ISegmenter {

    // Sub-segmenter name
    static final String SEGMENTER_NAME = "CJK_SEGMENTER";
    // Queue of dictionary hits still being matched
    private List<Hit> tmpHits;

    CJKSegmenter(){
        this.tmpHits = new LinkedList<Hit>();
    }

    /* (non-Javadoc)
     * @see org.wltea.analyzer.core.ISegmenter#analyze(org.wltea.analyzer.core.AnalyzeContext)
     */
    public void analyze(AnalyzeContext context) {
        if(CharacterUtil.CHAR_USELESS != context.getCurrentCharType()){
            // First continue matching the hits already queued in tmpHits
            if(!this.tmpHits.isEmpty()){
                // Process the queued word segments
                Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);
                for(Hit hit : tmpArray){
                    hit = Dictionary.matchInMainDictWithHit(context.getSegmentBuff(), context.getCursor(), hit);
                    if(hit.isMatch()){
                        // Emit the current word
                        Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(),
                                context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_CNWORD);
                        context.addLexeme(newLexeme);

                        if(!hit.isPrefix()){
                            // Not a word prefix: no further matching needed, remove the hit
                            this.tmpHits.remove(hit);
                        }
                    }else if(hit.isUnmatch()){
                        // The hit is not a word: remove it
                        this.tmpHits.remove(hit);
                    }
                }
            }

            //*********************************
            // Then match the single character at the current cursor position
            Hit singleCharHit = Dictionary.matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
            if(singleCharHit.isMatch()){
                // The single character is itself a word: emit it
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
                context.addLexeme(newLexeme);

                // It may also be a word prefix
                if(singleCharHit.isPrefix()){
                    // Queue the prefix hit for further matching
                    this.tmpHits.add(singleCharHit);
                }
            }else if(singleCharHit.isPrefix()){
                // The character is a word prefix: queue the hit
                this.tmpHits.add(singleCharHit);
            }

        }else{
            // A CHAR_USELESS character: clear the queue
            this.tmpHits.clear();
        }

        // If the buffer has been consumed, clear the queue
        if(context.isBufferConsumed()){
            this.tmpHits.clear();
        }

        // Lock or unlock the buffer depending on pending hits
        if(this.tmpHits.size() == 0){
            context.unlockBuffer(SEGMENTER_NAME);
        }else{
            context.lockBuffer(SEGMENTER_NAME);
        }
    }

    /* (non-Javadoc)
     * @see org.wltea.analyzer.core.ISegmenter#reset()
     */
    public void reset() {
        // Clear the queue
        this.tmpHits.clear();
    }
}
src/main/java/org/wltea/analyzer/core/CN_QuantifierSegmenter.java
0 → 100644
/**
 * IK Chinese Analyzer, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 */
package org.wltea.analyzer.core;

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.dic.Hit;

/**
 * Sub-segmenter for Chinese numerals and classifiers (measure words).
 */
class CN_QuantifierSegmenter implements ISegmenter {

    // Sub-segmenter name
    static final String SEGMENTER_NAME = "QUAN_SEGMENTER";

    // Chinese numeral characters
    private static String Chn_Num = "一二两三四五六七八九十零壹贰叁肆伍陆柒捌玖拾百千万亿拾佰仟萬億兆卅廿"; // Cnum
    private static Set<Character> ChnNumberChars = new HashSet<Character>();
    static {
        char[] ca = Chn_Num.toCharArray();
        for(char nChar : ca){
            ChnNumberChars.add(nChar);
        }
    }

    /*
     * Start position of the lexeme, doubling as the segmenter's state flag:
     * when start > -1 the segmenter is currently processing characters.
     */
    private int nStart;

    /*
     * End position of the lexeme: the last valid numeral position seen so far.
     */
    private int nEnd;

    // Queue of classifier hits still being matched
    private List<Hit> countHits;

    CN_QuantifierSegmenter(){
        nStart = -1;
        nEnd = -1;
        this.countHits = new LinkedList<Hit>();
    }

    /**
     * Segmentation
     */
    public void analyze(AnalyzeContext context) {
        // Process Chinese numerals
        this.processCNumber(context);
        // Process Chinese classifiers
        this.processCount(context);

        // Lock or unlock the buffer
        if(this.nStart == -1 && this.nEnd == -1 && countHits.isEmpty()){
            // Unlock the buffer
            context.unlockBuffer(SEGMENTER_NAME);
        }else{
            context.lockBuffer(SEGMENTER_NAME);
        }
    }

    /**
     * Resets the sub-segmenter state.
     */
    public void reset() {
        nStart = -1;
        nEnd = -1;
        countHits.clear();
    }

    /**
     * Processes Chinese numerals.
     */
    private void processCNumber(AnalyzeContext context){
        if(nStart == -1 && nEnd == -1){
            // Initial state
            if(CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()
                    && ChnNumberChars.contains(context.getCurrentChar())){
                // Record the start and end positions of the numeral
                nStart = context.getCursor();
                nEnd = context.getCursor();
            }
        }else{
            // Processing state
            if(CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()
                    && ChnNumberChars.contains(context.getCurrentChar())){
                // Record the end position of the numeral
                nEnd = context.getCursor();
            }else{
                // Emit the numeral
                this.outputNumLexeme(context);
                // Reset the start and end pointers
                nStart = -1;
                nEnd = -1;
            }
        }

        // The buffer is consumed but a numeral is still pending
        if(context.isBufferConsumed()){
            if(nStart != -1 && nEnd != -1){
                // Emit the numeral
                outputNumLexeme(context);
                // Reset the start and end pointers
                nStart = -1;
                nEnd = -1;
            }
        }
    }

    /**
     * Processes Chinese classifiers.
     * @param context
     */
    private void processCount(AnalyzeContext context){
        // Decide whether a classifier scan is needed at all
        if(!this.needCountScan(context)){
            return;
        }

        if(CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()){
            // First continue matching the hits already queued in countHits
            if(!this.countHits.isEmpty()){
                // Process the queued word segments
                Hit[] tmpArray = this.countHits.toArray(new Hit[this.countHits.size()]);
                for(Hit hit : tmpArray){
                    hit = Dictionary.matchInMainDictWithHit(context.getSegmentBuff(), context.getCursor(), hit);
                    if(hit.isMatch()){
                        // Emit the current word
                        Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(),
                                context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_COUNT);
                        context.addLexeme(newLexeme);

                        if(!hit.isPrefix()){
                            // Not a word prefix: no further matching needed, remove the hit
                            this.countHits.remove(hit);
                        }
                    }else if(hit.isUnmatch()){
                        // The hit is not a word: remove it
                        this.countHits.remove(hit);
                    }
                }
            }

            //*********************************
            // Match the single character at the current cursor position
            Hit singleCharHit = Dictionary.matchInQuantifierDict(context.getSegmentBuff(), context.getCursor(), 1);
            if(singleCharHit.isMatch()){
                // The single character is itself a classifier: emit it
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_COUNT);
                context.addLexeme(newLexeme);

                // It may also be a word prefix
                if(singleCharHit.isPrefix()){
                    // Queue the prefix hit for further matching
                    this.countHits.add(singleCharHit);
                }
            }else if(singleCharHit.isPrefix()){
                // The character is a classifier prefix: queue the hit
                this.countHits.add(singleCharHit);
            }

        }else{
            // Not a Chinese character: discard unfinished classifiers
            this.countHits.clear();
        }

        // The buffer is consumed but classifiers are still pending: discard them
        if(context.isBufferConsumed()){
            this.countHits.clear();
        }
    }

    /**
     * Decides whether a classifier scan is needed.
     * @return
     */
    private boolean needCountScan(AnalyzeContext context){
        if((nStart != -1 && nEnd != -1) || !countHits.isEmpty()){
            // A Chinese numeral or a classifier is currently being processed
            return true;
        }else{
            // Look for an immediately preceding numeral
            if(!context.getOrgLexemes().isEmpty()){
                Lexeme l = context.getOrgLexemes().peekLast();
                if(Lexeme.TYPE_CNUM == l.getLexemeType() || Lexeme.TYPE_ARABIC == l.getLexemeType()){
                    if(l.getBegin() + l.getLength() == context.getCursor()){
                        return true;
                    }
                }
            }
        }
        return false;
    }

    /**
     * Adds a numeral lexeme to the result set.
     * @param context
     */
    private void outputNumLexeme(AnalyzeContext context){
        if(nStart > -1 && nEnd > -1){
            // Emit the numeral
            Lexeme newLexeme = new Lexeme(context.getBufferOffset(), nStart, nEnd - nStart + 1, Lexeme.TYPE_CNUM);
            context.addLexeme(newLexeme);
        }
    }
}
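The numeral detection above hinges on a precomputed character set built from the `Chn_Num` string. A self-contained sketch of that lookup (the class and method names are illustrative, but the character string is copied verbatim from the segmenter):

```java
import java.util.HashSet;
import java.util.Set;

// Standalone sketch of the Chn_Num character-set check that
// CN_QuantifierSegmenter uses to detect Chinese numerals.
public class ChnNumDemo {
    static final String CHN_NUM = "一二两三四五六七八九十零壹贰叁肆伍陆柒捌玖拾百千万亿拾佰仟萬億兆卅廿";
    static final Set<Character> CHN_NUMBER_CHARS = new HashSet<Character>();
    static {
        // Build the set once so per-character checks are O(1)
        for (char c : CHN_NUM.toCharArray()) {
            CHN_NUMBER_CHARS.add(c);
        }
    }

    public static boolean isChineseNumeral(char c) {
        return CHN_NUMBER_CHARS.contains(c);
    }

    public static void main(String[] args) {
        System.out.println(isChineseNumeral('三')); // a numeral
        System.out.println(isChineseNumeral('年')); // not a numeral
    }
}
```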
src/main/java/org/wltea/analyzer/core/CharacterUtil.java
0 → 100644
/**
 * IK Chinese Analyzer, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 * Character type recognition utility class
 */
package org.wltea.analyzer.core;

/**
 * Character type recognition utility class.
 */
class CharacterUtil {

    public static final int CHAR_USELESS = 0;

    public static final int CHAR_ARABIC = 0X00000001;

    public static final int CHAR_ENGLISH = 0X00000002;

    public static final int CHAR_CHINESE = 0X00000004;

    public static final int CHAR_OTHER_CJK = 0X00000008;

    /**
     * Identifies the character type.
     * @param input
     * @return int one of the type constants defined in CharacterUtil
     */
    static int identifyCharType(char input){
        if(input >= '0' && input <= '9'){
            return CHAR_ARABIC;

        }else if((input >= 'a' && input <= 'z')
                || (input >= 'A' && input <= 'Z')){
            return CHAR_ENGLISH;

        }else{
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(input);

            if(ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A){
                // The currently recognized Chinese ideograph blocks
                return CHAR_CHINESE;

            }else if(ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS // full-width digits and CJK punctuation
                    // Korean blocks
                    || ub == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || ub == Character.UnicodeBlock.HANGUL_JAMO
                    || ub == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                    // Japanese blocks
                    || ub == Character.UnicodeBlock.HIRAGANA // hiragana
                    || ub == Character.UnicodeBlock.KATAKANA // katakana
                    || ub == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS){
                return CHAR_OTHER_CJK;
            }
        }
        // All other characters are left unprocessed
        return CHAR_USELESS;
    }

    /**
     * Normalizes a character (full-width to half-width, upper case to lower case).
     * @param input
     * @return char
     */
    static char regularize(char input){
        if (input == 12288) {
            input = (char) 32;

        }else if (input > 65280 && input < 65375) {
            input = (char) (input - 65248);

        }else if (input >= 'A' && input <= 'Z') {
            input += 32;
        }
        return input;
    }
}
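The normalization rules in `regularize` map the ideographic space (code point 12288, U+3000) to an ASCII space, shift the full-width ASCII block (65281..65374) down by 65248 to its half-width equivalents, and fold upper case to lower case. A standalone sketch with the same body (the demo class name is illustrative):

```java
// Standalone copy of the regularize() normalization above:
// full-width space -> ASCII space, full-width forms -> ASCII, A-Z -> a-z.
public class RegularizeDemo {
    public static char regularize(char input) {
        if (input == 12288) {                        // U+3000 ideographic space
            input = (char) 32;
        } else if (input > 65280 && input < 65375) { // full-width ASCII block
            input = (char) (input - 65248);
        } else if (input >= 'A' && input <= 'Z') {   // fold case
            input += 32;
        }
        return input;
    }

    public static void main(String[] args) {
        System.out.println(regularize('A'));             // 'a'
        System.out.println(regularize('3'));            // full-width digit -> '3'
        System.out.println(regularize((char) 12288) == ' '); // true
    }
}
```

Normalizing before dictionary lookup means the dictionaries only need to store half-width, lower-case forms.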
src/main/java/org/wltea/analyzer/core/IKArbitrator.java
0 → 100644
/**
 * IK Chinese Analyzer, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 */
package org.wltea.analyzer.core;

import java.util.Stack;
import java.util.TreeSet;

/**
 * IK ambiguity arbitrator.
 */
class IKArbitrator {

    IKArbitrator(){
    }

    /**
     * Ambiguity processing.
     * @param context
     * @param useSmart
     */
    void process(AnalyzeContext context, boolean useSmart){
        QuickSortSet orgLexemes = context.getOrgLexemes();
        Lexeme orgLexeme = orgLexemes.pollFirst();

        LexemePath crossPath = new LexemePath();
        while(orgLexeme != null){
            if(!crossPath.addCrossLexeme(orgLexeme)){
                // Found the next lexeme that does not overlap crossPath
                if(crossPath.size() == 1 || !useSmart){
                    // crossPath is unambiguous, or ambiguity handling is disabled:
                    // output the current crossPath directly
                    context.addLexemePath(crossPath);
                }else{
                    // Arbitrate the current crossPath
                    QuickSortSet.Cell headCell = crossPath.getHead();
                    LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
                    // Output the arbitration result
                    context.addLexemePath(judgeResult);
                }

                // Start a new crossPath containing orgLexeme
                crossPath = new LexemePath();
                crossPath.addCrossLexeme(orgLexeme);
            }
            orgLexeme = orgLexemes.pollFirst();
        }

        // Process the last path
        if(crossPath.size() == 1 || !useSmart){
            // crossPath is unambiguous, or ambiguity handling is disabled:
            // output the current crossPath directly
            context.addLexemePath(crossPath);
        }else{
            // Arbitrate the current crossPath
            QuickSortSet.Cell headCell = crossPath.getHead();
            LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
            // Output the arbitration result
            context.addLexemePath(judgeResult);
        }
    }

    /**
     * Ambiguity resolution.
     * @param lexemeCell head of the ambiguous lexeme chain
     * @param fullTextLength text length of the ambiguous path
     * @return
     */
    private LexemePath judge(QuickSortSet.Cell lexemeCell, int fullTextLength){
        // Candidate path set
        TreeSet<LexemePath> pathOptions = new TreeSet<LexemePath>();
        // Current candidate path
        LexemePath option = new LexemePath();

        // Traverse crossPath once, collecting the conflicting lexemes on a stack
        Stack<QuickSortSet.Cell> lexemeStack = this.forwardPath(lexemeCell, option);

        // The current chain may not be optimal: add it to the candidate set
        pathOptions.add(option.copy());

        // Handle the ambiguous lexemes
        QuickSortSet.Cell c = null;
        while(!lexemeStack.isEmpty()){
            c = lexemeStack.pop();
            // Roll back the lexeme chain
            this.backPath(c.getLexeme(), option);
            // Starting from the ambiguous position, recurse to generate an alternative
            this.forwardPath(c, option);
            pathOptions.add(option.copy());
        }

        // Return the best candidate in the set
        return pathOptions.first();
    }

    /**
     * Walks forward, adding lexemes to build a conflict-free chain.
     * @param lexemeCell
     * @param option
     * @return stack of conflicting cells
     */
    private Stack<QuickSortSet.Cell> forwardPath(QuickSortSet.Cell lexemeCell, LexemePath option){
        // Stack of conflicting lexemes
        Stack<QuickSortSet.Cell> conflictStack = new Stack<QuickSortSet.Cell>();
        QuickSortSet.Cell c = lexemeCell;
        // Iterate over the lexeme chain
        while(c != null && c.getLexeme() != null){
            if(!option.addNotCrossLexeme(c.getLexeme())){
                // The lexeme overlaps: push it onto the conflict stack
                conflictStack.push(c);
            }
            c = c.getNext();
        }
        return conflictStack;
    }

    /**
     * Rolls the chain back until it can accept the given lexeme.
     * @param l
     * @param option
     */
    private void backPath(Lexeme l, LexemePath option){
        while(option.checkCross(l)){
            option.removeTail();
        }
    }
}
src/main/java/org/wltea/analyzer/core/IKSegmenter.java
0 → 100644
/**
 * IK Chinese Analyzer, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 */
package org.wltea.analyzer.core;

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.common.logging.ESLogger;
import org.elasticsearch.common.logging.Loggers;

/**
 * IK segmenter main class.
 */
public final class IKSegmenter {

    // Character stream reader
    private Reader input;
    // Analysis context
    private AnalyzeContext context;
    // Sub-segmenter list
    private List<ISegmenter> segmenters;
    // Ambiguity arbitrator
    private IKArbitrator arbitrator;

    private ESLogger logger = null;

    private final boolean useSmart;

    /**
     * IK segmenter constructor.
     * @param input
     * @param useSmart true to enable the smart segmentation strategy
     *
     * Non-smart segmentation: fine-grained output of every possible split.
     * Smart segmentation: merges numerals and classifiers and arbitrates
     * ambiguities in the result.
     */
    public IKSegmenter(Reader input, boolean useSmart){
        logger = Loggers.getLogger("ik-analyzer");
        this.input = input;
        this.useSmart = useSmart;
        this.init();
    }

    /**
     * Initialization.
     */
    private void init(){
        // Initialize the analysis context
        this.context = new AnalyzeContext(useSmart);
        // Load the sub-segmenters
        this.segmenters = this.loadSegmenters();
        // Load the ambiguity arbitrator
        this.arbitrator = new IKArbitrator();
    }

    /**
     * Initializes the dictionary and loads the sub-segmenter implementations.
     * @return List<ISegmenter>
     */
    private List<ISegmenter> loadSegmenters(){
        List<ISegmenter> segmenters = new ArrayList<ISegmenter>(4);
        // Sub-segmenter for letters and digits
        segmenters.add(new LetterSegmenter());
        // Sub-segmenter for Chinese numerals and classifiers
        segmenters.add(new CN_QuantifierSegmenter());
        // Sub-segmenter for Chinese words
        segmenters.add(new CJKSegmenter());
        return segmenters;
    }

    /**
     * Returns the next lexeme.
     * @return Lexeme the lexeme object
     * @throws IOException
     */
    public synchronized Lexeme next() throws IOException {
        Lexeme l = null;
        while((l = context.getNextLexeme()) == null){
            /*
             * Read data from the reader to fill the buffer.
             * If the reader is consumed in several passes, the buffer is
             * shifted so that data read earlier but not yet processed is kept.
             */
            int available = context.fillBuffer(this.input);
            if(available <= 0){
                // The reader is exhausted
                context.reset();
                return null;
            }else{
                // Initialize the cursor
                context.initCursor();
                do{
                    // Run every sub-segmenter
                    for(ISegmenter segmenter : segmenters){
                        segmenter.analyze(context);
                    }
                    // The buffer is nearly consumed: new characters must be read
                    if(context.needRefillBuffer()){
                        break;
                    }
                    // Advance the cursor
                }while(context.moveCursor());
                // Reset the sub-segmenters for the next round
                for(ISegmenter segmenter : segmenters){
                    segmenter.reset();
                }
            }
            // Arbitrate ambiguities in the segmentation
            logger.error("useSmart:" + String.valueOf(useSmart));
            this.arbitrator.process(context, useSmart);
            // Push the results to the output set, emitting leftover single CJK characters
            context.outputToResult();
            // Record the buffer offset for this pass
            context.markBufferOffset();
        }
        return l;
    }

    /**
     * Resets the segmenter to its initial state.
     * @param input
     */
    public synchronized void reset(Reader input) {
        this.input = input;
        context.reset();
        for(ISegmenter segmenter : segmenters){
            segmenter.reset();
        }
    }
}
src/main/java/org/wltea/analyzer/core/ISegmenter.java
0 → 100644
/**
 * IK Chinese Analyzer, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 */
package org.wltea.analyzer.core;

/**
 * Sub-segmenter interface.
 */
interface ISegmenter {

    /**
     * Reads the next candidate lexeme from the analyzer.
     * @param context segmentation context
     */
    void analyze(AnalyzeContext context);

    /**
     * Resets the sub-segmenter state.
     */
    void reset();
}
src/main/java/org/wltea/analyzer/core/LetterSegmenter.java
0 → 100644
/**
 * IK Chinese Analyzer, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 */
package org.wltea.analyzer.core;

import java.util.Arrays;

/**
 * Sub-segmenter for English letters and Arabic digits.
 */
class LetterSegmenter implements ISegmenter {

    // Sub-segmenter name
    static final String SEGMENTER_NAME = "LETTER_SEGMENTER";
    // Connector characters for letters
    private static final char[] Letter_Connector = new char[]{'#', '&', '+', '-', '.', '@', '_'};
    // Connector characters for digits
    private static final char[] Num_Connector = new char[]{',', '.'};

    /*
     * Start position of the lexeme, doubling as the segmenter's state flag:
     * when start > -1 the segmenter is currently processing characters.
     */
    private int start;

    /*
     * End position of the lexeme: the last letter character seen
     * that is not a connector.
     */
    private int end;

    /*
     * Start position of an English letter run
     */
    private int englishStart;

    /*
     * End position of an English letter run
     */
    private int englishEnd;

    /*
     * Start position of an Arabic digit run
     */
    private int arabicStart;

    /*
     * End position of an Arabic digit run
     */
    private int arabicEnd;

    LetterSegmenter(){
        Arrays.sort(Letter_Connector);
        Arrays.sort(Num_Connector);
        this.start = -1;
        this.end = -1;
        this.englishStart = -1;
        this.englishEnd = -1;
        this.arabicStart = -1;
        this.arabicEnd = -1;
    }

    /* (non-Javadoc)
     * @see org.wltea.analyzer.core.ISegmenter#analyze(org.wltea.analyzer.core.AnalyzeContext)
     */
    public void analyze(AnalyzeContext context) {
        boolean bufferLockFlag = false;
        // Process English letters
        bufferLockFlag = this.processEnglishLetter(context) || bufferLockFlag;
        // Process Arabic digits
        bufferLockFlag = this.processArabicLetter(context) || bufferLockFlag;
        // Process mixed runs (this must come last, so QuickSortSet can drop duplicates)
        bufferLockFlag = this.processMixLetter(context) || bufferLockFlag;

        // Lock or unlock the buffer
        if(bufferLockFlag){
            context.lockBuffer(SEGMENTER_NAME);
        }else{
            // Unlock the buffer
            context.unlockBuffer(SEGMENTER_NAME);
        }
    }

    /* (non-Javadoc)
     * @see org.wltea.analyzer.core.ISegmenter#reset()
     */
    public void reset() {
        this.start = -1;
        this.end = -1;
        this.englishStart = -1;
        this.englishEnd = -1;
        this.arabicStart = -1;
        this.arabicEnd = -1;
    }

    /**
     * Processes mixed digit/letter runs,
     * e.g. windows2000 | linliangyi2005@gmail.com
     * @param context
     * @return
     */
    private boolean processMixLetter(AnalyzeContext context){
        boolean needLock = false;

        if(this.start == -1){
            // The segmenter has not started processing yet
            if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()
                    || CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
                // Record the start pointer: the segmenter enters the processing state
                this.start = context.getCursor();
                this.end = start;
            }
        }else{
            // The segmenter is processing characters
            if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()
                    || CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
                // Record a possible end position
                this.end = context.getCursor();
            }else if(CharacterUtil.CHAR_USELESS == context.getCurrentCharType()
                    && this.isLetterConnector(context.getCurrentChar())){
                // Record a possible end position
                this.end = context.getCursor();
            }else{
                // A non-letter character: emit the lexeme
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.start,
                        this.end - this.start + 1, Lexeme.TYPE_LETTER);
                context.addLexeme(newLexeme);
                this.start = -1;
                this.end = -1;
            }
        }

        // If the buffer has been consumed
        if(context.isBufferConsumed()){
            if(this.start != -1 && this.end != -1){
                // Buffer fully read: emit the lexeme
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.start,
                        this.end - this.start + 1, Lexeme.TYPE_LETTER);
                context.addLexeme(newLexeme);
                this.start = -1;
                this.end = -1;
            }
        }

        // Decide whether to lock the buffer
        if(this.start == -1 && this.end == -1){
            // Unlock the buffer
            needLock = false;
        }else{
            needLock = true;
        }
        return needLock;
    }

    /**
     * Processes pure English letter runs.
     * @param context
     * @return
     */
    private boolean processEnglishLetter(AnalyzeContext context){
        boolean needLock = false;

        if(this.englishStart == -1){
            // The segmenter has not started processing English characters yet
            if(CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
                // Record the start pointer: the segmenter enters the processing state
                this.englishStart = context.getCursor();
                this.englishEnd = this.englishStart;
            }
        }else{
            // The segmenter is processing English characters
            if(CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
                // Record the current cursor as the end position
                this.englishEnd = context.getCursor();
            }else{
                // A non-English character: emit the lexeme
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.englishStart,
                        this.englishEnd - this.englishStart + 1, Lexeme.TYPE_ENGLISH);
                context.addLexeme(newLexeme);
                this.englishStart = -1;
                this.englishEnd = -1;
            }
        }

        // If the buffer has been consumed
        if(context.isBufferConsumed()){
            if(this.englishStart != -1 && this.englishEnd != -1){
                // Buffer fully read: emit the lexeme
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.englishStart,
                        this.englishEnd - this.englishStart + 1, Lexeme.TYPE_ENGLISH);
                context.addLexeme(newLexeme);
                this.englishStart = -1;
                this.englishEnd = -1;
            }
        }

        // Decide whether to lock the buffer
        if(this.englishStart == -1 && this.englishEnd == -1){
            // Unlock the buffer
            needLock = false;
        }else{
            needLock = true;
        }
        return needLock;
    }

    /**
     * Processes Arabic digit runs.
     * @param context
     * @return
     */
    private boolean processArabicLetter(AnalyzeContext context){
        boolean needLock = false;

        if(this.arabicStart == -1){
            // The segmenter has not started processing digit characters yet
            if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()){
                // Record the start pointer: the segmenter enters the processing state
                this.arabicStart = context.getCursor();
                this.arabicEnd = this.arabicStart;
            }
        }else{
            // The segmenter is processing digit characters
            if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()){
                // Record the current cursor as the end position
                this.arabicEnd = context.getCursor();
            }else if(CharacterUtil.CHAR_USELESS == context.getCurrentCharType()
                    && this.isNumConnector(context.getCurrentChar())){
                // Do not emit the digits yet, but do not mark the end either
            }else{
                // A non-Arabic character: emit the lexeme
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.arabicStart,
                        this.arabicEnd - this.arabicStart + 1, Lexeme.TYPE_ARABIC);
                context.addLexeme(newLexeme);
                this.arabicStart = -1;
                this.arabicEnd = -1;
            }
        }

        // If the buffer has been consumed
        if(context.isBufferConsumed()){
            if(this.arabicStart != -1 && this.arabicEnd != -1){
                // Emit the segmented lexeme
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.arabicStart,
                        this.arabicEnd - this.arabicStart + 1, Lexeme.TYPE_ARABIC);
                context.addLexeme(newLexeme);
                this.arabicStart = -1;
                this.arabicEnd = -1;
            }
        }

        // Decide whether to lock the buffer
        if(this.arabicStart == -1 && this.arabicEnd == -1){
            // Unlock the buffer
            needLock = false;
        }else{
            needLock = true;
        }
        return needLock;
    }

    /**
     * Checks whether the character is a letter connector.
     * @param input
     * @return
     */
    private boolean isLetterConnector(char input){
        int index = Arrays.binarySearch(Letter_Connector, input);
        return index >= 0;
    }

    /**
     * Checks whether the character is a digit connector.
     * @param input
     * @return
     */
    private boolean isNumConnector(char input){
        int index = Arrays.binarySearch(Num_Connector, input);
        return index >= 0;
    }
}
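The connector checks above sort each small `char[]` once in the constructor so that `Arrays.binarySearch` can answer membership in O(log n) without allocating a set. A self-contained sketch of the same idiom (demo class and method names are illustrative):

```java
import java.util.Arrays;

// Standalone sketch of the sorted-array + binarySearch connector test that
// LetterSegmenter uses for '#', '&', '+', '-', '.', '@', '_'.
public class ConnectorDemo {
    static final char[] LETTER_CONNECTOR = new char[]{'#', '&', '+', '-', '.', '@', '_'};
    static {
        Arrays.sort(LETTER_CONNECTOR); // binarySearch requires a sorted array
    }

    public static boolean isLetterConnector(char input) {
        return Arrays.binarySearch(LETTER_CONNECTOR, input) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isLetterConnector('@')); // connector: keeps emails as one token
        System.out.println(isLetterConnector(' ')); // not a connector
    }
}
```

This is why `linliangyi2005@gmail.com` survives as a single `TYPE_LETTER` token: `@`, `.`, and the like extend the run instead of ending it.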
src/main/java/org/wltea/analyzer/Lexeme.java → src/main/java/org/wltea/analyzer/core/Lexeme.java
/**
 * 
 */
package org.wltea.analyzer;

public final class Lexeme implements Comparable<Lexeme>{
    public static final int TYPE_CJK_NORMAL = 0;
    public static final int TYPE_CJK_SN = 1;
    public static final int TYPE_CJK_SF = 2;
    public static final int TYPE_CJK_UNKNOWN = 3;
    public static final int TYPE_NUM = 10;
    public static final int TYPE_NUMCOUNT = 11;
    public static final int TYPE_LETTER = 20;

    private int offset;
    private int begin;
    private int length;
    private String lexemeText;
    private int lexemeType;
    private Lexeme prev;
    private Lexeme next;

    public Lexeme(int offset, int begin, int length, int lexemeType){
        this.offset = offset;
        this.begin = begin;
        if(length < 0){
            throw new IllegalArgumentException("length < 0");
        }
        this.length = length;
        this.lexemeType = lexemeType;
    }

    public boolean equals(Object o){
        if(o == null){
            return false;
        }
        if(this == o){
            return true;
        }
        if(o instanceof Lexeme){
            Lexeme other = (Lexeme)o;
            if(this.offset == other.getOffset()
                    && this.begin == other.getBegin()
                    && this.length == other.getLength()){
                return true;
            }else{
                return false;
            }
        }else{
            return false;
        }
    }

    public int hashCode(){
        int absBegin = getBeginPosition();
        int absEnd = getEndPosition();
        return (absBegin * 37) + (absEnd * 31) + ((absBegin * absEnd) % getLength()) * 11;
    }

    public int compareTo(Lexeme other) {
        if(this.begin < other.getBegin()){
            return -1;
        }else if(this.begin == other.getBegin()){
            if(this.length > other.getLength()){
                return -1;
            }else if(this.length == other.getLength()){
                return 0;
            }else{
                return 1;
            }
        }else{
            return 1;
        }
    }

    public boolean isOverlap(Lexeme other){
        if(other != null){
            if(this.getBeginPosition() <= other.getBeginPosition()
                    && this.getEndPosition() >= other.getEndPosition()){
                return true;
            }else if(this.getBeginPosition() >= other.getBeginPosition()
                    && this.getEndPosition() <= other.getEndPosition()){
                return true;
            }else{
                return false;
            }
        }
        return false;
    }

    public int getOffset() {
        return offset;
    }

    public void setOffset(int offset) {
        this.offset = offset;
    }

    public int getBegin() {
        return begin;
    }

    public int getBeginPosition(){
        return offset + begin;
    }

    public void setBegin(int begin) {
        this.begin = begin;
    }

    public int getEndPosition(){
        return offset + begin + length;
    }

    public int getLength(){
        return this.length;
    }

    public void setLength(int length) {
        if(this.length < 0){
            throw new IllegalArgumentException("length < 0");
        }
        this.length = length;
    }

    public String getLexemeText() {
        if(lexemeText == null){
            return "";
        }
        return lexemeText;
    }

    public void setLexemeText(String lexemeText) {
        if(lexemeText == null){
            this.lexemeText = "";
            this.length = 0;
        }else{
            this.lexemeText = lexemeText;
            this.length = lexemeText.length();
        }
    }

    public int getLexemeType() {
        return lexemeType;
    }

    public void setLexemeType(int lexemeType) {
        this.lexemeType = lexemeType;
    }

    public String toString(){
        StringBuffer strbuf = new StringBuffer();
        strbuf.append(this.getBeginPosition()).append("-").append(this.getEndPosition());
        strbuf.append(" : ").append(this.lexemeText).append(" : \t");
        switch(lexemeType) {
            case TYPE_CJK_NORMAL :
                strbuf.append("CJK_NORMAL");
                break;
            case TYPE_CJK_SF :
                strbuf.append("CJK_SUFFIX");
                break;
            case TYPE_CJK_SN :
                strbuf.append("CJK_NAME");
                break;
            case TYPE_CJK_UNKNOWN :
                strbuf.append("UNKNOWN");
                break;
            case TYPE_NUM :
                strbuf.append("NUMEBER");
                break;
            case TYPE_NUMCOUNT :
                strbuf.append("COUNT");
                break;
            case TYPE_LETTER :
                strbuf.append("LETTER");
                break;
        }
        return strbuf.toString();
    }

    Lexeme getPrev() {
        return prev;
    }

    void setPrev(Lexeme prev) {
        this.prev = prev;
    }

    Lexeme getNext() {
        return next;
    }

    void setNext(Lexeme next) {
        this.next = next;
    }
}
\ No newline at end of file
/**
 * IK Chinese word segmenter, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 */
package org.wltea.analyzer.core;

/**
 * IK lexeme (token) object
 */
public class Lexeme implements Comparable<Lexeme>{
    //lexemeType constants
    //unknown
    public static final int TYPE_UNKNOWN = 0;
    //English word
    public static final int TYPE_ENGLISH = 1;
    //Arabic numerals
    public static final int TYPE_ARABIC = 2;
    //mixed English letters and digits
    public static final int TYPE_LETTER = 3;
    //Chinese word
    public static final int TYPE_CNWORD = 4;
    //single Chinese character
    public static final int TYPE_CNCHAR = 64;
    //Japanese and Korean characters
    public static final int TYPE_OTHER_CJK = 8;
    //Chinese numeral
    public static final int TYPE_CNUM = 16;
    //Chinese measure word
    public static final int TYPE_COUNT = 32;
    //Chinese numeral + measure word
    public static final int TYPE_CQUAN = 48;

    //start offset of the lexeme
    private int offset;
    //relative begin position of the lexeme
    private int begin;
    //length of the lexeme
    private int length;
    //text of the lexeme
    private String lexemeText;
    //type of the lexeme
    private int lexemeType;

    public Lexeme(int offset, int begin, int length, int lexemeType){
        this.offset = offset;
        this.begin = begin;
        if(length < 0){
            throw new IllegalArgumentException("length < 0");
        }
        this.length = length;
        this.lexemeType = lexemeType;
    }

    /*
     * Lexeme equality:
     * equal start offset, begin position and end position
     * @see java.lang.Object#equals(Object o)
     */
    public boolean equals(Object o){
        if(o == null){
            return false;
        }
        if(this == o){
            return true;
        }
        if(o instanceof Lexeme){
            Lexeme other = (Lexeme)o;
            if(this.offset == other.getOffset()
                    && this.begin == other.getBegin()
                    && this.length == other.getLength()){
                return true;
            }else{
                return false;
            }
        }else{
            return false;
        }
    }

    /*
     * Lexeme hash code
     * @see java.lang.Object#hashCode()
     */
    public int hashCode(){
        int absBegin = getBeginPosition();
        int absEnd = getEndPosition();
        return (absBegin * 37) + (absEnd * 31) + ((absBegin * absEnd) % getLength()) * 11;
    }

    /*
     * Ordering of lexemes in sorted collections
     * @see java.lang.Comparable#compareTo(java.lang.Object)
     */
    public int compareTo(Lexeme other) {
        //earlier begin position first
        if(this.begin < other.getBegin()){
            return -1;
        }else if(this.begin == other.getBegin()){
            //longer lexeme first
            if(this.length > other.getLength()){
                return -1;
            }else if(this.length == other.getLength()){
                return 0;
            }else{
                //this.length < other.getLength()
                return 1;
            }
        }else{
            //this.begin > other.getBegin()
            return 1;
        }
    }

    public int getOffset() {
        return offset;
    }

    public void setOffset(int offset) {
        this.offset = offset;
    }

    public int getBegin() {
        return begin;
    }

    /**
     * Begin position of the lexeme in the text
     * @return int
     */
    public int getBeginPosition(){
        return offset + begin;
    }

    public void setBegin(int begin) {
        this.begin = begin;
    }

    /**
     * End position of the lexeme in the text
     * @return int
     */
    public int getEndPosition(){
        return offset + begin + length;
    }

    /**
     * Character length of the lexeme
     * @return int
     */
    public int getLength(){
        return this.length;
    }

    public void setLength(int length) {
        if(this.length < 0){
            throw new IllegalArgumentException("length < 0");
        }
        this.length = length;
    }

    /**
     * Text content of the lexeme
     * @return String
     */
    public String getLexemeText() {
        if(lexemeText == null){
            return "";
        }
        return lexemeText;
    }

    public void setLexemeText(String lexemeText) {
        if(lexemeText == null){
            this.lexemeText = "";
            this.length = 0;
        }else{
            this.lexemeText = lexemeText;
            this.length = lexemeText.length();
        }
    }

    /**
     * Type of the lexeme
     * @return int
     */
    public int getLexemeType() {
        return lexemeType;
    }

    /**
     * String label for the lexeme type
     * @return String
     */
    public String getLexemeTypeString(){
        switch(lexemeType) {
            case TYPE_ENGLISH :
                return "ENGLISH";
            case TYPE_ARABIC :
                return "ARABIC";
            case TYPE_LETTER :
                return "LETTER";
            case TYPE_CNWORD :
                return "CN_WORD";
            case TYPE_CNCHAR :
                return "CN_CHAR";
            case TYPE_OTHER_CJK :
                return "OTHER_CJK";
            case TYPE_COUNT :
                return "COUNT";
            case TYPE_CNUM :
                return "TYPE_CNUM";
            case TYPE_CQUAN:
                return "TYPE_CQUAN";
            default :
                return "UNKNOWN";
        }
    }

    public void setLexemeType(int lexemeType) {
        this.lexemeType = lexemeType;
    }

    /**
     * Merge two adjacent lexemes
     * @param l
     * @param lexemeType
     * @return boolean whether the merge succeeded
     */
    public boolean append(Lexeme l, int lexemeType){
        if(l != null && this.getEndPosition() == l.getBeginPosition()){
            this.length += l.getLength();
            this.lexemeType = lexemeType;
            return true;
        }else{
            return false;
        }
    }

    /**
     *
     */
    public String toString(){
        StringBuffer strbuf = new StringBuffer();
        strbuf.append(this.getBeginPosition()).append("-").append(this.getEndPosition());
        strbuf.append(" : ").append(this.lexemeText).append(" : \t");
        strbuf.append(this.getLexemeTypeString());
        return strbuf.toString();
    }
}
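The new lexeme type constants are bit flags: TYPE_CQUAN (48) is the bitwise OR of TYPE_CNUM (16) and TYPE_COUNT (32), the combined type produced when a Chinese numeral and a following measure word are merged into one lexeme via append(). The sketch below mirrors that merge rule; the class name LexemeSketch and its trimmed-down fields are illustrative, not part of the plugin.

```java
// Sketch: how Lexeme.append merges two adjacent lexemes into one
// combined-type lexeme. Mirrors org.wltea.analyzer.core.Lexeme in
// miniature; LexemeSketch is a hypothetical stand-in, not IK's API.
public class LexemeSketch {
    static final int TYPE_CNUM = 16;   // Chinese numeral
    static final int TYPE_COUNT = 32;  // Chinese measure word
    static final int TYPE_CQUAN = 48;  // numeral + measure word = 16 | 32

    int offset, begin, length, type;

    LexemeSketch(int offset, int begin, int length, int type) {
        this.offset = offset;
        this.begin = begin;
        this.length = length;
        this.type = type;
    }

    int beginPosition() { return offset + begin; }
    int endPosition()   { return offset + begin + length; }

    // Same rule as Lexeme.append: merge only when 'l' starts exactly
    // where this lexeme ends; the merged lexeme takes the new type.
    boolean append(LexemeSketch l, int newType) {
        if (l != null && this.endPosition() == l.beginPosition()) {
            this.length += l.length;
            this.type = newType;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        LexemeSketch num = new LexemeSketch(0, 0, 1, TYPE_CNUM);    // e.g. "三"
        LexemeSketch count = new LexemeSketch(0, 1, 1, TYPE_COUNT); // e.g. "个"
        boolean merged = num.append(count, TYPE_CQUAN);
        System.out.println(merged + " type=" + num.type + " len=" + num.length);
        // → true type=48 len=2
    }
}
```

Because the types are disjoint bits, a consumer can test a merged lexeme for either component with a mask, e.g. `(type & TYPE_CNUM) != 0`.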
src/main/java/org/wltea/analyzer/core/LexemePath.java
0 → 100644
View file @ 7d91be50
/**
 * IK Chinese word segmenter, version 5.0
 * IK Analyzer release 5.0
 * (Apache License, Version 2.0 notice omitted here; identical to the header above)
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 */
package org.wltea.analyzer.core;

/**
 * Lexeme chain (path)
 */
class LexemePath extends QuickSortSet implements Comparable<LexemePath>{

    //begin position of the path
    private int pathBegin;
    //end position of the path
    private int pathEnd;
    //effective character length covered by the lexeme chain
    private int payloadLength;

    LexemePath(){
        this.pathBegin = -1;
        this.pathEnd = -1;
        this.payloadLength = 0;
    }

    /**
     * Append a lexeme that crosses (overlaps) this LexemePath
     * @param lexeme
     * @return
     */
    boolean addCrossLexeme(Lexeme lexeme){
        if(this.isEmpty()){
            this.addLexeme(lexeme);
            this.pathBegin = lexeme.getBegin();
            this.pathEnd = lexeme.getBegin() + lexeme.getLength();
            this.payloadLength += lexeme.getLength();
            return true;
        }else if(this.checkCross(lexeme)){
            this.addLexeme(lexeme);
            if(lexeme.getBegin() + lexeme.getLength() > this.pathEnd){
                this.pathEnd = lexeme.getBegin() + lexeme.getLength();
            }
            this.payloadLength = this.pathEnd - this.pathBegin;
            return true;
        }else{
            return false;
        }
    }

    /**
     * Append a lexeme that does not cross this LexemePath
     * @param lexeme
     * @return
     */
    boolean addNotCrossLexeme(Lexeme lexeme){
        if(this.isEmpty()){
            this.addLexeme(lexeme);
            this.pathBegin = lexeme.getBegin();
            this.pathEnd = lexeme.getBegin() + lexeme.getLength();
            this.payloadLength += lexeme.getLength();
            return true;
        }else if(this.checkCross(lexeme)){
            return false;
        }else{
            this.addLexeme(lexeme);
            this.payloadLength += lexeme.getLength();
            Lexeme head = this.peekFirst();
            this.pathBegin = head.getBegin();
            Lexeme tail = this.peekLast();
            this.pathEnd = tail.getBegin() + tail.getLength();
            return true;
        }
    }

    /**
     * Remove the tail lexeme of the path
     * @return
     */
    Lexeme removeTail(){
        Lexeme tail = this.pollLast();
        if(this.isEmpty()){
            this.pathBegin = -1;
            this.pathEnd = -1;
            this.payloadLength = 0;
        }else{
            this.payloadLength -= tail.getLength();
            Lexeme newTail = this.peekLast();
            this.pathEnd = newTail.getBegin() + newTail.getLength();
        }
        return tail;
    }

    /**
     * Check whether a lexeme's position crosses this path (an ambiguous split)
     * @param lexeme
     * @return
     */
    boolean checkCross(Lexeme lexeme){
        return (lexeme.getBegin() >= this.pathBegin && lexeme.getBegin() < this.pathEnd)
                || (this.pathBegin >= lexeme.getBegin()
                        && this.pathBegin < lexeme.getBegin() + lexeme.getLength());
    }

    int getPathBegin() {
        return pathBegin;
    }

    int getPathEnd() {
        return pathEnd;
    }

    /**
     * Effective word length of the path
     * @return
     */
    int getPayloadLength(){
        return this.payloadLength;
    }

    /**
     * Span length of the LexemePath
     * @return
     */
    int getPathLength(){
        return this.pathEnd - this.pathBegin;
    }

    /**
     * X weight (product of the lexeme lengths)
     * @return
     */
    int getXWeight(){
        int product = 1;
        Cell c = this.getHead();
        while(c != null && c.getLexeme() != null){
            product *= c.getLexeme().getLength();
            c = c.getNext();
        }
        return product;
    }

    /**
     * Lexeme position weight
     * @return
     */
    int getPWeight(){
        int pWeight = 0;
        int p = 0;
        Cell c = this.getHead();
        while(c != null && c.getLexeme() != null){
            p++;
            pWeight += p * c.getLexeme().getLength();
            c = c.getNext();
        }
        return pWeight;
    }

    LexemePath copy(){
        LexemePath theCopy = new LexemePath();
        theCopy.pathBegin = this.pathBegin;
        theCopy.pathEnd = this.pathEnd;
        theCopy.payloadLength = this.payloadLength;
        Cell c = this.getHead();
        while(c != null && c.getLexeme() != null){
            theCopy.addLexeme(c.getLexeme());
            c = c.getNext();
        }
        return theCopy;
    }

    public int compareTo(LexemePath o) {
        //compare the effective text length
        if(this.payloadLength > o.payloadLength){
            return -1;
        }else if(this.payloadLength < o.payloadLength){
            return 1;
        }else{
            //compare the lexeme count: fewer is better
            if(this.size() < o.size()){
                return -1;
            }else if(this.size() > o.size()){
                return 1;
            }else{
                //a wider path span is better
                if(this.getPathLength() > o.getPathLength()){
                    return -1;
                }else if(this.getPathLength() < o.getPathLength()){
                    return 1;
                }else{
                    //statistically, reverse segmentation beats forward segmentation,
                    //so the path that ends later wins
                    if(this.pathEnd > o.pathEnd){
                        return -1;
                    }else if(pathEnd < o.pathEnd){
                        return 1;
                    }else{
                        //the more even the word lengths, the better
                        if(this.getXWeight() > o.getXWeight()){
                            return -1;
                        }else if(this.getXWeight() < o.getXWeight()){
                            return 1;
                        }else{
                            //compare the lexeme position weight
                            if(this.getPWeight() > o.getPWeight()){
                                return -1;
                            }else if(this.getPWeight() < o.getPWeight()){
                                return 1;
                            }
                        }
                    }
                }
            }
        }
        return 0;
    }

    public String toString(){
        StringBuffer sb = new StringBuffer();
        sb.append("pathBegin : ").append(pathBegin).append("\r\n");
        sb.append("pathEnd : ").append(pathEnd).append("\r\n");
        sb.append("payloadLength : ").append(payloadLength).append("\r\n");
        Cell head = this.getHead();
        while(head != null){
            sb.append("lexeme : ").append(head.getLexeme()).append("\r\n");
            head = head.getNext();
        }
        return sb.toString();
    }
}
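LexemePath.compareTo is the heart of IK's ambiguity arbitration: candidate segmentations are ranked by a fixed tie-break chain — more text covered, then fewer lexemes, then a wider span, then the later-ending path (the reverse-segmentation bias), then a larger length product (more even word lengths), then a larger position weight. A sketch of the same chain as a standalone comparator over precomputed path statistics; the class and field names are illustrative, not the plugin's API.

```java
import java.util.Comparator;

// Sketch: the six-step tie-break chain of LexemePath.compareTo,
// expressed over a simplified record of precomputed path statistics.
public class PathRank {
    static final class Path {
        final int payloadLength; // characters covered by lexemes
        final int size;          // number of lexemes in the path
        final int pathLength;    // pathEnd - pathBegin
        final int pathEnd;
        final int xWeight;       // product of lexeme lengths
        final int pWeight;       // sum of position * length
        Path(int payloadLength, int size, int pathLength,
             int pathEnd, int xWeight, int pWeight) {
            this.payloadLength = payloadLength; this.size = size;
            this.pathLength = pathLength; this.pathEnd = pathEnd;
            this.xWeight = xWeight; this.pWeight = pWeight;
        }
    }

    // "Better" paths sort first, in the same order as LexemePath.compareTo.
    static final Comparator<Path> IK_ORDER = Comparator
            .comparingInt((Path p) -> -p.payloadLength) // more text covered
            .thenComparingInt(p -> p.size)              // fewer lexemes
            .thenComparingInt(p -> -p.pathLength)       // wider span
            .thenComparingInt(p -> -p.pathEnd)          // ends later (reverse bias)
            .thenComparingInt(p -> -p.xWeight)          // more even word lengths
            .thenComparingInt(p -> -p.pWeight);         // position weight

    public static void main(String[] args) {
        // Same 4-char coverage, but a (2,2) split beats (1,1,2): fewer lexemes.
        Path twoWords = new Path(4, 2, 4, 4, 4, 1 * 2 + 2 * 2);
        Path threeWords = new Path(4, 3, 4, 4, 2, 1 + 2 + 3 * 2);
        System.out.println(IK_ORDER.compare(twoWords, threeWords) < 0);
        // → true
    }
}
```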
src/main/java/org/wltea/analyzer/core/QuickSortSet.java
0 → 100644
View file @ 7d91be50
/**
 * IK Chinese word segmenter, version 5.0
 * IK Analyzer release 5.0
 * (Apache License, Version 2.0 notice omitted here; identical to the header above)
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 */
package org.wltea.analyzer.core;

/**
 * Quick-sorted Lexeme set, dedicated to the IK segmenter
 */
class QuickSortSet {
    //head of the linked list
    private Cell head;
    //tail of the linked list
    private Cell tail;
    //actual size of the linked list
    private int size;

    QuickSortSet(){
        this.size = 0;
    }

    /**
     * Add a lexeme to the linked-list set
     * @param lexeme
     */
    boolean addLexeme(Lexeme lexeme){
        Cell newCell = new Cell(lexeme);
        if(this.size == 0){
            this.head = newCell;
            this.tail = newCell;
            this.size++;
            return true;
        }else{
            if(this.tail.compareTo(newCell) == 0){
                //equal to the tail lexeme: do not add
                return false;
            }else if(this.tail.compareTo(newCell) < 0){
                //append at the tail of the list
                this.tail.next = newCell;
                newCell.prev = this.tail;
                this.tail = newCell;
                this.size++;
                return true;
            }else if(this.head.compareTo(newCell) > 0){
                //insert at the head of the list
                this.head.prev = newCell;
                newCell.next = this.head;
                this.head = newCell;
                this.size++;
                return true;
            }else{
                //walk backwards from the tail
                Cell index = this.tail;
                while(index != null && index.compareTo(newCell) > 0){
                    index = index.prev;
                }
                if(index.compareTo(newCell) == 0){
                    //duplicate of a lexeme already in the set: do not add
                    return false;
                }else if(index.compareTo(newCell) < 0){
                    //insert somewhere in the middle of the list
                    newCell.prev = index;
                    newCell.next = index.next;
                    index.next.prev = newCell;
                    index.next = newCell;
                    this.size++;
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * Peek the head element of the list
     * @return
     */
    Lexeme peekFirst(){
        if(this.head != null){
            return this.head.lexeme;
        }
        return null;
    }

    /**
     * Take the first element out of the list
     * @return Lexeme
     */
    Lexeme pollFirst(){
        if(this.size == 1){
            Lexeme first = this.head.lexeme;
            this.head = null;
            this.tail = null;
            this.size--;
            return first;
        }else if(this.size > 1){
            Lexeme first = this.head.lexeme;
            this.head = this.head.next;
            this.size--;
            return first;
        }else{
            return null;
        }
    }

    /**
     * Peek the tail element of the list
     * @return
     */
    Lexeme peekLast(){
        if(this.tail != null){
            return this.tail.lexeme;
        }
        return null;
    }

    /**
     * Take the last element out of the list
     * @return Lexeme
     */
    Lexeme pollLast(){
        if(this.size == 1){
            Lexeme last = this.head.lexeme;
            this.head = null;
            this.tail = null;
            this.size--;
            return last;
        }else if(this.size > 1){
            Lexeme last = this.tail.lexeme;
            this.tail = this.tail.prev;
            this.size--;
            return last;
        }else{
            return null;
        }
    }

    /**
     * Size of the set
     * @return
     */
    int size(){
        return this.size;
    }

    /**
     * Whether the set is empty
     * @return
     */
    boolean isEmpty(){
        return this.size == 0;
    }

    /**
     * Head of the lexeme list
     * @return
     */
    Cell getHead(){
        return this.head;
    }

    /**
     * QuickSortSet cell: a node of the doubly linked list
     */
    class Cell implements Comparable<Cell>{
        private Cell prev;
        private Cell next;
        private Lexeme lexeme;

        Cell(Lexeme lexeme){
            if(lexeme == null){
                throw new IllegalArgumentException("lexeme must not be null");
            }
            this.lexeme = lexeme;
        }

        public int compareTo(Cell o) {
            return this.lexeme.compareTo(o.lexeme);
        }

        public Cell getPrev(){
            return this.prev;
        }

        public Cell getNext(){
            return this.next;
        }

        public Lexeme getLexeme(){
            return this.lexeme;
        }
    }
}
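Observationally, QuickSortSet behaves like a set sorted by Lexeme.compareTo — sorted iteration, silent duplicate rejection, and peek/poll at both ends — implemented as a doubly linked list so that inserts near the tail (the common case while scanning left to right) are cheap. The contract can be illustrated with the JDK's TreeSet standing in for the custom list; plain integers replace lexemes here.

```java
import java.util.TreeSet;

// Sketch: the observable contract of QuickSortSet — sorted order,
// duplicate rejection, peek/poll at both ends — shown with a TreeSet.
public class QuickSortSetContract {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        System.out.println(set.add(5)); // → true
        System.out.println(set.add(1)); // → true
        System.out.println(set.add(5)); // duplicate rejected → false
        set.add(3);
        System.out.println(set.first() + " " + set.last()); // → 1 5
        System.out.println(set.pollFirst()); // → 1
        System.out.println(set);             // → [3, 5]
    }
}
```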
src/main/java/org/wltea/analyzer/dic/DictSegment.java
View file @ 7d91be50
 /**
  * IK Chinese word segmenter, version 5.0
  * IK Analyzer release 5.0
  * (Apache License, Version 2.0 notice omitted here; identical to the header above)
  * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
  * Copyright 2012, Oolong Studio
  * provided by Linliangyi and copyright 2012 by Oolong studio
  */
 package org.wltea.analyzer.dic;
 
 import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
 
-public class DictSegment {
+/**
+ * A segment of the dictionary trie: one branch of the tree
+ */
+class DictSegment implements Comparable<DictSegment>{
 
 	//shared character table storing Chinese characters
 	private static final Map<Character,Character> charMap = new HashMap<Character,Character>(16, 0.95f);
 	//upper limit of the array container
 	private static final int ARRAY_LENGTH_LIMIT = 3;
 
-	private Character nodeChar;
 	//Map storage for the children
 	private Map<Character,DictSegment> childrenMap;
 	//array storage for the children
 	private DictSegment[] childrenArray;
-	private int storeSize = 0;
+	//character stored on this node
+	private Character nodeChar;
+	//number of child segments on this node
+	//storeSize <= ARRAY_LENGTH_LIMIT uses the array; storeSize > ARRAY_LENGTH_LIMIT switches to the Map
+	private int storeSize = 0;
 	//state of this DictSegment: 0 by default, 1 means the path from the root to this node is a word
 	private int nodeState = 0;
 
-	public DictSegment(Character nodeChar){
+	DictSegment(Character nodeChar){
 		if(nodeChar == null){
 			throw new IllegalArgumentException("参数为空异常,字符不能为空");
 		}
 		this.nodeChar = nodeChar;
 	}
 
-	public int getDicNum(){
-		if(charMap != null) {
-			return charMap.size();
-		}
-		return 0;
-	}
-
-	public Character getNodeChar() {
+	Character getNodeChar() {
 		return nodeChar;
 	}
 
 	/*
 	 * whether this node has children
 	 */
-	public boolean hasNextNode(){
+	boolean hasNextNode(){
 		return this.storeSize > 0;
 	}
...
@@ -62,7 +78,7 @@ public class DictSegment {
 	 * @param charArray
 	 * @return Hit
 	 */
-	public Hit match(char[] charArray){
+	Hit match(char[] charArray){
 		return this.match(charArray, 0, charArray.length, null);
 	}
...
@@ -73,7 +89,7 @@ public class DictSegment {
 	 * @param length
 	 * @return Hit
 	 */
-	public Hit match(char[] charArray, int begin, int length){
+	Hit match(char[] charArray, int begin, int length){
 		return this.match(charArray, begin, length, null);
 	}
...
@@ -85,64 +101,64 @@ public class DictSegment {
 	 * @param searchHit
 	 * @return Hit
 	 */
-	public Hit match(char[] charArray, int begin, int length, Hit searchHit){
+	Hit match(char[] charArray, int begin, int length, Hit searchHit){
 		if(searchHit == null){
 			//no hit yet: create one
 			searchHit = new Hit();
 			//set the start position of the hit
 			searchHit.setBegin(begin);
 		}else{
 			//otherwise reset the hit state
 			searchHit.setUnmatch();
 		}
 		//set the current processing position of the hit
 		searchHit.setEnd(begin);
 
 		Character keyChar = new Character(charArray[begin]);
 		DictSegment ds = null;
 
 		//copy the instance fields into locals to avoid synchronization issues with concurrent updates
 		DictSegment[] segmentArray = this.childrenArray;
 		Map<Character,DictSegment> segmentMap = this.childrenMap;
 
 		//STEP1 look up the DictSegment for keyChar on this node
 		if(segmentArray != null){
-			for(DictSegment seg : segmentArray){
-				if(seg != null && seg.nodeChar.equals(keyChar)){
-					ds = seg;
-				}
-			}
+			//binary search in the array
+			DictSegment keySegment = new DictSegment(keyChar);
+			int position = Arrays.binarySearch(segmentArray, 0, this.storeSize, keySegment);
+			if(position >= 0){
+				ds = segmentArray[position];
+			}
 		}else if(segmentMap != null){
 			//look up in the map
 			ds = (DictSegment)segmentMap.get(keyChar);
 		}
 
 		//STEP2 a DictSegment was found: check the match state, then recurse or return
 		if(ds != null){
 			if(length > 1){
 				//the word is not fully matched yet, keep searching downwards
 				return ds.match(charArray, begin + 1, length - 1, searchHit);
 			}else if(length == 1){
 				//searching the last char
 				if(ds.nodeState == 1){
 					//mark the hit as a full match
 					searchHit.setMatch();
 				}
 				if(ds.hasNextNode()){
 					//mark the hit as a prefix match
 					searchHit.setPrefix();
 					//remember the DictSegment at the current position
 					searchHit.setMatchedDictSegment(ds);
 				}
 				return searchHit;
 			}
 		}
 		//STEP3 no DictSegment found: leave the hit unmatched
 		return searchHit;
 	}
...
@@ -150,8 +166,16 @@ public class DictSegment {
 	 * load a dictionary fragment
 	 * @param charArray
 	 */
-	public void fillSegment(char[] charArray){
-		this.fillSegment(charArray, 0, charArray.length);
+	void fillSegment(char[] charArray){
+		this.fillSegment(charArray, 0, charArray.length, 1);
+	}
+
+	/**
+	 * disable (mask) one word in the dictionary
+	 * @param charArray
+	 */
+	void disableSegment(char[] charArray){
+		this.fillSegment(charArray, 0, charArray.length, 0);
 	}
 
 	/**
...
@@ -159,86 +183,90 @@ public class DictSegment {
 	 * @param charArray
 	 * @param begin
 	 * @param length
+	 * @param enabled
 	 */
-	public synchronized void fillSegment(char[] charArray, int begin, int length){
+	private synchronized void fillSegment(char[] charArray, int begin, int length, int enabled){
 		//fetch the character object from the character table
 		Character beginChar = new Character(charArray[begin]);
 		Character keyChar = charMap.get(beginChar);
 		//if the character is not in the table yet, add it
 		if(keyChar == null){
 			charMap.put(beginChar, beginChar);
 			keyChar = beginChar;
 		}
 
-		DictSegment ds = lookforSegment(keyChar);
-		if(length > 1){
-			ds.fillSegment(charArray, begin + 1, length - 1);
-		}else if(length == 1){
-			ds.nodeState = 1;
+		//search this node's storage for keyChar's segment, creating it if absent
+		DictSegment ds = lookforSegment(keyChar, enabled);
+		if(ds != null){
+			//process the segment for keyChar
+			if(length > 1){
+				//the word is not fully inserted into the trie yet
+				ds.fillSegment(charArray, begin + 1, length - 1, enabled);
+			}else if(length == 1){
+				//last char of the word: set the node state to enabled,
+				//enabled=1 marks a complete word, enabled=0 masks the word in the dictionary
+				ds.nodeState = enabled;
+			}
 		}
 	}
 
 	/**
-	 * find the child segment for keyChar under this node,
-	 * creating a new segment if none is found
+	 * find the child segment for keyChar under this node
 	 * @param keyChar
+	 * @param create =1 create a new segment if none is found; =0 do not create, return null
 	 * @return
 	 */
-	private DictSegment lookforSegment(Character keyChar){
+	private DictSegment lookforSegment(Character keyChar, int create){
 
 		DictSegment ds = null;
 
 		if(this.storeSize <= ARRAY_LENGTH_LIMIT){
 			//get the array container, creating it if necessary
 			DictSegment[] segmentArray = getChildrenArray();
-			for(DictSegment segment : segmentArray){
-				if(segment != null && segment.nodeChar.equals(keyChar)){
-					ds = segment;
-					break;
-				}
-			}
-			if(ds == null){
-				ds = new DictSegment(keyChar);
+			//binary search in the array
+			DictSegment keySegment = new DictSegment(keyChar);
+			int position = Arrays.binarySearch(segmentArray, 0, this.storeSize, keySegment);
+			if(position >= 0){
+				ds = segmentArray[position];
+			}
+
+			//no matching segment found in the array
+			if(ds == null && create == 1){
+				ds = keySegment;
 				if(this.storeSize < ARRAY_LENGTH_LIMIT){
 					//the array still has room: store in the array
 					segmentArray[this.storeSize] = ds;
 					//segment count +1
 					this.storeSize++;
+					Arrays.sort(segmentArray, 0, this.storeSize);
 				}else{
 					//the array is full: switch to Map storage
 					//get the Map container, creating it if necessary
 					Map<Character,DictSegment> segmentMap = getChildrenMap();
 					//migrate the segments from the array into the Map
 					migrate(segmentArray, segmentMap);
 					//store the new segment
 					segmentMap.put(keyChar, ds);
 					//segment count +1; storeSize++ must run before releasing the array,
 					//so that an empty array is never observed even in extreme cases
 					this.storeSize++;
 					//release the array reference
 					this.childrenArray = null;
 				}
 			}
 		}else{
 			//get the Map container, creating it if necessary
 			Map<Character,DictSegment> segmentMap = getChildrenMap();
 			//search the Map
 			ds = (DictSegment)segmentMap.get(keyChar);
-			if(ds == null){
+			if(ds == null && create == 1){
 				//construct a new segment
 				ds = new DictSegment(keyChar);
 				segmentMap.put(keyChar, ds);
 				//segment count on this node +1
 				this.storeSize++;
 			}
 		}
...
@@ -288,5 +316,23 @@ public class DictSegment {
 		}
 	}
 
+	/**
+	 * implement the Comparable interface
+	 * @param o
+	 * @return int
+	 */
+	public int compareTo(DictSegment o) {
+		//compare the chars stored on the nodes
+		return this.nodeChar.compareTo(o.nodeChar);
+	}
+
+	public int getDicNum(){
+		if(charMap != null) {
+			return charMap.size();
+		}
+		return 0;
+	}
 }
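The match walk above reports two independent facts about a character sequence: whether the path so far is a complete word (nodeState == 1) and whether it is a prefix of longer words (the node has children) — both can hold at once, e.g. "中国" when the dictionary also contains "中国人". A self-contained sketch of the same walk over a HashMap-based trie; the class names are illustrative stand-ins, not the plugin's.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: DictSegment-style trie lookup. A query can be a full word
// (MATCH), a prefix of longer words (PREFIX), both, or neither.
public class TrieSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int nodeState; // 1 = path from the root to here is a word
    }

    final Node root = new Node();

    // Like DictSegment.fillSegment(..., enabled = 1).
    void fill(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.nodeState = 1;
    }

    /** Returns "MATCH", "PREFIX", "MATCH+PREFIX" or "UNMATCH". */
    String match(String s) {
        Node n = root;
        for (char c : s.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return "UNMATCH"; // STEP3: no branch for this char
        }
        boolean match = n.nodeState == 1;       // complete word
        boolean prefix = !n.children.isEmpty(); // hasNextNode()
        if (match && prefix) return "MATCH+PREFIX";
        if (match) return "MATCH";
        if (prefix) return "PREFIX";
        return "UNMATCH";
    }

    public static void main(String[] args) {
        TrieSketch t = new TrieSketch();
        t.fill("中国");
        t.fill("中国人");
        System.out.println(t.match("中"));     // → PREFIX
        System.out.println(t.match("中国"));   // → MATCH+PREFIX
        System.out.println(t.match("中国人")); // → MATCH
        System.out.println(t.match("中外"));   // → UNMATCH
    }
}
```

DictSegment additionally switches each node's child storage between a small sorted array (up to ARRAY_LENGTH_LIMIT = 3 entries, searched with Arrays.binarySearch) and a HashMap once a node fans out further; the sketch uses a HashMap throughout for brevity.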
src/main/java/org/wltea/analyzer/dic/Dictionary.java
View file @ 7d91be50
...
@@ -47,15 +47,15 @@ public class Dictionary {
 		logger = Loggers.getLogger("ik-analyzer");
 	}
 
 	public Configuration getConfig(){
 		return configuration;
 	}
 
-	public void Init(Settings settings){
-		// logger.info("[Init Setting] {}",settings.getAsMap().toString());
+	public void Init(Settings indexSettings){
 		if(!dictInited){
-			environment = new Environment(settings);
-			configuration = new Configuration(settings);
+			environment = new Environment(indexSettings);
+			configuration = new Configuration(indexSettings);
 			loadMainDict();
 			loadSurnameDict();
 			loadQuantifierDict();
...
@@ -71,16 +71,6 @@ public class Dictionary {
 		File file = new File(environment.configFile(), Dictionary.PATH_DIC_MAIN);
-		// logger.info("[Main Dict Loading] {}",file.getAbsolutePath());
-		// logger.info("[Environment] {}",environment.homeFile());
-		// logger.info("[Environment] {}",environment.workFile());
-		// logger.info("[Environment] {}",environment.workWithClusterFile());
-		// logger.info("[Environment] {}",environment.dataFiles());
-		// logger.info("[Environment] {}",environment.dataWithClusterFiles());
-		// logger.info("[Environment] {}",environment.configFile());
-		// logger.info("[Environment] {}",environment.pluginsFile());
-		// logger.info("[Environment] {}",environment.logsFile());
 		InputStream is = null;
 		try {
 			is = new FileInputStream(file);
...
@@ -142,7 +132,7 @@ public class Dictionary {
 			if (theWord != null && !"".equals(theWord.trim())) {
-				_MainDict.fillSegment(theWord.trim().toCharArray());
+				_MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
 			}
 		} while (theWord != null);
 		logger.info("[Dict Loading] {},MainDict Size:{}", tempFile.toString(), _MainDict.getDicNum());
...
src/main/java/org/wltea/analyzer/dic/Hit.java
View file @ 7d91be50
/**
 * IK Chinese word segmenter, version 5.0
 * IK Analyzer release 5.0
 * (Apache License, Version 2.0 notice omitted here; identical to the header above)
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 */
package org.wltea.analyzer.dic;

/**
 * Represents one hit during a dictionary match
 */
public class Hit {
    //no match
    private static final int UNMATCH = 0x00000000;
    //full match
    private static final int MATCH = 0x00000001;
    //prefix match
    private static final int PREFIX = 0x00000010;

    //current state of this hit, unmatched by default
    private int hitState = UNMATCH;

    //the dictionary trie node reached so far during the match
    private DictSegment matchedDictSegment;
    /*
     * begin position of the matched segment
     */
    private int begin;
    /*
     * end position of the matched segment
     */
    private int end;

    /**
     * Whether this hit is a full match
     */
    public boolean isMatch() {
        return (this.hitState & MATCH) > 0;
    }
...
@@ -32,6 +63,9 @@ public class Hit {
        this.hitState = this.hitState | MATCH;
    }

    /**
     * Whether this hit is a prefix of a word
     */
    public boolean isPrefix() {
        return (this.hitState & PREFIX) > 0;
    }
...
@@ -39,7 +73,9 @@ public class Hit {
    public void setPrefix() {
        this.hitState = this.hitState | PREFIX;
    }

    /**
     * Whether this hit is no match at all
     */
    public boolean isUnmatch() {
        return this.hitState == UNMATCH;
    }
...
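Hit's state is a bit field: MATCH is 0x00000001 and PREFIX is 0x00000010, distinct bits, so setMatch() and setPrefix() can both be OR-ed into hitState and queried independently, while UNMATCH is simply "no bits set". A sketch with the same constants; the class name is hypothetical.

```java
// Sketch: the Hit bit-field state machine. The constants mirror
// org.wltea.analyzer.dic.Hit; HitStateSketch itself is illustrative.
public class HitStateSketch {
    static final int UNMATCH = 0x00000000;
    static final int MATCH   = 0x00000001;
    static final int PREFIX  = 0x00000010; // a distinct bit (16), not a typo

    int hitState = UNMATCH;

    void setMatch()  { hitState |= MATCH; }
    void setPrefix() { hitState |= PREFIX; }
    boolean isMatch()   { return (hitState & MATCH) > 0; }
    boolean isPrefix()  { return (hitState & PREFIX) > 0; }
    boolean isUnmatch() { return hitState == UNMATCH; }

    public static void main(String[] args) {
        HitStateSketch h = new HitStateSketch();
        h.setMatch();
        h.setPrefix(); // a word that is also a prefix of longer words
        System.out.println(h.isMatch() + " " + h.isPrefix() + " " + h.isUnmatch());
        // → true true false
    }
}
```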
src/main/java/org/wltea/analyzer/lucene/IKAnalyzer.java
View file @ 7d91be50
...
@@ -13,8 +13,9 @@ import java.io.Reader;
 public final class IKAnalyzer extends Analyzer {
 
-	private boolean isMaxWordLength = false;
+	private boolean useSmart = false;
 
 	public IKAnalyzer(){
 		this(false);
 	}
...
@@ -24,14 +25,19 @@ public final class IKAnalyzer extends Analyzer {
 		this.setMaxWordLength(isMaxWordLength);
 	}
 
-	public IKAnalyzer(Settings settings) {
-		Dictionary.getInstance().Init(settings);
+	public IKAnalyzer(Settings indexSetting, Settings settings1) {
+		super();
+		Dictionary.getInstance().Init(indexSetting);
+		if(settings1.get("use_smart", "true").equals("true")){
+			useSmart = true;
+		}
 	}
 
 	@Override
 	public TokenStream tokenStream(String fieldName, Reader reader) {
-		return new IKTokenizer(reader, isMaxWordLength());
+		return new IKTokenizer(reader, useSmart);
 	}
 
 	public void setMaxWordLength(boolean isMaxWordLength) {
...
src/main/java/org/wltea/analyzer/lucene/IKQueryParser.java
已删除
100644 → 0
浏览文件 @
a6ed160a
```java
/**
 * Query parser built on top of IK segmentation.
 */
package org.wltea.analyzer.lucene;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanClause.Occur;
import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;

public final class IKQueryParser {

	// Per-thread cache of parsed keyword branches
	private static ThreadLocal<Map<String, TokenBranch>> keywordCacheThreadLocal
			= new ThreadLocal<Map<String, TokenBranch>>();

	private static boolean isMaxWordLength = false;

	public static void setMaxWordLength(boolean isMaxWordLength) {
		IKQueryParser.isMaxWordLength = isMaxWordLength;
	}

	private static Query optimizeQueries(List<Query> queries) {
		if (queries.size() == 0) {
			return null;
		} else if (queries.size() == 1) {
			return queries.get(0);
		} else {
			BooleanQuery mustQueries = new BooleanQuery();
			for (Query q : queries) {
				mustQueries.add(q, Occur.MUST);
			}
			return mustQueries;
		}
	}

	private static Map<String, TokenBranch> getTheadLocalCache() {
		Map<String, TokenBranch> keywordCache = keywordCacheThreadLocal.get();
		if (keywordCache == null) {
			keywordCache = new HashMap<String, TokenBranch>(4);
			keywordCacheThreadLocal.set(keywordCache);
		}
		return keywordCache;
	}

	private static TokenBranch getCachedTokenBranch(String query) {
		Map<String, TokenBranch> keywordCache = getTheadLocalCache();
		return keywordCache.get(query);
	}

	private static void cachedTokenBranch(String query, TokenBranch tb) {
		Map<String, TokenBranch> keywordCache = getTheadLocalCache();
		keywordCache.put(query, tb);
	}

	private static Query _parse(String field, String query) throws IOException {
		if (field == null) {
			throw new IllegalArgumentException("parameter \"field\" is null");
		}
		if (query == null || "".equals(query.trim())) {
			return new TermQuery(new Term(field));
		}
		TokenBranch root = getCachedTokenBranch(query);
		if (root != null) {
			return optimizeQueries(root.toQueries(field));
		} else {
			root = new TokenBranch(null);
			StringReader input = new StringReader(query.trim());
			IKSegmentation ikSeg = new IKSegmentation(input, isMaxWordLength);
			for (Lexeme lexeme = ikSeg.next(); lexeme != null; lexeme = ikSeg.next()) {
				root.accept(lexeme);
			}
			cachedTokenBranch(query, root);
			return optimizeQueries(root.toQueries(field));
		}
	}

	public static Query parse(String field, String query) throws IOException {
		if (field == null) {
			throw new IllegalArgumentException("parameter \"field\" is null");
		}
		String[] qParts = query.split("\\s");
		if (qParts.length > 1) {
			BooleanQuery resultQuery = new BooleanQuery();
			for (String q : qParts) {
				if ("".equals(q)) {
					continue;
				}
				Query partQuery = _parse(field, q);
				if (partQuery != null
						&& (!(partQuery instanceof BooleanQuery)
							|| ((BooleanQuery) partQuery).getClauses().length > 0)) {
					resultQuery.add(partQuery, Occur.SHOULD);
				}
			}
			return resultQuery;
		} else {
			return _parse(field, query);
		}
	}

	public static Query parseMultiField(String[] fields, String query) throws IOException {
		if (fields == null) {
			throw new IllegalArgumentException("parameter \"fields\" is null");
		}
		BooleanQuery resultQuery = new BooleanQuery();
		for (String field : fields) {
			if (field != null) {
				Query partQuery = parse(field, query);
				if (partQuery != null
						&& (!(partQuery instanceof BooleanQuery)
							|| ((BooleanQuery) partQuery).getClauses().length > 0)) {
					resultQuery.add(partQuery, Occur.SHOULD);
				}
			}
		}
		return resultQuery;
	}

	public static Query parseMultiField(String[] fields, String query,
			BooleanClause.Occur[] flags) throws IOException {
		if (fields == null) {
			throw new IllegalArgumentException("parameter \"fields\" is null");
		}
		if (flags == null) {
			throw new IllegalArgumentException("parameter \"flags\" is null");
		}
		if (flags.length != fields.length) {
			throw new IllegalArgumentException("flags.length != fields.length");
		}
		BooleanQuery resultQuery = new BooleanQuery();
		for (int i = 0; i < fields.length; i++) {
			if (fields[i] != null) {
				Query partQuery = parse(fields[i], query);
				if (partQuery != null
						&& (!(partQuery instanceof BooleanQuery)
							|| ((BooleanQuery) partQuery).getClauses().length > 0)) {
					resultQuery.add(partQuery, flags[i]);
				}
			}
		}
		return resultQuery;
	}

	public static Query parseMultiField(String[] fields, String[] queries) throws IOException {
		if (fields == null) {
			throw new IllegalArgumentException("parameter \"fields\" is null");
		}
		if (queries == null) {
			throw new IllegalArgumentException("parameter \"queries\" is null");
		}
		if (queries.length != fields.length) {
			throw new IllegalArgumentException("queries.length != fields.length");
		}
		BooleanQuery resultQuery = new BooleanQuery();
		for (int i = 0; i < fields.length; i++) {
			if (fields[i] != null) {
				Query partQuery = parse(fields[i], queries[i]);
				if (partQuery != null
						&& (!(partQuery instanceof BooleanQuery)
							|| ((BooleanQuery) partQuery).getClauses().length > 0)) {
					resultQuery.add(partQuery, Occur.SHOULD);
				}
			}
		}
		return resultQuery;
	}

	public static Query parseMultiField(String[] fields, String[] queries,
			BooleanClause.Occur[] flags) throws IOException {
		if (fields == null) {
			throw new IllegalArgumentException("parameter \"fields\" is null");
		}
		if (queries == null) {
			throw new IllegalArgumentException("parameter \"queries\" is null");
		}
		if (flags == null) {
			throw new IllegalArgumentException("parameter \"flags\" is null");
		}
		if (!(queries.length == fields.length && queries.length == flags.length)) {
			throw new IllegalArgumentException("queries, fields, and flags arrays have different length");
		}
		BooleanQuery resultQuery = new BooleanQuery();
		for (int i = 0; i < fields.length; i++) {
			if (fields[i] != null) {
				Query partQuery = parse(fields[i], queries[i]);
				if (partQuery != null
						&& (!(partQuery instanceof BooleanQuery)
							|| ((BooleanQuery) partQuery).getClauses().length > 0)) {
					resultQuery.add(partQuery, flags[i]);
				}
			}
		}
		return resultQuery;
	}

	private static class TokenBranch {

		private static final int REFUSED = -1;
		private static final int ACCEPTED = 0;
		private static final int TONEXT = 1;

		private int leftBorder;
		private int rightBorder;
		private Lexeme lexeme;
		private List<TokenBranch> acceptedBranchs;
		private TokenBranch nextBranch;

		TokenBranch(Lexeme lexeme) {
			if (lexeme != null) {
				this.lexeme = lexeme;
				this.leftBorder = lexeme.getBeginPosition();
				this.rightBorder = lexeme.getEndPosition();
			}
		}

		public int getLeftBorder() {
			return leftBorder;
		}

		public int getRightBorder() {
			return rightBorder;
		}

		public Lexeme getLexeme() {
			return lexeme;
		}

		public List<TokenBranch> getAcceptedBranchs() {
			return acceptedBranchs;
		}

		public TokenBranch getNextBranch() {
			return nextBranch;
		}

		public int hashCode() {
			if (this.lexeme == null) {
				return 0;
			} else {
				return this.lexeme.hashCode() * 37;
			}
		}

		public boolean equals(Object o) {
			if (o == null) {
				return false;
			}
			if (this == o) {
				return true;
			}
			if (o instanceof TokenBranch) {
				TokenBranch other = (TokenBranch) o;
				if (this.lexeme == null || other.getLexeme() == null) {
					return false;
				} else {
					return this.lexeme.equals(other.getLexeme());
				}
			} else {
				return false;
			}
		}

		boolean accept(Lexeme _lexeme) {
			/*
			 * Check how this branch can take the new lexeme:
			 * acceptType : REFUSED  - cannot be accepted
			 * acceptType : ACCEPTED - accepted by this branch
			 * acceptType : TONEXT   - handed to the adjacent branch
			 */
			int acceptType = checkAccept(_lexeme);
			switch (acceptType) {
			case REFUSED:
				return false;
			case ACCEPTED:
				if (acceptedBranchs == null) {
					acceptedBranchs = new ArrayList<TokenBranch>(2);
					acceptedBranchs.add(new TokenBranch(_lexeme));
				} else {
					boolean acceptedByChild = false;
					for (TokenBranch childBranch : acceptedBranchs) {
						acceptedByChild = childBranch.accept(_lexeme) || acceptedByChild;
					}
					if (!acceptedByChild) {
						acceptedBranchs.add(new TokenBranch(_lexeme));
					}
				}
				if (_lexeme.getEndPosition() > this.rightBorder) {
					this.rightBorder = _lexeme.getEndPosition();
				}
				break;
			case TONEXT:
				if (this.nextBranch == null) {
					this.nextBranch = new TokenBranch(null);
				}
				this.nextBranch.accept(_lexeme);
				break;
			}
			return true;
		}

		List<Query> toQueries(String fieldName) {
			List<Query> queries = new ArrayList<Query>(1);
			if (lexeme != null) {
				queries.add(new TermQuery(new Term(fieldName, lexeme.getLexemeText())));
			}
			if (acceptedBranchs != null && acceptedBranchs.size() > 0) {
				if (acceptedBranchs.size() == 1) {
					Query onlyOneQuery = optimizeQueries(acceptedBranchs.get(0).toQueries(fieldName));
					if (onlyOneQuery != null) {
						queries.add(onlyOneQuery);
					}
				} else {
					BooleanQuery orQuery = new BooleanQuery();
					for (TokenBranch childBranch : acceptedBranchs) {
						Query childQuery = optimizeQueries(childBranch.toQueries(fieldName));
						if (childQuery != null) {
							orQuery.add(childQuery, Occur.SHOULD);
						}
					}
					if (orQuery.getClauses().length > 0) {
						queries.add(orQuery);
					}
				}
			}
			if (nextBranch != null) {
				queries.addAll(nextBranch.toQueries(fieldName));
			}
			return queries;
		}

		private int checkAccept(Lexeme _lexeme) {
			int acceptType = 0;
			if (_lexeme == null) {
				throw new IllegalArgumentException("parameter:lexeme is null");
			}
			if (null == this.lexeme) {
				if (this.rightBorder > 0 && _lexeme.getBeginPosition() >= this.rightBorder) {
					acceptType = TONEXT;
				} else {
					acceptType = ACCEPTED;
				}
			} else {
				if (_lexeme.getBeginPosition() < this.lexeme.getBeginPosition()) {
					acceptType = REFUSED;
				} else if (_lexeme.getBeginPosition() >= this.lexeme.getBeginPosition()
						&& _lexeme.getBeginPosition() < this.lexeme.getEndPosition()) {
					acceptType = REFUSED;
				} else if (_lexeme.getBeginPosition() >= this.lexeme.getEndPosition()
						&& _lexeme.getBeginPosition() < this.rightBorder) {
					acceptType = ACCEPTED;
				} else {
					acceptType = TONEXT;
				}
			}
			return acceptType;
		}
	}
}
```
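The deleted parser's core decision is in `checkAccept`, which reduces to an interval test on lexeme begin positions: a lexeme starting before, or overlapping, the branch's own lexeme is refused; one starting inside the branch's accumulated right border is accepted as a child; anything past the border is handed to the next branch. A standalone sketch of that decision with positions as plain ints (the helper name and signature are illustrative, not from the original):

```java
public class AcceptSketch {
    static final int REFUSED = -1, ACCEPTED = 0, TONEXT = 1;

    /**
     * Decide how a branch holding lexeme [lexBegin, lexEnd) with overall
     * right border rightBorder reacts to a new lexeme starting at newBegin.
     */
    static int checkAccept(int lexBegin, int lexEnd, int rightBorder, int newBegin) {
        if (newBegin < lexBegin) {
            return REFUSED;          // starts before this branch's lexeme
        } else if (newBegin < lexEnd) {
            return REFUSED;          // overlaps the branch's own lexeme
        } else if (newBegin < rightBorder) {
            return ACCEPTED;         // still fits under this branch
        } else {
            return TONEXT;           // beyond the border: pass to next branch
        }
    }

    public static void main(String[] args) {
        // Branch lexeme covers [0, 2); children have extended the border to 4
        System.out.println(checkAccept(0, 2, 4, 1)); // -1 REFUSED (overlap)
        System.out.println(checkAccept(0, 2, 4, 2)); //  0 ACCEPTED
        System.out.println(checkAccept(0, 2, 4, 5)); //  1 TONEXT
    }
}
```

This is why overlapping segmentations of the same span end up as SHOULD clauses under one branch while disjoint spans chain into MUST clauses via `nextBranch`.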
src/main/java/org/wltea/analyzer/lucene/IKSimilarity.java
deleted (100644 → 0)
```java
/**
 * 
 */
package org.wltea.analyzer.lucene;

import org.apache.lucene.search.DefaultSimilarity;

public class IKSimilarity extends DefaultSimilarity {

	private static final long serialVersionUID = 7558565500061194774L;

	/**
	 * Coord factor grows exponentially with the number of matched
	 * query terms: 2^overlap / 2^maxOverlap.
	 */
	public float coord(int overlap, int maxOverlap) {
		float overlap2 = (float) Math.pow(2, overlap);
		float maxOverlap2 = (float) Math.pow(2, maxOverlap);
		return (overlap2 / maxOverlap2);
	}
}
```
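The override replaces Lucene's default linear coord factor (roughly overlap / maxOverlap) with 2^overlap / 2^maxOverlap = 2^(overlap − maxOverlap), which penalizes documents matching only a few query terms much more sharply. The arithmetic in isolation:

```java
public class CoordSketch {
    // Same formula as IKSimilarity.coord
    static float coord(int overlap, int maxOverlap) {
        float overlap2 = (float) Math.pow(2, overlap);
        float maxOverlap2 = (float) Math.pow(2, maxOverlap);
        return overlap2 / maxOverlap2;
    }

    public static void main(String[] args) {
        // With 3 query terms, matching 1, 2, or all 3 of them:
        System.out.println(coord(1, 3)); // 0.25
        System.out.println(coord(2, 3)); // 0.5
        System.out.println(coord(3, 3)); // 1.0
        // The default linear coord would give 0.333, 0.667, 1.0 instead
    }
}
```

Halving the score for each missing term strongly favors documents that cover the whole query, which suits IK's term-per-lexeme query construction.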
src/main/java/org/wltea/analyzer/lucene/IKTokenizer.java (diff collapsed)
src/main/java/org/wltea/analyzer/query/IKQueryExpressionParser.java (added, 0 → 100644; diff collapsed)
src/main/java/org/wltea/analyzer/query/SWMCQueryBuilder.java (added, 0 → 100644)
///**
// * IK Chinese word segmentation, version 5.0
// * IK Analyzer release 5.0
// *
// * Licensed to the Apache Software Foundation (ASF) under one or more
// * contributor license agreements. See the NOTICE file distributed with
// * this work for additional information regarding copyright ownership.
// * The ASF licenses this file to You under the Apache License, Version 2.0
// * (the "License"); you may not use this file except in compliance with
// * the License. You may obtain a copy of the License at
// *
// * http://www.apache.org/licenses/LICENSE-2.0
// *
// * Unless required by applicable law or agreed to in writing, software
// * distributed under the License is distributed on an "AS IS" BASIS,
// * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// * See the License for the specific language governing permissions and
// * limitations under the License.
// *
// * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
// * Copyright 2012, Oolong Studio
// * provided by Linliangyi and copyright 2012 by Oolong studio
// *
// */
//package org.wltea.analyzer.query;
//
//import java.io.IOException;
//import java.io.StringReader;
//import java.util.ArrayList;
//import java.util.List;
//
//import org.apache.lucene.analysis.standard.StandardAnalyzer;
//import org.apache.lucene.queryparser.classic.ParseException;
//import org.apache.lucene.queryparser.classic.QueryParser;
//import org.apache.lucene.search.Query;
//import org.apache.lucene.util.Version;
//import org.wltea.analyzer.core.IKSegmenter;
//import org.wltea.analyzer.core.Lexeme;
//
///**
// * Single Word Multi Char Query Builder
// * Dedicated to the IK segmentation algorithm
// * @author linliangyi
// *
// */
//public class SWMCQueryBuilder {
//
// /**
// * Build the SWMC query
// * @param fieldName
// * @param keywords
// * @param quickMode
// * @return Lucene Query
// */
// public static Query create(String fieldName ,String keywords , boolean quickMode){
// if(fieldName == null || keywords == null){
// throw new IllegalArgumentException("parameters fieldName and keywords must not be null.");
// }
// //1. tokenize the keywords
// List<Lexeme> lexemes = doAnalyze(keywords);
// //2. build the SWMC query from the tokenization result
// Query _SWMCQuery = getSWMCQuery(fieldName , lexemes , quickMode);
// return _SWMCQuery;
// }
//
// /**
// * Tokenize the input and return the lexeme list
// * @param keywords
// * @return
// */
// private static List<Lexeme> doAnalyze(String keywords){
// List<Lexeme> lexemes = new ArrayList<Lexeme>();
// IKSegmenter ikSeg = new IKSegmenter(new StringReader(keywords) , true);
// try{
// Lexeme l = null;
// while( (l = ikSeg.next()) != null){
// lexemes.add(l);
// }
// }catch(IOException e){
// e.printStackTrace();
// }
// return lexemes;
// }
//
//
// /**
// * Build the SWMC search from the tokenization result
// * @param fieldName
// * @param lexemes
// * @param quickMode
// * @return
// */
// private static Query getSWMCQuery(String fieldName , List<Lexeme> lexemes , boolean quickMode){
// //full SWMC query expression
// StringBuffer keywordBuffer = new StringBuffer();
// //compact SWMC query expression
// StringBuffer keywordBuffer_Short = new StringBuffer();
// //length of the previous lexeme
// int lastLexemeLength = 0;
// //end position of the previous lexeme
// int lastLexemeEnd = -1;
//
// int shortCount = 0;
// int totalCount = 0;
// for(Lexeme l : lexemes){
// totalCount += l.getLength();
// //compact expression
// if(l.getLength() > 1){
// keywordBuffer_Short.append(' ').append(l.getLexemeText());
// shortCount += l.getLength();
// }
//
// if(lastLexemeLength == 0){
// keywordBuffer.append(l.getLexemeText());
// }else if(lastLexemeLength == 1 && l.getLength() == 1
// && lastLexemeEnd == l.getBeginPosition()){//adjacent single chars of length 1: merge
// keywordBuffer.append(l.getLexemeText());
// }else{
// keywordBuffer.append(' ').append(l.getLexemeText());
//
// }
// lastLexemeLength = l.getLength();
// lastLexemeEnd = l.getEndPosition();
// }
//
// //use the Lucene QueryParser to build the SWMC query
// QueryParser qp = new QueryParser(Version.LUCENE_40, fieldName, new StandardAnalyzer(Version.LUCENE_40));
// qp.setDefaultOperator(QueryParser.AND_OPERATOR);
// qp.setAutoGeneratePhraseQueries(true);
//
// if(quickMode && (shortCount * 1.0f / totalCount) > 0.5f){
// try {
// //System.out.println(keywordBuffer.toString());
// Query q = qp.parse(keywordBuffer_Short.toString());
// return q;
// } catch (ParseException e) {
// e.printStackTrace();
// }
//
// }else{
// if(keywordBuffer.length() > 0){
// try {
// //System.out.println(keywordBuffer.toString());
// Query q = qp.parse(keywordBuffer.toString());
// return q;
// } catch (ParseException e) {
// e.printStackTrace();
// }
// }
// }
// return null;
// }
//}
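The loop in `getSWMCQuery` joins lexemes with spaces but glues together runs of adjacent single-character lexemes, so consecutive unmatched single characters form one phrase term. That merge rule can be sketched with simple (text, begin, end) triples standing in for `Lexeme`; the `Tok` record and `build` helper are illustrative, not part of the original class:

```java
import java.util.Arrays;
import java.util.List;

public class MergeSketch {
    record Tok(String text, int begin, int end) {
        int length() { return end - begin; }
    }

    // Mirrors the buffer-building rule: space-separate tokens, but append a
    // single char directly when it immediately follows another single char.
    static String build(List<Tok> toks) {
        StringBuilder sb = new StringBuilder();
        int lastLen = 0, lastEnd = -1;
        for (Tok t : toks) {
            if (lastLen == 0) {
                sb.append(t.text());                      // first token
            } else if (lastLen == 1 && t.length() == 1 && lastEnd == t.begin()) {
                sb.append(t.text());                      // adjacent single chars: merge
            } else {
                sb.append(' ').append(t.text());
            }
            lastLen = t.length();
            lastEnd = t.end();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
                new Tok("中国", 0, 2),  // two-char word
                new Tok("人", 2, 3),    // single chars at adjacent positions
                new Tok("民", 3, 4));
        System.out.println(build(toks)); // "中国 人民"
    }
}
```

Merging keeps the query parser (run with `AND` operator and auto phrase queries) from emitting one clause per stray character, which would make the SWMC query needlessly strict.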
src/main/java/org/wltea/analyzer/sample/IKAnalzyerDemo.java (added, 0 → 100644; diff collapsed)
src/main/java/org/wltea/analyzer/sample/LuceneIndexAndSearchDemo.java (added, 0 → 100644; diff collapsed)
src/main/java/org/wltea/analyzer/seg/CJKSegmenter.java (deleted, 100644 → 0; diff collapsed)
src/main/java/org/wltea/analyzer/seg/ISegmenter.java (deleted, 100644 → 0; diff collapsed)
src/main/java/org/wltea/analyzer/seg/LetterSegmenter.java (deleted, 100644 → 0; diff collapsed)
src/main/java/org/wltea/analyzer/seg/QuantifierSegmenter.java (deleted, 100644 → 0; diff collapsed)
src/test/java/DictionaryTester.java (diff collapsed)
src/test/java/IKAnalyzerDemo.java (diff collapsed)
src/test/java/IKTokenerTest.java (diff collapsed)
src/test/java/SegmentorTester.java (diff collapsed)
src/test/java/extended/ik_dict/ext_stopwords/ext_stopword.dic (diff collapsed)