Commit 7d91be50 authored by weixin_43283383

update ik to latest version, make segmentation mode selectable

Parent a6ed160a
......@@ -19,10 +19,14 @@ In order to install the plugin, simply run:
<pre>
cd bin
plugin -install medcl/elasticsearch-analysis-ik/1.1.3
<del>plugin -install medcl/elasticsearch-analysis-ik/1.1.3</del>
</pre>
Also download the dict files and unzip them into your Elasticsearch config folder, e.g.: your-es-root/config/ik
<del>also download the dict files,unzip these dict file to your elasticsearch's config folder,such as: your-es-root/config/ik</del>
You can also download this plugin from the RTF project (https://github.com/medcl/elasticsearch-rtf):
https://github.com/medcl/elasticsearch-rtf/tree/master/elasticsearch/plugins/analysis-ik
https://github.com/medcl/elasticsearch-rtf/tree/master/elasticsearch/config/ik
<pre>
cd config
......@@ -62,12 +66,17 @@ index:
ik:
alias: [ik_analyzer]
type: org.elasticsearch.index.analysis.IkAnalyzerProvider
ik_smart:
type: ik
use_smart: true
</pre>
Or
<pre>
index.analysis.analyzer.ik.type : "ik"
</pre>
You can select your preferred segmentation mode; `use_smart` defaults to false (fine-grained segmentation).
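For example, once the index is configured as above, you can compare the two modes with the `_analyze` API (the index name and sample text here are only illustrative):
<pre>
curl 'http://localhost:9200/your_index/_analyze?analyzer=ik&text=中华人民共和国'
curl 'http://localhost:9200/your_index/_analyze?analyzer=ik_smart&text=中华人民共和国'
</pre>
The `ik` analyzer returns the fine-grained splits, while `ik_smart` returns the coarse-grained (smart) result.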
Mapping Configuration
-------------
......
一一列举
A股
B股
AB股
H股
K线
QQ宠物
QQ飞车
U盘
Hold住
一一列举
一一对应
一一道来
一丁
......@@ -5334,12 +5343,10 @@
不买
不买账
不乱
不了
不了不当
不了了之
不了情
不了而了
不了解
不予
不予承认
不予理睬
......@@ -10118,7 +10125,6 @@
个别辅导
个协
个唱
个大
个头
个头儿
个子
......@@ -13619,6 +13625,7 @@
乌龙
乌龙球
乌龙茶
乌龙茶工作室
乌龙院
乌龙驹
乌龟
......@@ -20471,6 +20478,7 @@
仕宦
仕进
仕途
他乡
他乡人
他乡异县
......@@ -21047,7 +21055,6 @@
以其
以其人之道
以其人之道还治其人之身
以其人之道,还治其人之身
以其昏昏
以其昏昏使人昭昭
以其真正形式付款
......@@ -21261,7 +21268,7 @@
以父之名
以牙还牙
以狸至鼠
狸致鼠、以冰致绳
以冰致绳
以狸饵鼠
以玉抵乌
以玉抵鹊
......@@ -24053,7 +24060,6 @@
住宅和
住宅小区
住宅布局
住宅建筑企划委员会
住宅建设
住宅房
住宅楼
......@@ -25055,6 +25061,7 @@
佞笑
佞臣
佟湘玉
你一言
你一言我一语
你中有我
......@@ -26323,7 +26330,6 @@
保卫人员
保卫和平
保卫国家
保卫国家主权和民族资源
保卫处
保卫工作
保卫战
......@@ -27709,7 +27715,6 @@
倜傥不羁
倜傥不群
借东风
借东风丧偶案犯护
借个
借个火
借书
......@@ -28560,6 +28565,7 @@
偕生之疾
偕老
偕行
做的
做一个
做一天和尚撞一天钟
做一套
......@@ -31887,7 +31893,6 @@
全彩屏
全心
全心全意
全心全意为人民服务
全心投入
全总
全息
......@@ -32209,7 +32214,6 @@
全面推行
全面提高
全面禁止
全面禁止和彻底销毁核武器
全面继承
全面落实
全面规划
......@@ -32984,7 +32988,6 @@
公测
公测版
公海
公海海底海床和平利用特别委员会
公海自由
公演
公然
......@@ -40772,6 +40775,7 @@
分香卖履
分驾
分龄
切上
切上去
切上来
......@@ -40781,6 +40785,8 @@
切不
切不可
切丝
切的
切得
切个
切中
切中时弊
......@@ -43344,7 +43350,6 @@
前事
前事不忘
前事不忘后事之师
前事不忘,后事之师
前五强
前些
前些天
......@@ -45840,7 +45845,6 @@
劳动厅
劳动合同
劳动和社会保障部
劳动和社会保障部部长
劳动地域分工
劳动基准
劳动基准法
......@@ -46696,7 +46700,6 @@
化学农药
化学分子
化学分析
化学分析电子能电子能谱谱学
化学剂
化学剂注入组块
化学剥蚀
......@@ -46963,7 +46966,6 @@
北京市
北京市区
北京市委
北京市新方世纪科技有限公司
北京市民
北京师范大学
北京房
......@@ -47429,7 +47431,6 @@
区分效度
区分法
区分符
区分能力倾向测验
区划
区划图
区别
......@@ -47995,22 +47996,6 @@
十员
十周
十周年
十四
十四个
十四中
十四人
十四元
十四分
十四号
十四块
十四大
十四天
十四届
十四岁
十四日
十四时
十四行
十四行诗
十回
十团
十围五攻
......@@ -48121,7 +48106,6 @@
十年教训
十年树木
十年树木百年树人
十年树木,百年树人
十年浩劫
十年生聚
十年生聚十年教训
......@@ -63598,7 +63582,6 @@
和暖
和曲
和服
和服务
和村
和林格尔
和林格尔县
......@@ -64478,7 +64461,7 @@
哑巴吃黄
哑巴吃黄莲
哑巴吃黄连
哑巴吃黄连有苦说不出
哑巴吃黄连有苦说不出
哑弹
哑梢公
哑火
......@@ -67026,7 +67009,6 @@
四出
四出戏
四出活动
四分
四分之一
四分之一波长变换器
四分之三
......@@ -67036,7 +67018,6 @@
四分五落
四分五裂
四分天下
四分开
四分法
四分钟
四分音符
......@@ -67048,34 +67029,7 @@
四化建设
四匹
四区
四十
四十一
四十一中
四十七
四十七中
四十万
四十三
四十三中
四十不惑
四十中
四十九
四十九中
四十二
四十二中
四十五
四十五中
四十八
四十八中
四十六
四十六中
四十四
四十四中
四千
四千万
四千个
四千人
四千元
四千块
四叔
四叠体
四口
......@@ -69139,7 +69093,6 @@
国防科
国防科学技术
国防科学技术委员会
国防科学技术工业委员
国防科学技术工业委员会
国防科工委
国防科技
......@@ -70263,10 +70216,6 @@
圣驾
圣骑士
圣龙魔袍
在一定历史条件下
在一定程度上
在一定范围内
在一般情况下
在一起
在一边
在三
......@@ -81739,13 +81688,11 @@
奸邪
奸险
奸雄
她上去
她上来
她下
她下去
她下来
她不
她不会
她不是
她与
......@@ -98328,7 +98275,6 @@
平可夫
平台
平台梁
平台-海岸无线电系统
平和
平和县
平喉
......@@ -111772,18 +111718,11 @@
意表
意见
意见书
意见分歧
意见反馈
意见建议
意见沟通
意见箱
意见簿
意见调查
意见调查表
意识
意识到
意识形态
意识形态领域
意识流
意译
意谓
......@@ -113117,6 +113056,7 @@
成鱼
成龙
成龙配套
我为人人
我为你
我为歌狂
......@@ -114159,6 +114099,7 @@
扉用
扉画
扉页
手三里
手上
手下
......@@ -114806,7 +114747,6 @@
打保票
打信号
打倒
打倒日本帝国主义
打假
打先锋
打光
......@@ -116630,7 +116570,6 @@
承建
承建商
承建方
承建项目
承当
承德
承德县
......@@ -116645,13 +116584,7 @@
承担
承担义务
承担人
承担责任
承担费用
承担违约赔偿责任
承担重任
承担风险
承接
承接国内外
承揽
承教
承星履草
......@@ -124773,7 +124706,6 @@
提供
提供优良服务
提供优质服务
提供午餐的走读学生
提供商
提供情报
提供援助
......@@ -124987,30 +124919,8 @@
提领
提高
提高了
提高产品质量
提高产量
提高到
提高到一个新的阶段
提高到新的阶段
提高劳动效率
提高劳动生产率
提高单位面积产量
提高工作效率
提高技术
提高效率
提高效益
提高水平
提高班
提高生产率
提高生活水平
提高素质
提高经济效益
提高经济效益为中心
提高自学
提高觉悟
提高警惕
提高认识
提高质量
插一杠子
插一脚
插上
......@@ -125029,12 +124939,9 @@
插值性质
插值逼近
插入
插入序列
插入式注水泥接箍
插入损耗
插入排序
插入方式
插入方法
插入法
插入物
插入者
......@@ -126280,7 +126187,6 @@
摩尔气体常数
摩尔热容
摩尔维亚
摩尔质量排除极限
摩尔达维亚
摩崖
摩弄
......@@ -130873,27 +130779,10 @@
文代会
文以载道
文件
文件事件
文件传输
文件传送、存取和管理
文件名
文件名扩展
文件名称
文件大小
文件夹
文件存储器
文件属性
文件批量
文件服务器
文件柜
文件格式
文件汇编
文件类型
文件精神
文件系统
文件组织
文件维护
文件翻译
文件袋
文传
文似其人
......@@ -132227,11 +132116,9 @@
新一佳
新一季
新一届
新一届中央领导集体
新一期
新一波
新一轮
新一轮军备竞赛
新一集
新丁
新三样
......@@ -132241,7 +132128,6 @@
新世界论坛
新世纪
新世纪福音战士
新世纪通行证
新东
新东安
新东家
......@@ -132294,7 +132180,6 @@
新仙剑奇侠传
新任
新任务
新任国务院副总理
新会
新会区
新会县
......@@ -133655,7 +133540,7 @@
旁观者
旁观者效应
旁观者清
旁观者清,当事者迷
当事者迷
旁证
旁证博引
旁路
......@@ -134161,7 +134046,7 @@
无可否认
无可奈何
无可奈何花落去
无可奈何花落去似曾相似燕
似曾相似燕归来
无可奉告
无可如何
无可安慰
......@@ -135407,15 +135292,7 @@
日已三竿
日币
日常
日常事务
日常工作
日常支出
日常清洁卫生管理
日常生活
日常生活型
日常用品
日常用语
日常行为
日异月新
日异月更
日异月殊
......@@ -135515,7 +135392,6 @@
日本化
日本史
日本国
日本国际贸易促进会
日本天皇
日本女
日本妞
......@@ -140213,6 +140089,7 @@
月黑杀人
月黑风高
月龄
有一利必有一弊
有一得一
有一手
......@@ -141295,7 +141172,6 @@
望谟县
望远
望远镜
望都
望都县
望门
望门寡
......@@ -142559,7 +142435,6 @@
本省人
本真
本着
本着实事求是的原则
本社
本社讯
本神
......@@ -176021,7 +175896,6 @@
独桅
独桅艇
独此一家
独此一家别无分店
独步
独步一时
独步天下
......@@ -179466,7 +179340,6 @@
生产关系
生产分离器
生产力
生产力与生产关系
生产力布局
生产劳动
生产单位
......@@ -181082,7 +180955,6 @@
电子器件
电子器材
电子回旋共振加热
电子回旋共振加热化学专业词汇
电子图书
电子地图
电子城
......@@ -184948,6 +184820,7 @@
皂隶
皂靴
皂鞋
的一确二
的人
的卡
......@@ -187254,7 +187127,6 @@
省直辖县级行政单位
省直辖行政单位
省省
省福发股份有限公司
省科委
省称
省立
......@@ -190793,23 +190665,14 @@
确守信义
确守合同
确定
确定会
确定和随机佩特里网
确定型上下文有关语言
确定性
确定性反褶积
确定时间
确定是
确定有
确定能
确实
确实会
确实可靠
确实在
确实性
确实是
确实有
确实能
确属
确山
确山县
......@@ -198198,7 +198061,6 @@
第三关
第三册
第三军
第三十
第三卷
第三只
第三台
......@@ -198274,8 +198136,6 @@
第九城市
第九天
第九届
第九届人民代表大会
第九届全国人民代表大会
第九期
第九条
第九次
......@@ -198459,21 +198319,6 @@
第几章
第几节
第几课
第十
第十一
第十一届
第十七
第十三
第十个
第十个五年计划
第十九
第十二
第十二届
第十五
第十五次全国代表大会
第十位
第十八
第十六
第十册
第十卷
第十名
......@@ -198492,7 +198337,6 @@
第十轮
第十部
第十集
第号
第四
第四个
第四产业
......@@ -227177,6 +227021,7 @@
覆雨翻云
覆鹿寻蕉
覈实
见一面
见上图
见上帝
......@@ -227214,6 +227059,8 @@
见仁见志
见仁见智
见你
见他
见她
见信
见信好
见光
......@@ -231809,6 +231656,7 @@
诳诞
诳语
诳骗
说的
说一不二
说一些
说一声
......@@ -240547,7 +240395,6 @@
软件网
软件能
软件设计
软件资产管理程序
软件资源
软件超市
软件部
......@@ -242095,11 +241942,6 @@
达克罗
达克罗宁
达到
达到一个新的水平
达到历史最高水平
达到目标
达到顶点
达到高潮
达力达
达卡
达县
......@@ -246993,33 +246835,11 @@
通迅
通过
通过了
通过会议
通过信号机
通过决议
通过去
通过参观
通过商量
通过培养
通过培训
通过外交途径进行谈判
通过学习
通过实践
通过审查
通过批评
通过教育
通过来
通过率
通过考察
通过考核
通过考试
通过能力
通过表演
通过观察
通过讨论
通过训练
通过议案
通过调查
通过鉴定
通运
通运公司
通进
......@@ -247288,11 +247108,6 @@
造恶不悛
造成
造成了
造成危害
造成堕落
造成直接经济损失
造成真空
造扣
造斜工具
造斜点
造极登峰
......@@ -250565,11 +250380,6 @@
采去
采及葑菲
采取
采取不正当手段
采取不正当的手段
采取协调行动
采取多种形式
采取措施
采回
采回去
采回来
......@@ -250624,8 +250434,6 @@
采珠
采用
采用到
采用秘密窃取的手段
采用秘密窃取的方法
采石
采石厂
采石场
......@@ -250633,9 +250441,6 @@
采矿
采矿业
采矿工
采矿工业
采矿工程
采矿方法
采矿权
采矿点
采砂船
......@@ -264178,8 +263983,7 @@
面向农村
面向基层
面向对象分析
面向对象数据库语言
面向对象的体系结构
面向对象
面向市场
面向未来
面向现代化
......@@ -270421,7 +270225,6 @@
高举深藏
高举着
高举远蹈
高举邓小平理论的伟大旗帜
高义
高义薄云
高义薄云天
......@@ -39,6 +39,7 @@
分钟
分米
......@@ -58,6 +59,7 @@
厘米
......@@ -144,7 +146,6 @@
......@@ -198,6 +199,9 @@
毫升
毫米
毫克
......
......@@ -6,7 +6,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-analysis-ik</artifactId>
<version>1.1.3</version>
<version>1.1.4</version>
<packaging>jar</packaging>
<description>IK Analyzer for ElasticSearch</description>
<inceptionYear>2009</inceptionYear>
......@@ -72,6 +72,11 @@
<version>1.3.RC2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
</dependency>
</dependencies>
<build>
......
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.logging.ESLogger;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.lucene.IKAnalyzer;
import org.elasticsearch.common.logging.Loggers;
public class IkAnalyzerProvider extends AbstractIndexAnalyzerProvider<IKAnalyzer> {
private final IKAnalyzer analyzer;
......@@ -18,37 +15,19 @@ public class IkAnalyzerProvider extends AbstractIndexAnalyzerProvider<IKAnalyzer
@Inject
public IkAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
// logger = Loggers.getLogger("ik-analyzer");
//
// logger.info("[Setting] {}",settings.getAsMap().toString());
// logger.info("[Index Setting] {}",indexSettings.getAsMap().toString());
// logger.info("[Env Setting] {}",env.configFile());
analyzer=new IKAnalyzer(indexSettings);
analyzer=new IKAnalyzer(indexSettings,settings);
}
/* @Override
public String name() {
return "ik";
}
@Override
public AnalyzerScope scope() {
return AnalyzerScope.INDEX;
}*/
public IkAnalyzerProvider(Index index, Settings indexSettings, String name,
Settings settings) {
super(index, indexSettings, name, settings);
analyzer=new IKAnalyzer(indexSettings);
analyzer=new IKAnalyzer(indexSettings,settings);
}
public IkAnalyzerProvider(Index index, Settings indexSettings,
String prefixSettings, String name, Settings settings) {
super(index, indexSettings, prefixSettings, name, settings);
analyzer=new IKAnalyzer(indexSettings);
analyzer=new IKAnalyzer(indexSettings,settings);
}
......
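How the `use_smart` flag travels from these settings into the segmenter is outside this hunk; below is a minimal sketch of what the new two-argument `IKAnalyzer(indexSettings, settings)` constructor could do (the field name and setting key are assumptions, not the verified implementation):

public final class IKAnalyzer extends Analyzer {
    // assumed field, consumed later when the token stream builds an IKSegmenter
    private final boolean useSmart;

    public IKAnalyzer(Settings indexSettings, Settings settings) {
        // per-analyzer setting wins; fall back to fine-grained (false) otherwise
        this.useSmart = settings.getAsBoolean("use_smart", Boolean.FALSE);
    }
    // ... token stream creation omitted in this sketch
}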
package org.wltea.analyzer;
import java.util.HashSet;
import java.util.Set;
import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.seg.ISegmenter;
public class Context{
private boolean isMaxWordLength = false;
private int buffOffset;
private int available;
private int lastAnalyzed;
private int cursor;
private char[] segmentBuff;
private Set<ISegmenter> buffLocker;
private IKSortedLinkSet lexemeSet;
Context(char[] segmentBuff , boolean isMaxWordLength){
this.isMaxWordLength = isMaxWordLength;
this.segmentBuff = segmentBuff;
this.buffLocker = new HashSet<ISegmenter>(4);
this.lexemeSet = new IKSortedLinkSet();
}
public void resetContext(){
buffLocker.clear();
lexemeSet = new IKSortedLinkSet();
buffOffset = 0;
available = 0;
lastAnalyzed = 0;
cursor = 0;
}
public boolean isMaxWordLength() {
return isMaxWordLength;
}
public void setMaxWordLength(boolean isMaxWordLength) {
this.isMaxWordLength = isMaxWordLength;
}
public int getBuffOffset() {
return buffOffset;
}
public void setBuffOffset(int buffOffset) {
this.buffOffset = buffOffset;
}
public int getLastAnalyzed() {
return lastAnalyzed;
}
public void setLastAnalyzed(int lastAnalyzed) {
this.lastAnalyzed = lastAnalyzed;
}
public int getCursor() {
return cursor;
}
public void setCursor(int cursor) {
this.cursor = cursor;
}
public void lockBuffer(ISegmenter segmenter){
this.buffLocker.add(segmenter);
}
public void unlockBuffer(ISegmenter segmenter){
this.buffLocker.remove(segmenter);
}
public boolean isBufferLocked(){
return this.buffLocker.size() > 0;
}
public int getAvailable() {
return available;
}
public void setAvailable(int available) {
this.available = available;
}
public Lexeme firstLexeme() {
return this.lexemeSet.pollFirst();
}
public Lexeme lastLexeme() {
return this.lexemeSet.pollLast();
}
public void addLexeme(Lexeme lexeme){
if(!Dictionary.isStopWord(segmentBuff , lexeme.getBegin() , lexeme.getLength())){
this.lexemeSet.addLexeme(lexeme);
}
}
public int getResultSize(){
return this.lexemeSet.size();
}
public void excludeOverlap(){
this.lexemeSet.excludeOverlap();
}
private class IKSortedLinkSet{
private Lexeme head;
private Lexeme tail;
private int size;
private IKSortedLinkSet(){
this.size = 0;
}
private void addLexeme(Lexeme lexeme){
if(this.size == 0){
this.head = lexeme;
this.tail = lexeme;
this.size++;
return;
}else{
if(this.tail.compareTo(lexeme) == 0){
return;
}else if(this.tail.compareTo(lexeme) < 0){
this.tail.setNext(lexeme);
lexeme.setPrev(this.tail);
this.tail = lexeme;
this.size++;
return;
}else if(this.head.compareTo(lexeme) > 0){
this.head.setPrev(lexeme);
lexeme.setNext(this.head);
this.head = lexeme;
this.size++;
return;
}else{
Lexeme l = this.tail;
while(l != null && l.compareTo(lexeme) > 0){
l = l.getPrev();
}
if(l.compareTo(lexeme) == 0){
return;
}else if(l.compareTo(lexeme) < 0){
lexeme.setPrev(l);
lexeme.setNext(l.getNext());
l.getNext().setPrev(lexeme);
l.setNext(lexeme);
this.size++;
return;
}
}
}
}
private Lexeme pollFirst(){
if(this.size == 1){
Lexeme first = this.head;
this.head = null;
this.tail = null;
this.size--;
return first;
}else if(this.size > 1){
Lexeme first = this.head;
this.head = first.getNext();
first.setNext(null);
this.size --;
return first;
}else{
return null;
}
}
private Lexeme pollLast(){
if(this.size == 1){
Lexeme last = this.head;
this.head = null;
this.tail = null;
this.size--;
return last;
}else if(this.size > 1){
Lexeme last = this.tail;
this.tail = last.getPrev();
last.setPrev(null);
this.size--;
return last;
}else{
return null;
}
}
private void excludeOverlap(){
if(this.size > 1){
Lexeme one = this.head;
Lexeme another = one.getNext();
do{
if(one.isOverlap(another)
&& Lexeme.TYPE_CJK_NORMAL == one.getLexemeType()
&& Lexeme.TYPE_CJK_NORMAL == another.getLexemeType()){
another = another.getNext();
one.setNext(another);
if(another != null){
another.setPrev(one);
}
this.size--;
}else{
one = another;
another = another.getNext();
}
}while(another != null);
}
}
private int size(){
return this.size;
}
}
}
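// (a worked note on the IKSortedLinkSet above: addLexeme keeps the chain ordered
// by Lexeme.compareTo and silently drops exact duplicates (compareTo == 0, i.e.
// same begin and same length), which is why excludeOverlap() only needs to
// compare neighbouring entries in the chain.)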
/**
*
*/
package org.wltea.analyzer;
import java.io.IOException;
import java.io.Reader;
import java.util.List;
import org.wltea.analyzer.cfg.Configuration;
import org.wltea.analyzer.help.CharacterHelper;
import org.wltea.analyzer.seg.ISegmenter;
public final class IKSegmentation{
private Reader input;
private static final int BUFF_SIZE = 3072;
private static final int BUFF_EXHAUST_CRITICAL = 48;
private char[] segmentBuff;
private Context context;
private List<ISegmenter> segmenters;
public IKSegmentation(Reader input){
this(input , false);
}
public IKSegmentation(Reader input , boolean isMaxWordLength){
this.input = input ;
segmentBuff = new char[BUFF_SIZE];
context = new Context(segmentBuff , isMaxWordLength);
segmenters = Configuration.loadSegmenter();
}
public synchronized Lexeme next() throws IOException {
if(context.getResultSize() == 0){
/*
 * Read data from the reader and fill the buffer.
 * If the reader is read into the buffer in several passes, the buffer must be
 * shifted to preserve data read last time but not yet processed.
 */
int available = fillBuffer(input);
if(available <= 0){
context.resetContext();
return null;
}else{
int buffIndex = 0;
for( ; buffIndex < available ; buffIndex++){
context.setCursor(buffIndex);
segmentBuff[buffIndex] = CharacterHelper.regularize(segmentBuff[buffIndex]);
for(ISegmenter segmenter : segmenters){
segmenter.nextLexeme(segmentBuff , context);
}
/*
 * Break out of the current loop (the buffer must be shifted and refilled)
 * when all of the following hold:
 * 1. available == BUFF_SIZE: the buffer was filled to capacity
 * 2. buffIndex < available - 1 && buffIndex > available - BUFF_EXHAUST_CRITICAL:
 *    the pointer is inside the critical zone
 * 3. !context.isBufferLocked(): no segmenter is holding the buffer
 */
if(available == BUFF_SIZE
&& buffIndex < available - 1
&& buffIndex > available - BUFF_EXHAUST_CRITICAL
&& !context.isBufferLocked()){
break;
}
}
for(ISegmenter segmenter : segmenters){
segmenter.reset();
}
context.setLastAnalyzed(buffIndex);
context.setBuffOffset(context.getBuffOffset() + buffIndex);
if(context.isMaxWordLength()){
context.excludeOverlap();
}
return buildLexeme(context.firstLexeme());
}
}else{
return buildLexeme(context.firstLexeme());
}
}
private int fillBuffer(Reader reader) throws IOException{
int readCount = 0;
if(context.getBuffOffset() == 0){
readCount = reader.read(segmentBuff);
}else{
int offset = context.getAvailable() - context.getLastAnalyzed();
if(offset > 0){
System.arraycopy(segmentBuff , context.getLastAnalyzed() , this.segmentBuff , 0 , offset);
readCount = offset;
}
readCount += reader.read(segmentBuff , offset , BUFF_SIZE - offset);
}
context.setAvailable(readCount);
return readCount;
}
private Lexeme buildLexeme(Lexeme lexeme){
if(lexeme != null){
lexeme.setLexemeText(String.valueOf(segmentBuff , lexeme.getBegin() , lexeme.getLength()));
return lexeme;
}else{
return null;
}
}
public synchronized void reset(Reader input) {
this.input = input;
context.resetContext();
for(ISegmenter segmenter : segmenters){
segmenter.reset();
}
}
}
......@@ -7,10 +7,6 @@ import org.elasticsearch.common.logging.ESLogger;
import org.elasticsearch.common.logging.Loggers;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.wltea.analyzer.seg.CJKSegmenter;
import org.wltea.analyzer.seg.ISegmenter;
import org.wltea.analyzer.seg.LetterSegmenter;
import org.wltea.analyzer.seg.QuantifierSegmenter;
import java.io.*;
import java.util.ArrayList;
......@@ -18,8 +14,6 @@ import java.util.InvalidPropertiesFormatException;
import java.util.List;
import java.util.Properties;
import static org.wltea.analyzer.dic.Dictionary.getInstance;
public class Configuration {
private static String FILE_NAME = "ik/IKAnalyzer.cfg.xml";
......@@ -27,6 +21,10 @@ public class Configuration {
private static final String EXT_STOP = "ext_stopwords";
private static ESLogger logger = null;
private Properties props;
/*
 * whether to use the smart segmentation strategy
 */
private boolean useSmart=true;
public Configuration(Settings settings){
......@@ -34,7 +32,8 @@ public class Configuration {
props = new Properties();
Environment environment=new Environment(settings);
File fileConfig= new File(environment.configFile(), FILE_NAME);
InputStream input = null;// Configuration.class.getResourceAsStream(FILE_NAME);
InputStream input = null;
try {
input = new FileInputStream(fileConfig);
} catch (FileNotFoundException e) {
......@@ -52,7 +51,27 @@ public class Configuration {
}
}
public List<String> getExtDictionarys(){
/**
 * Returns the useSmart flag.
 * useSmart = true: the segmenter uses the smart (coarse-grained) strategy;
 * useSmart = false: fine-grained segmentation.
 * @return useSmart
 */
public boolean useSmart() {
return useSmart;
}
/**
 * Sets the useSmart flag.
 * useSmart = true: the segmenter uses the smart (coarse-grained) strategy;
 * useSmart = false: fine-grained segmentation.
 * @param useSmart
 */
public void setUseSmart(boolean useSmart) {
this.useSmart = useSmart;
}
public List<String> getExtDictionarys(){
List<String> extDictFiles = new ArrayList<String>(2);
String extDictCfg = props.getProperty(EXT_DICT);
if(extDictCfg != null){
......@@ -89,13 +108,4 @@ public class Configuration {
}
return extStopWordDictFiles;
}
public static List<ISegmenter> loadSegmenter(){
getInstance();
List<ISegmenter> segmenters = new ArrayList<ISegmenter>(4);
segmenters.add(new QuantifierSegmenter());
segmenters.add(new LetterSegmenter());
segmenters.add(new CJKSegmenter());
return segmenters;
}
}
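For reference, here is a minimal IKAnalyzer.cfg.xml consistent with the ext_dict/ext_stopwords keys read above, assuming the file is a standard Java XML properties file (loaded via Properties.loadFromXML); the dictionary paths are examples only:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/my_dict.dic</entry>
    <entry key="ext_stopwords">custom/my_stopwords.dic</entry>
</properties>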
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
import org.wltea.analyzer.dic.Dictionary;
import java.io.IOException;
import java.io.Reader;
import java.util.*;
/**
 *
 * Segmentation context state
 *
 */
class AnalyzeContext {
//default buffer size
private static final int BUFF_SIZE = 4096;
//threshold at which the buffer counts as nearly exhausted
private static final int BUFF_EXHAUST_CRITICAL = 100;
//character buffer for reads
private char[] segmentBuff;
//character type array
private int[] charTypes;
//total length of reader text analyzed so far;
//when analyzing in several passes, this accumulates the offset of the current
//segmentBuff relative to the start of the reader
private int buffOffset;
//current position pointer within the buffer
private int cursor;
//length of the usable text read in most recently
private int available;
//sub-segmenter locks;
//a non-empty set means some sub-segmenter is still using segmentBuff
private Set<String> buffLocker;
//raw segmentation results, before disambiguation
private QuickSortSet orgLexemes;
//index of LexemePath by starting position
private Map<Integer , LexemePath> pathMap;
//final segmentation result list
private LinkedList<Lexeme> results;
//segmenter option
private boolean useSmart;
public AnalyzeContext(boolean useSmart){
this.useSmart = useSmart;
this.segmentBuff = new char[BUFF_SIZE];
this.charTypes = new int[BUFF_SIZE];
this.buffLocker = new HashSet<String>();
this.orgLexemes = new QuickSortSet();
this.pathMap = new HashMap<Integer , LexemePath>();
this.results = new LinkedList<Lexeme>();
}
int getCursor(){
return this.cursor;
}
//
// void setCursor(int cursor){
// this.cursor = cursor;
// }
char[] getSegmentBuff(){
return this.segmentBuff;
}
char getCurrentChar(){
return this.segmentBuff[this.cursor];
}
int getCurrentCharType(){
return this.charTypes[this.cursor];
}
int getBufferOffset(){
return this.buffOffset;
}
/**
 * Fills segmentBuff according to the current context state.
 * @param reader
 * @return length of the (valid) text still to be analyzed
 * @throws IOException
 */
int fillBuffer(Reader reader) throws IOException{
int readCount = 0;
if(this.buffOffset == 0){
//first read from the reader
readCount = reader.read(segmentBuff);
}else{
int offset = this.available - this.cursor;
if(offset > 0){
//more was read than processed last time: copy the unprocessed tail to the head of segmentBuff
System.arraycopy(this.segmentBuff , this.cursor , this.segmentBuff , 0 , offset);
readCount = offset;
}
//keep reading from the reader, filling the rest of segmentBuff starting at offset
readCount += reader.read(this.segmentBuff , offset , BUFF_SIZE - offset);
}
//record the number of usable characters read from the Reader this time
this.available = readCount;
//reset the current pointer
this.cursor = 0;
return readCount;
}
/**
 * Initializes the buffer pointer and processes the first character
 */
void initCursor(){
this.cursor = 0;
this.segmentBuff[this.cursor] = CharacterUtil.regularize(this.segmentBuff[this.cursor]);
this.charTypes[this.cursor] = CharacterUtil.identifyCharType(this.segmentBuff[this.cursor]);
}
/**
 * Advances the pointer by one and processes the character there.
 * Returns true on success; false if the pointer is already at the end of
 * the buffer and cannot advance.
 */
boolean moveCursor(){
if(this.cursor < this.available - 1){
this.cursor++;
this.segmentBuff[this.cursor] = CharacterUtil.regularize(this.segmentBuff[this.cursor]);
this.charTypes[this.cursor] = CharacterUtil.identifyCharType(this.segmentBuff[this.cursor]);
return true;
}else{
return false;
}
}
/**
 * Marks the current segmentBuff as locked by recording the name of the
 * sub-segmenter occupying it.
 * @param segmenterName
 */
void lockBuffer(String segmenterName){
this.buffLocker.add(segmenterName);
}
/**
 * Removes the given sub-segmenter name, releasing its hold on segmentBuff.
 * @param segmenterName
 */
void unlockBuffer(String segmenterName){
this.buffLocker.remove(segmenterName);
}
/**
 * The buffer counts as locked as long as any segmenterName remains in buffLocker.
 * @return boolean whether the buffer is locked
 */
boolean isBufferLocked(){
return this.buffLocker.size() > 0;
}
/**
 * Checks whether the current segmentBuff has been fully consumed, i.e. the
 * cursor has moved to the end of segmentBuff (this.available - 1).
 * @return
 */
boolean isBufferConsumed(){
return this.cursor == this.available - 1;
}
/**
 * Checks whether segmentBuff needs to be refilled with new data.
 *
 * Refill (break the current loop, shift the buffer and read again) when all
 * of the following hold:
 * 1. available == BUFF_SIZE: the buffer was filled to capacity
 * 2. cursor < available - 1 && cursor > available - BUFF_EXHAUST_CRITICAL:
 *    the pointer is inside the critical zone
 * 3. !isBufferLocked(): no segmenter is holding the buffer
 * @return
 */
boolean needRefillBuffer(){
return this.available == BUFF_SIZE
&& this.cursor < this.available - 1
&& this.cursor > this.available - BUFF_EXHAUST_CRITICAL
&& !this.isBufferLocked();
}
/**
 * Accumulates the offset of the current segmentBuff relative to the start of the reader
 */
void markBufferOffset(){
this.buffOffset += this.cursor;
}
/**
 * Adds a lexeme to the raw result set
 * @param lexeme
 */
void addLexeme(Lexeme lexeme){
this.orgLexemes.addLexeme(lexeme);
}
/**
 * Adds a segmentation result path to the
 * path-start-position ---> path mapping table
 * @param path
 */
void addLexemePath(LexemePath path){
if(path != null){
this.pathMap.put(path.getPathBegin(), path);
}
}
/**
 * Returns the raw segmentation results
 * @return
 */
QuickSortSet getOrgLexemes(){
return this.orgLexemes;
}
/**
 * Pushes segmentation results into the final result list:
 * 1. walk the buffer from the head to the processed position this.cursor
 * 2. push lexemes found in pathMap into results
 * 3. push CJK characters not covered by pathMap into results as single characters
 */
void outputToResult(){
int index = 0;
for( ; index <= this.cursor ;){
//skip non-CJK characters
if(CharacterUtil.CHAR_USELESS == this.charTypes[index]){
index++;
continue;
}
//look up the LexemePath starting at this index in pathMap
LexemePath path = this.pathMap.get(index);
if(path != null){
//emit the lexemes of this LexemePath into results
Lexeme l = path.pollFirst();
while(l != null){
this.results.add(l);
//move index past the lexeme
index = l.getBegin() + l.getLength();
l = path.pollFirst();
if(l != null){
//emit single characters that fall between lexemes inside the path
for(;index < l.getBegin();index++){
this.outputSingleCJK(index);
}
}
}
}else{//no LexemePath found at this index in pathMap
//emit as a single character
this.outputSingleCJK(index);
index++;
}
}
//clear the current map
this.pathMap.clear();
}
/**
 * Emits a CJK character as a single-character lexeme
 * @param index
 */
private void outputSingleCJK(int index){
if(CharacterUtil.CHAR_CHINESE == this.charTypes[index]){
Lexeme singleCharLexeme = new Lexeme(this.buffOffset , index , 1 , Lexeme.TYPE_CNCHAR);
this.results.add(singleCharLexeme);
}else if(CharacterUtil.CHAR_OTHER_CJK == this.charTypes[index]){
Lexeme singleCharLexeme = new Lexeme(this.buffOffset , index , 1 , Lexeme.TYPE_OTHER_CJK);
this.results.add(singleCharLexeme);
}
}
/**
 * Returns the next lexeme, merging compound words along the way
 * @return
 */
Lexeme getNextLexeme(){
//take and remove the first Lexeme from the result list
Lexeme result = this.results.pollFirst();
while(result != null){
//merge numerals and quantifiers
this.compound(result);
if(Dictionary.isStopWord(this.segmentBuff , result.getBegin() , result.getLength())){
//it is a stopword: continue with the next entry in the list
result = this.results.pollFirst();
}else{
//not a stopword: generate the lexeme text and emit it
result.setLexemeText(String.valueOf(segmentBuff , result.getBegin() , result.getLength()));
break;
}
}
return result;
}
/**
 * Resets the segmentation context state
 */
void reset(){
this.buffLocker.clear();
this.orgLexemes = new QuickSortSet();
this.available =0;
this.buffOffset = 0;
this.charTypes = new int[BUFF_SIZE];
this.cursor = 0;
this.results.clear();
this.segmentBuff = new char[BUFF_SIZE];
this.pathMap.clear();
}
/**
 * Merges compound lexemes (numerals + quantifiers)
 */
private void compound(Lexeme result){
if(!this.useSmart){
return ;
}
//merge numerals and quantifiers
if(!this.results.isEmpty()){
if(Lexeme.TYPE_ARABIC == result.getLexemeType()){
Lexeme nextLexeme = this.results.peekFirst();
boolean appendOk = false;
if(Lexeme.TYPE_CNUM == nextLexeme.getLexemeType()){
//merge arabic numeral + Chinese numeral
appendOk = result.append(nextLexeme, Lexeme.TYPE_CNUM);
}else if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
//merge arabic numeral + Chinese quantifier
appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
}
if(appendOk){
//pop the merged lexeme
this.results.pollFirst();
}
}
//a second round of merging may apply
if(Lexeme.TYPE_CNUM == result.getLexemeType() && !this.results.isEmpty()){
Lexeme nextLexeme = this.results.peekFirst();
boolean appendOk = false;
if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
//merge Chinese numeral + Chinese quantifier
appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
}
if(appendOk){
//pop the merged lexeme
this.results.pollFirst();
}
}
}
}
}
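To make the buffer-shift arithmetic in fillBuffer concrete, here is a tiny standalone sketch (buffer sizes shrunk for illustration; this is not part of the analyzer itself):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class FillBufferDemo {
    public static void main(String[] args) throws IOException {
        char[] buff = new char[8];                  // stands in for segmentBuff (BUFF_SIZE = 8)
        Reader reader = new StringReader("abcdefghijk");
        int available = reader.read(buff);          // first fill: "abcdefgh", available = 8
        int cursor = 5;                             // suppose analysis stopped at index 5
        int offset = available - cursor;            // 3 unconsumed chars: "fgh"
        System.arraycopy(buff, cursor, buff, 0, offset); // move the tail to the front
        int readCount = offset + reader.read(buff, offset, buff.length - offset);
        // buff now starts with "fghijk" and readCount = 6 usable chars; cursor restarts at 0
        System.out.println(new String(buff, 0, readCount)); // prints "fghijk"
    }
}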
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.dic.Hit;
import java.util.LinkedList;
import java.util.List;
/**
 * Sub-segmenter for Chinese (CJK) text
 */
class CJKSegmenter implements ISegmenter {
//sub-segmenter label
static final String SEGMENTER_NAME = "CJK_SEGMENTER";
//queue of pending dictionary hits
private List<Hit> tmpHits;
CJKSegmenter(){
this.tmpHits = new LinkedList<Hit>();
}
/* (non-Javadoc)
* @see org.wltea.analyzer.core.ISegmenter#analyze(org.wltea.analyzer.core.AnalyzeContext)
*/
public void analyze(AnalyzeContext context) {
if(CharacterUtil.CHAR_USELESS != context.getCurrentCharType()){
//process the hits already queued in tmpHits first
if(!this.tmpHits.isEmpty()){
//process the pending hit queue
Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);
for(Hit hit : tmpArray){
hit = Dictionary.matchInMainDictWithHit(context.getSegmentBuff(), context.getCursor() , hit);
if(hit.isMatch()){
//emit the current word
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , hit.getBegin() , context.getCursor() - hit.getBegin() + 1 , Lexeme.TYPE_CNWORD);
context.addLexeme(newLexeme);
if(!hit.isPrefix()){//not a word prefix: the hit needs no further matching, remove it
this.tmpHits.remove(hit);
}
}else if(hit.isUnmatch()){
//the hit is not a word, remove it
this.tmpHits.remove(hit);
}
}
}
//*********************************
//then try a single-character match at the current cursor position
Hit singleCharHit = Dictionary.matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
if(singleCharHit.isMatch()){//the single character is a word
//emit the current word
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , context.getCursor() , 1 , Lexeme.TYPE_CNWORD);
context.addLexeme(newLexeme);
//it may also be a word prefix
if(singleCharHit.isPrefix()){
//a prefix match goes into the hit queue
this.tmpHits.add(singleCharHit);
}
}else if(singleCharHit.isPrefix()){//the single character is a word prefix
//a prefix match goes into the hit queue
this.tmpHits.add(singleCharHit);
}
}else{
//hit a CHAR_USELESS character
//clear the queue
this.tmpHits.clear();
}
//check whether the buffer has been fully consumed
if(context.isBufferConsumed()){
//clear the queue
this.tmpHits.clear();
}
//decide whether to lock the buffer
if(this.tmpHits.size() == 0){
context.unlockBuffer(SEGMENTER_NAME);
}else{
context.lockBuffer(SEGMENTER_NAME);
}
}
/* (non-Javadoc)
* @see org.wltea.analyzer.core.ISegmenter#reset()
*/
public void reset() {
//clear the queue
this.tmpHits.clear();
}
}
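A short trace of how the Hit states drive this segmenter, assuming (hypothetically) that both 上海 and 上海市 are entries in the main dictionary:

// analyzing "上海市", one cursor position per analyze() call:
// cursor=0: matchInMainDict("上")        -> prefix only: queue the hit in tmpHits
// cursor=1: matchInMainDictWithHit(hit)  -> match ("上海" is a word) and still a
//           prefix (of "上海市"): emit Lexeme(begin=0,len=2), keep the hit queued
// cursor=2: matchInMainDictWithHit(hit)  -> match ("上海市") and no longer a
//           prefix: emit Lexeme(begin=0,len=3), remove the hit from tmpHits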
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.dic.Hit;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
/**
 *
 * Sub-segmenter for Chinese numerals and quantifiers
 */
class CN_QuantifierSegmenter implements ISegmenter{
//sub-segmenter label
static final String SEGMENTER_NAME = "QUAN_SEGMENTER";
//Chinese numeral characters
private static String Chn_Num = "一二两三四五六七八九十零壹贰叁肆伍陆柒捌玖拾百千万亿拾佰仟萬億兆卅廿";//Cnum
private static Set<Character> ChnNumberChars = new HashSet<Character>();
static{
char[] ca = Chn_Num.toCharArray();
for(char nChar : ca){
ChnNumberChars.add(nChar);
}
}
/*
 * Start position of the lexeme;
 * doubles as the sub-segmenter state flag:
 * when nStart > -1 the segmenter is in the middle of processing characters
 */
private int nStart;
/*
 * End position of the lexeme;
 * nEnd records the position of the last valid numeral seen in the lexeme
 */
private int nEnd;
//queue of pending quantifier hits
private List<Hit> countHits;
CN_QuantifierSegmenter(){
nStart = -1;
nEnd = -1;
this.countHits = new LinkedList<Hit>();
}
/**
 * Segmentation
 */
public void analyze(AnalyzeContext context) {
//process Chinese numerals
this.processCNumber(context);
//process Chinese quantifiers
this.processCount(context);
//decide whether to lock the buffer
if(this.nStart == -1 && this.nEnd == -1 && countHits.isEmpty()){
//unlock the buffer
context.unlockBuffer(SEGMENTER_NAME);
}else{
context.lockBuffer(SEGMENTER_NAME);
}
}
/**
 * Resets the sub-segmenter state
 */
public void reset() {
nStart = -1;
nEnd = -1;
countHits.clear();
}
/**
 * Processes Chinese numerals
 */
private void processCNumber(AnalyzeContext context){
if(nStart == -1 && nEnd == -1){//initial state
if(CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()
&& ChnNumberChars.contains(context.getCurrentChar())){
//record the start and end positions of the numeral
nStart = context.getCursor();
nEnd = context.getCursor();
}
}else{//processing state
if(CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()
&& ChnNumberChars.contains(context.getCurrentChar())){
//record the end position of the numeral
nEnd = context.getCursor();
}else{
//emit the numeral
this.outputNumLexeme(context);
//reset the start and end pointers
nStart = -1;
nEnd = -1;
}
}
//the buffer is consumed but a numeral is still pending
if(context.isBufferConsumed()){
if(nStart != -1 && nEnd != -1){
//emit the numeral
outputNumLexeme(context);
//reset the start and end pointers
nStart = -1;
nEnd = -1;
}
}
}
/**
 * Processes Chinese quantifiers
 * @param context
 */
private void processCount(AnalyzeContext context){
// check whether a quantifier scan should start at all
if(!this.needCountScan(context)){
return;
}
if(CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()){
//process the hits already queued in countHits first
if(!this.countHits.isEmpty()){
//process the pending hit queue
Hit[] tmpArray = this.countHits.toArray(new Hit[this.countHits.size()]);
for(Hit hit : tmpArray){
hit = Dictionary.matchInMainDictWithHit(context.getSegmentBuff(), context.getCursor() , hit);
if(hit.isMatch()){
//emit the current word
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , hit.getBegin() , context.getCursor() - hit.getBegin() + 1 , Lexeme.TYPE_COUNT);
context.addLexeme(newLexeme);
if(!hit.isPrefix()){//not a word prefix: the hit needs no further matching, remove it
this.countHits.remove(hit);
}
}else if(hit.isUnmatch()){
//the hit is not a word, remove it
this.countHits.remove(hit);
}
}
}
//*********************************
//then try a single-character match at the current cursor position
Hit singleCharHit = Dictionary.matchInQuantifierDict(context.getSegmentBuff(), context.getCursor(), 1);
if(singleCharHit.isMatch()){//the single character is a quantifier
//emit the current word
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , context.getCursor() , 1 , Lexeme.TYPE_COUNT);
context.addLexeme(newLexeme);
//it may also be a word prefix
if(singleCharHit.isPrefix()){
//a prefix match goes into the hit queue
this.countHits.add(singleCharHit);
}
}else if(singleCharHit.isPrefix()){//the single character is a quantifier prefix
//a prefix match goes into the hit queue
this.countHits.add(singleCharHit);
}
}else{
//the input is not a Chinese character:
//discard the unfinished quantifiers
this.countHits.clear();
}
//the buffer is consumed but quantifiers are still pending
if(context.isBufferConsumed()){
//discard the unfinished quantifiers
this.countHits.clear();
}
}
/**
 * Checks whether a quantifier scan is needed
 * @return
 */
private boolean needCountScan(AnalyzeContext context){
if((nStart != -1 && nEnd != -1 ) || !countHits.isEmpty()){
//a Chinese numeral or a quantifier is currently being processed
return true;
}else{
//look for an adjacent numeral
if(!context.getOrgLexemes().isEmpty()){
Lexeme l = context.getOrgLexemes().peekLast();
if(Lexeme.TYPE_CNUM == l.getLexemeType() || Lexeme.TYPE_ARABIC == l.getLexemeType()){
if(l.getBegin() + l.getLength() == context.getCursor()){
return true;
}
}
}
}
return false;
}
/**
 * Adds a numeral lexeme to the result set
 * @param context
 */
private void outputNumLexeme(AnalyzeContext context){
if(nStart > -1 && nEnd > -1){
//emit the numeral
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , nStart , nEnd - nStart + 1 , Lexeme.TYPE_CNUM);
context.addLexeme(newLexeme);
}
}
}
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
* Character set identification utility
*/
package org.wltea.analyzer.core;
/**
 *
 * Character set identification utility
 */
class CharacterUtil {
public static final int CHAR_USELESS = 0;
public static final int CHAR_ARABIC = 0X00000001;
public static final int CHAR_ENGLISH = 0X00000002;
public static final int CHAR_CHINESE = 0X00000004;
public static final int CHAR_OTHER_CJK = 0X00000008;
/**
 * Identifies the character type
 * @param input
 * @return int one of the character type constants defined by CharacterUtil
 */
static int identifyCharType(char input){
if(input >= '0' && input <= '9'){
return CHAR_ARABIC;
}else if((input >= 'a' && input <= 'z')
|| (input >= 'A' && input <= 'Z')){
return CHAR_ENGLISH;
}else {
Character.UnicodeBlock ub = Character.UnicodeBlock.of(input);
if(ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
|| ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
|| ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A){
//currently known Chinese ideograph blocks
return CHAR_CHINESE;
}else if(ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS //full-width digits and Japanese/Korean characters
//Korean character blocks
|| ub == Character.UnicodeBlock.HANGUL_SYLLABLES
|| ub == Character.UnicodeBlock.HANGUL_JAMO
|| ub == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
//Japanese character blocks
|| ub == Character.UnicodeBlock.HIRAGANA //hiragana
|| ub == Character.UnicodeBlock.KATAKANA //katakana
|| ub == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS){
return CHAR_OTHER_CJK;
}
}
//all other characters are left unprocessed
return CHAR_USELESS;
}
/**
 * Normalizes a character (full-width to half-width, upper case to lower case)
 * @param input
 * @return char
 */
static char regularize(char input){
if (input == 12288) {
input = (char) 32;
}else if (input > 65280 && input < 65375) {
input = (char) (input - 65248);
}else if (input >= 'A' && input <= 'Z') {
input += 32;
}
return input;
}
}
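A quick check of what regularize actually does. Note the branches are an else-if chain, so they are not combined: a full-width upper-case letter becomes a half-width upper-case letter, not a lower-case one. (CharacterUtil and regularize are package-private, so this demo has to live in the same package.)

class RegularizeDemo {
    public static void main(String[] args) {
        System.out.println(CharacterUtil.regularize('A'));            // 'a': ASCII upper -> lower
        System.out.println(CharacterUtil.regularize('\uFF21'));       // 'A': full-width 'A' -> half-width, case kept
        System.out.println((int) CharacterUtil.regularize('\u3000')); // 32: ideographic space -> ASCII space
    }
}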
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
import java.util.Stack;
import java.util.TreeSet;
/**
 * IK segmentation ambiguity arbitrator
 */
class IKArbitrator {
IKArbitrator(){
}
/**
 * Resolves segmentation ambiguity
 * @param context
 * @param useSmart
 */
void process(AnalyzeContext context , boolean useSmart){
QuickSortSet orgLexemes = context.getOrgLexemes();
Lexeme orgLexeme = orgLexemes.pollFirst();
LexemePath crossPath = new LexemePath();
while(orgLexeme != null){
if(!crossPath.addCrossLexeme(orgLexeme)){
//found the next lexeme that does not cross crossPath: close the current path
if(crossPath.size() == 1 || !useSmart){
//crossPath is unambiguous, or ambiguity handling is off:
//output the current crossPath directly
context.addLexemePath(crossPath);
}else{
//disambiguate the current crossPath
QuickSortSet.Cell headCell = crossPath.getHead();
LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
//output the disambiguation result judgeResult
context.addLexemePath(judgeResult);
}
//put orgLexeme into a new crossPath
crossPath = new LexemePath();
crossPath.addCrossLexeme(orgLexeme);
}
orgLexeme = orgLexemes.pollFirst();
}
//process the final path
if(crossPath.size() == 1 || !useSmart){
//crossPath is unambiguous, or ambiguity handling is off:
//output the current crossPath directly
context.addLexemePath(crossPath);
}else{
//disambiguate the current crossPath
QuickSortSet.Cell headCell = crossPath.getHead();
LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
//output the disambiguation result judgeResult
context.addLexemePath(judgeResult);
}
}
/**
 * Ambiguity resolution
 * @param lexemeCell head of the ambiguous path's lexeme chain
 * @param fullTextLength text length of the ambiguous path
 * @return the best candidate path
 */
private LexemePath judge(QuickSortSet.Cell lexemeCell , int fullTextLength){
//set of candidate paths
TreeSet<LexemePath> pathOptions = new TreeSet<LexemePath>();
//current candidate path
LexemePath option = new LexemePath();
//walk the crossPath once, collecting a stack of conflicting lexemes on the way
Stack<QuickSortSet.Cell> lexemeStack = this.forwardPath(lexemeCell , option);
//the current lexeme chain may not be ideal: add it to the candidate set
pathOptions.add(option.copy());
//process the ambiguous lexemes
QuickSortSet.Cell c = null;
while(!lexemeStack.isEmpty()){
c = lexemeStack.pop();
//roll back the lexeme chain
this.backPath(c.getLexeme() , option);
//starting from the ambiguous lexeme, generate alternatives recursively
this.forwardPath(c , option);
pathOptions.add(option.copy());
}
//return the best alternative in the set
return pathOptions.first();
}
/**
 * Walks forward, adding lexemes to build a conflict-free lexeme combination
 * @param lexemeCell chain head
 * @param option the path being built
 * @return stack of conflicting lexeme cells
 */
private Stack<QuickSortSet.Cell> forwardPath(QuickSortSet.Cell lexemeCell , LexemePath option){
//stack of conflicting lexemes
Stack<QuickSortSet.Cell> conflictStack = new Stack<QuickSortSet.Cell>();
QuickSortSet.Cell c = lexemeCell;
//iterate over the lexeme chain
while(c != null && c.getLexeme() != null){
if(!option.addNotCrossLexeme(c.getLexeme())){
//the lexeme crosses the path: on failure, push it onto the conflict stack
conflictStack.push(c);
}
c = c.getNext();
}
return conflictStack;
}
/**
 * Rolls the lexeme chain back until it can accept the given lexeme
 * @param l the lexeme to make room for
 * @param option the path being rolled back
 */
private void backPath(Lexeme l , LexemePath option){
while(option.checkCross(l)){
option.removeTail();
}
}
}
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*/
package org.wltea.analyzer.core;
import org.elasticsearch.common.logging.ESLogger;
import org.elasticsearch.common.logging.Loggers;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
/**
 * Main IK segmenter class
 *
 */
public final class IKSegmenter {
//character stream reader
private Reader input;
//segmentation context
private AnalyzeContext context;
//list of sub-segmenters
private List<ISegmenter> segmenters;
//segmentation ambiguity arbitrator
private IKArbitrator arbitrator;
private ESLogger logger=null;
private final boolean useSmart;
/**
 * IK segmenter constructor
 * @param input
 * @param useSmart true to use the smart segmentation strategy
 *
 * Non-smart segmentation: fine-grained, emits every possible split
 * Smart segmentation: merges numerals with quantifiers and disambiguates the results
 */
public IKSegmenter(Reader input , boolean useSmart){
logger = Loggers.getLogger("ik-analyzer");
this.input = input;
this.useSmart=useSmart;
this.init();
}
/**
 * Initialization
 */
private void init(){
//initialize the segmentation context
this.context = new AnalyzeContext(useSmart);
//load the sub-segmenters
this.segmenters = this.loadSegmenters();
//load the ambiguity arbitrator
this.arbitrator = new IKArbitrator();
}
/**
 * Initializes the dictionary and loads the sub-segmenter implementations
 * @return List<ISegmenter>
 */
private List<ISegmenter> loadSegmenters(){
List<ISegmenter> segmenters = new ArrayList<ISegmenter>(4);
//sub-segmenter for letters and digits
segmenters.add(new LetterSegmenter());
//sub-segmenter for Chinese numerals and quantifiers
segmenters.add(new CN_QuantifierSegmenter());
//sub-segmenter for Chinese words
segmenters.add(new CJKSegmenter());
return segmenters;
}
/**
 * Returns the next lexeme
 * @return Lexeme the lexeme object
 * @throws IOException
 */
public synchronized Lexeme next()throws IOException{
Lexeme l = null;
while((l = context.getNextLexeme()) == null ){
/*
 * Read data from the reader and fill the buffer.
 * If the reader is read into the buffer in several passes, the buffer must
 * be shifted to preserve data read last time but not yet processed.
 */
int available = context.fillBuffer(this.input);
if(available <= 0){
//the reader has been fully consumed
context.reset();
return null;
}else{
//initialize the pointer
context.initCursor();
do{
//run each sub-segmenter
for(ISegmenter segmenter : segmenters){
segmenter.analyze(context);
}
//the buffer is nearly consumed; new characters must be read in
if(context.needRefillBuffer()){
break;
}
//advance the pointer
}while(context.moveCursor());
//reset the sub-segmenters for the next round
for(ISegmenter segmenter : segmenters){
segmenter.reset();
}
}
//disambiguate the segmentation results
logger.debug("useSmart:"+String.valueOf(useSmart)); //diagnostic trace of the segmentation mode
this.arbitrator.process(context, useSmart);
//push results to the result list, emitting unsegmented single CJK characters
context.outputToResult();
//record this round's buffer offset
context.markBufferOffset();
}
return l;
}
/**
 * Resets the segmenter to its initial state
 * @param input
 */
public synchronized void reset(Reader input) {
this.input = input;
context.reset();
for(ISegmenter segmenter : segmenters){
segmenter.reset();
}
}
}
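Typical usage of this class, grounded in the API above (the sample text is arbitrary, and the dictionaries are assumed to have been initialized already, which in this plugin happens via Configuration/Dictionary when the analyzer is set up):

import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKSegmenterDemo {
    public static void main(String[] args) throws Exception {
        // true = smart mode (merge numerals/quantifiers, resolve ambiguity);
        // false = fine-grained mode (emit every possible split)
        IKSegmenter seg = new IKSegmenter(new StringReader("中华人民共和国"), true);
        Lexeme l;
        while ((l = seg.next()) != null) {
            System.out.println(l.getBeginPosition() + "-" + l.getEndPosition()
                    + " : " + l.getLexemeText());
        }
    }
}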
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
/**
 *
 * Sub-segmenter interface
 */
interface ISegmenter {
/**
 * Reads the next possible lexeme from the analyzer
 * @param context the segmentation context
 */
void analyze(AnalyzeContext context);
/**
 * Resets the sub-segmenter state
 */
void reset();
}
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
import java.util.Arrays;
/**
 *
 * Sub-segmenter for English letters and arabic digits
 */
class LetterSegmenter implements ISegmenter {
//sub-segmenter label
static final String SEGMENTER_NAME = "LETTER_SEGMENTER";
//letter connector characters
private static final char[] Letter_Connector = new char[]{'#' , '&' , '+' , '-' , '.' , '@' , '_'};
//digit connector characters
private static final char[] Num_Connector = new char[]{',' , '.'};
/*
 * Start position of the lexeme;
 * doubles as the sub-segmenter state flag:
 * when start > -1 the segmenter is in the middle of processing characters
 */
private int start;
/*
 * End position of the lexeme;
 * end records the position of the last letter that is not a connector
 */
private int end;
/*
 * start position of a pure-letter run
 */
private int englishStart;
/*
 * end position of a pure-letter run
 */
private int englishEnd;
/*
 * start position of an arabic-digit run
 */
private int arabicStart;
/*
 * end position of an arabic-digit run
 */
private int arabicEnd;
LetterSegmenter(){
Arrays.sort(Letter_Connector);
Arrays.sort(Num_Connector);
this.start = -1;
this.end = -1;
this.englishStart = -1;
this.englishEnd = -1;
this.arabicStart = -1;
this.arabicEnd = -1;
}
/* (non-Javadoc)
* @see org.wltea.analyzer.core.ISegmenter#analyze(org.wltea.analyzer.core.AnalyzeContext)
*/
public void analyze(AnalyzeContext context) {
boolean bufferLockFlag = false;
//process pure English letters
bufferLockFlag = this.processEnglishLetter(context) || bufferLockFlag;
//process arabic digits
bufferLockFlag = this.processArabicLetter(context) || bufferLockFlag;
//process mixed letters (must run last; duplicates are removed via QuickSortSet)
bufferLockFlag = this.processMixLetter(context) || bufferLockFlag;
//decide whether to lock the buffer
if(bufferLockFlag){
context.lockBuffer(SEGMENTER_NAME);
}else{
//unlock the buffer
context.unlockBuffer(SEGMENTER_NAME);
}
}
/* (non-Javadoc)
* @see org.wltea.analyzer.core.ISegmenter#reset()
*/
public void reset() {
this.start = -1;
this.end = -1;
this.englishStart = -1;
this.englishEnd = -1;
this.arabicStart = -1;
this.arabicEnd = -1;
}
/**
 * Processes mixed letter/digit output,
 * e.g. windows2000 | linliangyi2005@gmail.com
 * @param context
 * @return whether the buffer must stay locked
 */
private boolean processMixLetter(AnalyzeContext context){
boolean needLock = false;
if(this.start == -1){//the segmenter has not started processing characters yet
if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()
|| CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
//record the start pointer position, marking the segmenter as active
this.start = context.getCursor();
this.end = start;
}
}else{//the segmenter is processing characters
if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()
|| CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
//record a possible end position
this.end = context.getCursor();
}else if(CharacterUtil.CHAR_USELESS == context.getCurrentCharType()
&& this.isLetterConnector(context.getCurrentChar())){
//record a possible end position
this.end = context.getCursor();
}else{
//hit a non-letter character: emit the lexeme
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , this.start , this.end - this.start + 1 , Lexeme.TYPE_LETTER);
context.addLexeme(newLexeme);
this.start = -1;
this.end = -1;
}
}
//check whether the buffer has been fully consumed
if(context.isBufferConsumed()){
if(this.start != -1 && this.end != -1){
//the buffer is consumed: emit the lexeme
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , this.start , this.end - this.start + 1 , Lexeme.TYPE_LETTER);
context.addLexeme(newLexeme);
this.start = -1;
this.end = -1;
}
}
//decide whether to lock the buffer
if(this.start == -1 && this.end == -1){
//unlock the buffer
needLock = false;
}else{
needLock = true;
}
return needLock;
}
/**
 * Processes pure English letter output
 * @param context
 * @return whether the buffer must stay locked
 */
private boolean processEnglishLetter(AnalyzeContext context){
boolean needLock = false;
if(this.englishStart == -1){//the segmenter has not started processing English characters yet
if(CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
//record the start pointer position, marking the segmenter as active
this.englishStart = context.getCursor();
this.englishEnd = this.englishStart;
}
}else {//the segmenter is processing English characters
if(CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()){
//record the current pointer position as the end position
this.englishEnd = context.getCursor();
}else{
//hit a non-English character: emit the lexeme
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , this.englishStart , this.englishEnd - this.englishStart + 1 , Lexeme.TYPE_ENGLISH);
context.addLexeme(newLexeme);
this.englishStart = -1;
this.englishEnd= -1;
}
}
//check whether the buffer has been fully consumed
if(context.isBufferConsumed()){
if(this.englishStart != -1 && this.englishEnd != -1){
//the buffer is consumed: emit the lexeme
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , this.englishStart , this.englishEnd - this.englishStart + 1 , Lexeme.TYPE_ENGLISH);
context.addLexeme(newLexeme);
this.englishStart = -1;
this.englishEnd= -1;
}
}
//decide whether to lock the buffer
if(this.englishStart == -1 && this.englishEnd == -1){
//unlock the buffer
needLock = false;
}else{
needLock = true;
}
return needLock;
}
/**
 * Processes arabic digit output
 * @param context
 * @return whether the buffer must stay locked
 */
private boolean processArabicLetter(AnalyzeContext context){
boolean needLock = false;
if(this.arabicStart == -1){//the segmenter has not started processing digit characters yet
if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()){
//record the start pointer position, marking the segmenter as active
this.arabicStart = context.getCursor();
this.arabicEnd = this.arabicStart;
}
}else {//the segmenter is processing digit characters
if(CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()){
//record the current pointer position as the end position
this.arabicEnd = context.getCursor();
}else if(CharacterUtil.CHAR_USELESS == context.getCurrentCharType()
&& this.isNumConnector(context.getCurrentChar())){
//do not emit the number yet, but do not mark the end either
}else{
//hit a non-arabic character: emit the lexeme
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , this.arabicStart , this.arabicEnd - this.arabicStart + 1 , Lexeme.TYPE_ARABIC);
context.addLexeme(newLexeme);
this.arabicStart = -1;
this.arabicEnd = -1;
}
}
//check whether the buffer has been fully consumed
if(context.isBufferConsumed()){
if(this.arabicStart != -1 && this.arabicEnd != -1){
//generate the segmented lexeme
Lexeme newLexeme = new Lexeme(context.getBufferOffset() , this.arabicStart , this.arabicEnd - this.arabicStart + 1 , Lexeme.TYPE_ARABIC);
context.addLexeme(newLexeme);
this.arabicStart = -1;
this.arabicEnd = -1;
}
}
//decide whether to lock the buffer
if(this.arabicStart == -1 && this.arabicEnd == -1){
//unlock the buffer
needLock = false;
}else{
needLock = true;
}
return needLock;
}
/**
 * Checks whether the character is a letter connector
 * @param input
 * @return
 */
private boolean isLetterConnector(char input){
int index = Arrays.binarySearch(Letter_Connector, input);
return index >= 0;
}
/**
 * Checks whether the character is a digit connector
 * @param input
 * @return
 */
private boolean isNumConnector(char input){
int index = Arrays.binarySearch(Num_Connector, input);
return index >= 0;
}
}
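The connector handling above is why an address like linliangyi2005@gmail.com stays a single TYPE_LETTER lexeme: '@' and '.' are CHAR_USELESS on their own, but isLetterConnector accepts them between letters and digits. A tiny self-contained check of the connector test itself:

import java.util.Arrays;

class ConnectorDemo {
    public static void main(String[] args) {
        char[] letterConnector = {'#', '&', '+', '-', '.', '@', '_'}; // same set as LetterSegmenter
        Arrays.sort(letterConnector); // binarySearch requires sorted input
        System.out.println(Arrays.binarySearch(letterConnector, '@') >= 0); // true
        System.out.println(Arrays.binarySearch(letterConnector, ' ') >= 0); // false
    }
}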
/**
*
*/
package org.wltea.analyzer;
public final class Lexeme implements Comparable<Lexeme>{
public static final int TYPE_CJK_NORMAL = 0;
public static final int TYPE_CJK_SN = 1;
public static final int TYPE_CJK_SF = 2;
public static final int TYPE_CJK_UNKNOWN = 3;
public static final int TYPE_NUM = 10;
public static final int TYPE_NUMCOUNT = 11;
public static final int TYPE_LETTER = 20;
private int offset;
private int begin;
private int length;
private String lexemeText;
private int lexemeType;
private Lexeme prev;
private Lexeme next;
public Lexeme(int offset , int begin , int length , int lexemeType){
this.offset = offset;
this.begin = begin;
if(length < 0){
throw new IllegalArgumentException("length < 0");
}
this.length = length;
this.lexemeType = lexemeType;
}
public boolean equals(Object o){
if(o == null){
return false;
}
if(this == o){
return true;
}
if(o instanceof Lexeme){
Lexeme other = (Lexeme)o;
if(this.offset == other.getOffset()
&& this.begin == other.getBegin()
&& this.length == other.getLength()){
return true;
}else{
return false;
}
}else{
return false;
}
}
public int hashCode(){
int absBegin = getBeginPosition();
int absEnd = getEndPosition();
return (absBegin * 37) + (absEnd * 31) + ((absBegin * absEnd) % getLength()) * 11;
}
public int compareTo(Lexeme other) {
if(this.begin < other.getBegin()){
return -1;
}else if(this.begin == other.getBegin()){
if(this.length > other.getLength()){
return -1;
}else if(this.length == other.getLength()){
return 0;
}else {
return 1;
}
}else{
return 1;
}
}
public boolean isOverlap(Lexeme other){
if(other != null){
if(this.getBeginPosition() <= other.getBeginPosition()
&& this.getEndPosition() >= other.getEndPosition()){
return true;
}else if(this.getBeginPosition() >= other.getBeginPosition()
&& this.getEndPosition() <= other.getEndPosition()){
return true;
}else {
return false;
}
}
return false;
}
public int getOffset() {
return offset;
}
public void setOffset(int offset) {
this.offset = offset;
}
public int getBegin() {
return begin;
}
public int getBeginPosition(){
return offset + begin;
}
public void setBegin(int begin) {
this.begin = begin;
}
public int getEndPosition(){
return offset + begin + length;
}
public int getLength(){
return this.length;
}
public void setLength(int length) {
if(length < 0){
throw new IllegalArgumentException("length < 0");
}
this.length = length;
}
public String getLexemeText() {
if(lexemeText == null){
return "";
}
return lexemeText;
}
public void setLexemeText(String lexemeText) {
if(lexemeText == null){
this.lexemeText = "";
this.length = 0;
}else{
this.lexemeText = lexemeText;
this.length = lexemeText.length();
}
}
public int getLexemeType() {
return lexemeType;
}
public void setLexemeType(int lexemeType) {
this.lexemeType = lexemeType;
}
public String toString(){
StringBuffer strbuf = new StringBuffer();
strbuf.append(this.getBeginPosition()).append("-").append(this.getEndPosition());
strbuf.append(" : ").append(this.lexemeText).append(" : \t");
switch(lexemeType) {
case TYPE_CJK_NORMAL :
strbuf.append("CJK_NORMAL");
break;
case TYPE_CJK_SF :
strbuf.append("CJK_SUFFIX");
break;
case TYPE_CJK_SN :
strbuf.append("CJK_NAME");
break;
case TYPE_CJK_UNKNOWN :
strbuf.append("UNKNOWN");
break;
case TYPE_NUM :
strbuf.append("NUMEBER");
break;
case TYPE_NUMCOUNT :
strbuf.append("COUNT");
break;
case TYPE_LETTER :
strbuf.append("LETTER");
break;
}
return strbuf.toString();
}
Lexeme getPrev() {
return prev;
}
void setPrev(Lexeme prev) {
this.prev = prev;
}
Lexeme getNext() {
return next;
}
void setNext(Lexeme next) {
this.next = next;
}
}
\ No newline at end of file
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
/**
* IK词元对象
*/
public class Lexeme implements Comparable<Lexeme>{
//lexemeType常量
//未知
public static final int TYPE_UNKNOWN = 0;
//英文
public static final int TYPE_ENGLISH = 1;
//数字
public static final int TYPE_ARABIC = 2;
//英文数字混合
public static final int TYPE_LETTER = 3;
//中文词元
public static final int TYPE_CNWORD = 4;
//中文单字
public static final int TYPE_CNCHAR = 64;
//日韩文字
public static final int TYPE_OTHER_CJK = 8;
//中文数词
public static final int TYPE_CNUM = 16;
//中文量词
public static final int TYPE_COUNT = 32;
//中文数量词
public static final int TYPE_CQUAN = 48;
//词元的起始位移
private int offset;
//词元的相对起始位置
private int begin;
//词元的长度
private int length;
//词元文本
private String lexemeText;
//词元类型
private int lexemeType;
public Lexeme(int offset , int begin , int length , int lexemeType){
this.offset = offset;
this.begin = begin;
if(length < 0){
throw new IllegalArgumentException("length < 0");
}
this.length = length;
this.lexemeType = lexemeType;
}
/*
* 判断词元相等算法
* 起始位置偏移、起始位置、终止位置相同
* @see java.lang.Object#equals(Object o)
*/
public boolean equals(Object o){
if(o == null){
return false;
}
if(this == o){
return true;
}
if(o instanceof Lexeme){
Lexeme other = (Lexeme)o;
if(this.offset == other.getOffset()
&& this.begin == other.getBegin()
&& this.length == other.getLength()){
return true;
}else{
return false;
}
}else{
return false;
}
}
/*
* 词元哈希编码算法
* @see java.lang.Object#hashCode()
*/
public int hashCode(){
int absBegin = getBeginPosition();
int absEnd = getEndPosition();
return (absBegin * 37) + (absEnd * 31) + ((absBegin * absEnd) % getLength()) * 11;
}
/*
* 词元在排序集合中的比较算法
* @see java.lang.Comparable#compareTo(java.lang.Object)
*/
public int compareTo(Lexeme other) {
//起始位置优先
if(this.begin < other.getBegin()){
return -1;
}else if(this.begin == other.getBegin()){
//词元长度优先
if(this.length > other.getLength()){
return -1;
}else if(this.length == other.getLength()){
return 0;
}else {//this.length < other.getLength()
return 1;
}
}else{//this.begin > other.getBegin()
return 1;
}
}
public int getOffset() {
return offset;
}
public void setOffset(int offset) {
this.offset = offset;
}
public int getBegin() {
return begin;
}
/**
* 获取词元在文本中的起始位置
* @return int
*/
public int getBeginPosition(){
return offset + begin;
}
public void setBegin(int begin) {
this.begin = begin;
}
/**
* 获取词元在文本中的结束位置
* @return int
*/
public int getEndPosition(){
return offset + begin + length;
}
/**
* 获取词元的字符长度
* @return int
*/
public int getLength(){
return this.length;
}
public void setLength(int length) {
if(length < 0){
throw new IllegalArgumentException("length < 0");
}
this.length = length;
}
/**
* 获取词元的文本内容
* @return String
*/
public String getLexemeText() {
if(lexemeText == null){
return "";
}
return lexemeText;
}
public void setLexemeText(String lexemeText) {
if(lexemeText == null){
this.lexemeText = "";
this.length = 0;
}else{
this.lexemeText = lexemeText;
this.length = lexemeText.length();
}
}
/**
* 获取词元类型
* @return int
*/
public int getLexemeType() {
return lexemeType;
}
/**
* 获取词元类型标示字符串
* @return String
*/
public String getLexemeTypeString(){
switch(lexemeType) {
case TYPE_ENGLISH :
return "ENGLISH";
case TYPE_ARABIC :
return "ARABIC";
case TYPE_LETTER :
return "LETTER";
case TYPE_CNWORD :
return "CN_WORD";
case TYPE_CNCHAR :
return "CN_CHAR";
case TYPE_OTHER_CJK :
return "OTHER_CJK";
case TYPE_COUNT :
return "COUNT";
case TYPE_CNUM :
return "TYPE_CNUM";
case TYPE_CQUAN:
return "TYPE_CQUAN";
default :
return "UNKONW";
}
}
public void setLexemeType(int lexemeType) {
this.lexemeType = lexemeType;
}
/**
* 合并两个相邻的词元
* @param l
* @param lexemeType
* @return boolean 词元是否成功合并
*/
public boolean append(Lexeme l , int lexemeType){
if(l != null && this.getEndPosition() == l.getBeginPosition()){
this.length += l.getLength();
this.lexemeType = lexemeType;
return true;
}else {
return false;
}
}
/**
*
*/
public String toString(){
StringBuffer strbuf = new StringBuffer();
strbuf.append(this.getBeginPosition()).append("-").append(this.getEndPosition());
strbuf.append(" : ").append(this.lexemeText).append(" : \t");
strbuf.append(this.getLexemeTypeString());
return strbuf.toString();
}
}
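The compareTo above orders lexemes by ascending begin and, at equal begin, by descending length, so at any position the longest candidate word sorts first. A minimal sketch of that ordering (the demo class is hypothetical, assuming this Lexeme is on the classpath):
<pre>
// Hypothetical demo, not part of the plugin: illustrates Lexeme ordering.
import org.wltea.analyzer.core.Lexeme;

public class LexemeOrderDemo {
    public static void main(String[] args) {
        Lexeme word = new Lexeme(0, 0, 2, Lexeme.TYPE_CNWORD); // spans chars [0,2)
        Lexeme ch   = new Lexeme(0, 0, 1, Lexeme.TYPE_CNCHAR); // spans chars [0,1)
        Lexeme next = new Lexeme(0, 1, 1, Lexeme.TYPE_CNCHAR); // spans chars [1,2)
        System.out.println(word.compareTo(ch));  // -1 : same begin, longer first
        System.out.println(ch.compareTo(next));  // -1 : smaller begin first
    }
}
</pre>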
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
/**
* Lexeme链(路径)
*/
class LexemePath extends QuickSortSet implements Comparable<LexemePath>{
//起始位置
private int pathBegin;
//结束
private int pathEnd;
//词元链的有效字符长度
private int payloadLength;
LexemePath(){
this.pathBegin = -1;
this.pathEnd = -1;
this.payloadLength = 0;
}
/**
* 向LexemePath追加相交的Lexeme
* @param lexeme
* @return
*/
boolean addCrossLexeme(Lexeme lexeme){
if(this.isEmpty()){
this.addLexeme(lexeme);
this.pathBegin = lexeme.getBegin();
this.pathEnd = lexeme.getBegin() + lexeme.getLength();
this.payloadLength += lexeme.getLength();
return true;
}else if(this.checkCross(lexeme)){
this.addLexeme(lexeme);
if(lexeme.getBegin() + lexeme.getLength() > this.pathEnd){
this.pathEnd = lexeme.getBegin() + lexeme.getLength();
}
this.payloadLength = this.pathEnd - this.pathBegin;
return true;
}else{
return false;
}
}
/**
* 向LexemePath追加不相交的Lexeme
* @param lexeme
* @return
*/
boolean addNotCrossLexeme(Lexeme lexeme){
if(this.isEmpty()){
this.addLexeme(lexeme);
this.pathBegin = lexeme.getBegin();
this.pathEnd = lexeme.getBegin() + lexeme.getLength();
this.payloadLength += lexeme.getLength();
return true;
}else if(this.checkCross(lexeme)){
return false;
}else{
this.addLexeme(lexeme);
this.payloadLength += lexeme.getLength();
Lexeme head = this.peekFirst();
this.pathBegin = head.getBegin();
Lexeme tail = this.peekLast();
this.pathEnd = tail.getBegin() + tail.getLength();
return true;
}
}
/**
* 移除尾部的Lexeme
* @return
*/
Lexeme removeTail(){
Lexeme tail = this.pollLast();
if(this.isEmpty()){
this.pathBegin = -1;
this.pathEnd = -1;
this.payloadLength = 0;
}else{
this.payloadLength -= tail.getLength();
Lexeme newTail = this.peekLast();
this.pathEnd = newTail.getBegin() + newTail.getLength();
}
return tail;
}
/**
* 检测词元位置交叉(有歧义的切分)
* @param lexeme
* @return
*/
boolean checkCross(Lexeme lexeme){
return (lexeme.getBegin() >= this.pathBegin && lexeme.getBegin() < this.pathEnd)
|| (this.pathBegin >= lexeme.getBegin() && this.pathBegin < lexeme.getBegin()+ lexeme.getLength());
}
int getPathBegin() {
return pathBegin;
}
int getPathEnd() {
return pathEnd;
}
/**
* 获取Path的有效词长
* @return
*/
int getPayloadLength(){
return this.payloadLength;
}
/**
* 获取LexemePath的路径长度
* @return
*/
int getPathLength(){
return this.pathEnd - this.pathBegin;
}
/**
* X权重(词元长度积)
* @return
*/
int getXWeight(){
int product = 1;
Cell c = this.getHead();
while( c != null && c.getLexeme() != null){
product *= c.getLexeme().getLength();
c = c.getNext();
}
return product;
}
/**
* 词元位置权重
* @return
*/
int getPWeight(){
int pWeight = 0;
int p = 0;
Cell c = this.getHead();
while( c != null && c.getLexeme() != null){
p++;
pWeight += p * c.getLexeme().getLength() ;
c = c.getNext();
}
return pWeight;
}
LexemePath copy(){
LexemePath theCopy = new LexemePath();
theCopy.pathBegin = this.pathBegin;
theCopy.pathEnd = this.pathEnd;
theCopy.payloadLength = this.payloadLength;
Cell c = this.getHead();
while( c != null && c.getLexeme() != null){
theCopy.addLexeme(c.getLexeme());
c = c.getNext();
}
return theCopy;
}
public int compareTo(LexemePath o) {
//比较有效文本长度
if(this.payloadLength > o.payloadLength){
return -1;
}else if(this.payloadLength < o.payloadLength){
return 1;
}else{
//比较词元个数,越少越好
if(this.size() < o.size()){
return -1;
}else if (this.size() > o.size()){
return 1;
}else{
//路径跨度越大越好
if(this.getPathLength() > o.getPathLength()){
return -1;
}else if(this.getPathLength() < o.getPathLength()){
return 1;
}else {
//根据统计学结论,逆向切分概率高于正向切分,因此位置越靠后的优先
if(this.pathEnd > o.pathEnd){
return -1;
}else if(pathEnd < o.pathEnd){
return 1;
}else{
//词长越平均越好
if(this.getXWeight() > o.getXWeight()){
return -1;
}else if(this.getXWeight() < o.getXWeight()){
return 1;
}else {
//词元位置权重比较
if(this.getPWeight() > o.getPWeight()){
return -1;
}else if(this.getPWeight() < o.getPWeight()){
return 1;
}
}
}
}
}
}
return 0;
}
public String toString(){
StringBuffer sb = new StringBuffer();
sb.append("pathBegin : ").append(pathBegin).append("\r\n");
sb.append("pathEnd : ").append(pathEnd).append("\r\n");
sb.append("payloadLength : ").append(payloadLength).append("\r\n");
Cell head = this.getHead();
while(head != null){
sb.append("lexeme : ").append(head.getLexeme()).append("\r\n");
head = head.getNext();
}
return sb.toString();
}
}
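The compareTo chain above is the ambiguity arbiter: longer payload first, then fewer lexemes, then a longer path span, then the later-ending path, then the larger length product (XWeight), then the larger position weight (PWeight). A minimal sketch of the first two rules, placed in the same package because LexemePath is package-private (the demo class name is hypothetical):
<pre>
// Hypothetical demo, same package because LexemePath is package-private.
package org.wltea.analyzer.core;

public class LexemePathDemo {
    public static void main(String[] args) {
        // path 1: one two-char word covering [0,2)
        LexemePath oneWord = new LexemePath();
        oneWord.addNotCrossLexeme(new Lexeme(0, 0, 2, Lexeme.TYPE_CNWORD));
        // path 2: two single chars covering the same span
        LexemePath twoChars = new LexemePath();
        twoChars.addNotCrossLexeme(new Lexeme(0, 0, 1, Lexeme.TYPE_CNCHAR));
        twoChars.addNotCrossLexeme(new Lexeme(0, 1, 1, Lexeme.TYPE_CNCHAR));
        // equal payloadLength (2), so the path with fewer lexemes ranks first
        System.out.println(oneWord.compareTo(twoChars)); // -1
    }
}
</pre>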
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.core;
/**
* IK分词器专用的Lexem快速排序集合
*/
class QuickSortSet {
//链表头
private Cell head;
//链表尾
private Cell tail;
//链表的实际大小
private int size;
QuickSortSet(){
this.size = 0;
}
/**
* 向链表集合添加词元
* @param lexeme
*/
boolean addLexeme(Lexeme lexeme){
Cell newCell = new Cell(lexeme);
if(this.size == 0){
this.head = newCell;
this.tail = newCell;
this.size++;
return true;
}else{
if(this.tail.compareTo(newCell) == 0){//词元与尾部词元相同,不放入集合
return false;
}else if(this.tail.compareTo(newCell) < 0){//词元接入链表尾部
this.tail.next = newCell;
newCell.prev = this.tail;
this.tail = newCell;
this.size++;
return true;
}else if(this.head.compareTo(newCell) > 0){//词元接入链表头部
this.head.prev = newCell;
newCell.next = this.head;
this.head = newCell;
this.size++;
return true;
}else{
//从尾部上逆
Cell index = this.tail;
while(index != null && index.compareTo(newCell) > 0){
index = index.prev;
}
if(index.compareTo(newCell) == 0){//词元与集合中的词元重复,不放入集合
return false;
}else if(index.compareTo(newCell) < 0){//词元插入链表中的某个位置
newCell.prev = index;
newCell.next = index.next;
index.next.prev = newCell;
index.next = newCell;
this.size++;
return true;
}
}
}
return false;
}
/**
* 返回链表头部元素
* @return
*/
Lexeme peekFirst(){
if(this.head != null){
return this.head.lexeme;
}
return null;
}
/**
* 取出链表集合的第一个元素
* @return Lexeme
*/
Lexeme pollFirst(){
if(this.size == 1){
Lexeme first = this.head.lexeme;
this.head = null;
this.tail = null;
this.size--;
return first;
}else if(this.size > 1){
Lexeme first = this.head.lexeme;
this.head = this.head.next;
this.size --;
return first;
}else{
return null;
}
}
/**
* 返回链表尾部元素
* @return
*/
Lexeme peekLast(){
if(this.tail != null){
return this.tail.lexeme;
}
return null;
}
/**
* 取出链表集合的最后一个元素
* @return Lexeme
*/
Lexeme pollLast(){
if(this.size == 1){
Lexeme last = this.head.lexeme;
this.head = null;
this.tail = null;
this.size--;
return last;
}else if(this.size > 1){
Lexeme last = this.tail.lexeme;
this.tail = this.tail.prev;
this.size--;
return last;
}else{
return null;
}
}
/**
* 返回集合大小
* @return
*/
int size(){
return this.size;
}
/**
* 判断集合是否为空
* @return
*/
boolean isEmpty(){
return this.size == 0;
}
/**
* 返回lexeme链的头部
* @return
*/
Cell getHead(){
return this.head;
}
/**
*
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
* QuickSortSet集合单元
*
*/
class Cell implements Comparable<Cell>{
private Cell prev;
private Cell next;
private Lexeme lexeme;
Cell(Lexeme lexeme){
if(lexeme == null){
throw new IllegalArgumentException("lexeme must not be null");
}
this.lexeme = lexeme;
}
public int compareTo(Cell o) {
return this.lexeme.compareTo(o.lexeme);
}
public Cell getPrev(){
return this.prev;
}
public Cell getNext(){
return this.next;
}
public Lexeme getLexeme(){
return this.lexeme;
}
}
}
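addLexeme keeps the linked list sorted on every insert (append at the tail, prepend at the head, or walk back from the tail to the insertion point) and silently rejects duplicates. A small sketch of that behaviour, again in the same package since the class is package-private:
<pre>
// Hypothetical demo, same package because QuickSortSet is package-private.
package org.wltea.analyzer.core;

public class QuickSortSetDemo {
    public static void main(String[] args) {
        QuickSortSet set = new QuickSortSet();
        set.addLexeme(new Lexeme(0, 3, 1, Lexeme.TYPE_CNCHAR));       // first element
        set.addLexeme(new Lexeme(0, 0, 2, Lexeme.TYPE_CNWORD));       // prepended at head
        boolean added = set.addLexeme(new Lexeme(0, 0, 2, Lexeme.TYPE_CNWORD));
        System.out.println(added);                      // false : duplicate rejected
        System.out.println(set.size());                 // 2
        System.out.println(set.pollFirst().getBegin()); // 0 : sorted order
    }
}
</pre>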
/**
*
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.dic;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
/**
 * 词典树分段,表示词典树的一个分枝
 */
class DictSegment implements Comparable<DictSegment>{
//公用字典表,存储汉字
private static final Map<Character , Character> charMap = new HashMap<Character , Character>(16 , 0.95f);
//数组大小上限
private static final int ARRAY_LENGTH_LIMIT = 3;
//Map存储结构
private Map<Character , DictSegment> childrenMap;
//数组方式存储结构
private DictSegment[] childrenArray;
//当前节点上存储的字符
private Character nodeChar;
//当前节点存储的Segment数目
//storeSize <=ARRAY_LENGTH_LIMIT ,使用数组存储, storeSize >ARRAY_LENGTH_LIMIT ,则使用Map存储
private int storeSize = 0;
//当前DictSegment状态 ,默认 0 , 1表示从根节点到当前节点的路径表示一个词
private int nodeState = 0;
DictSegment(Character nodeChar){
if(nodeChar == null){
throw new IllegalArgumentException("参数为空异常,字符不能为空");
}
this.nodeChar = nodeChar;
}
public int getDicNum(){
if(charMap!=null)
{
return charMap.size();
}
return 0;
}
Character getNodeChar() {
return nodeChar;
}
/*
* 判断是否有下一个节点
*/
boolean hasNextNode(){
return this.storeSize > 0;
}
......@@ -62,7 +78,7 @@ public class DictSegment {
* @param charArray
* @return Hit
*/
Hit match(char[] charArray){
return this.match(charArray , 0 , charArray.length , null);
}
......@@ -73,7 +89,7 @@ public class DictSegment {
* @param length
* @return Hit
*/
Hit match(char[] charArray , int begin , int length){
return this.match(charArray , begin , length , null);
}
......@@ -85,64 +101,64 @@ public class DictSegment {
* @param searchHit
* @return Hit
*/
Hit match(char[] charArray , int begin , int length , Hit searchHit){
if(searchHit == null){
//如果hit为空,新建
searchHit= new Hit();
//设置hit的起始文本位置
searchHit.setBegin(begin);
}else{
//否则要将HIT状态重置
searchHit.setUnmatch();
}
//设置hit的当前处理位置
searchHit.setEnd(begin);
Character keyChar = new Character(charArray[begin]);
DictSegment ds = null;
//引用实例变量为本地变量,避免查询时遇到更新的同步问题
DictSegment[] segmentArray = this.childrenArray;
Map<Character , DictSegment> segmentMap = this.childrenMap;
//STEP1 在节点中查找keyChar对应的DictSegment
if(segmentArray != null){
//在数组中查找
DictSegment keySegment = new DictSegment(keyChar);
int position = Arrays.binarySearch(segmentArray, 0 , this.storeSize , keySegment);
if(position >= 0){
ds = segmentArray[position];
}
}else if(segmentMap != null){
//在map中查找
ds = (DictSegment)segmentMap.get(keyChar);
}
//STEP2 找到DictSegment,判断词的匹配状态,是否继续递归,还是返回结果
if(ds != null){
if(length > 1){
//词未匹配完,继续往下搜索
return ds.match(charArray, begin + 1 , length - 1 , searchHit);
}else if (length == 1){
//搜索最后一个char
if(ds.nodeState == 1){
//添加HIT状态为完全匹配
searchHit.setMatch();
}
if(ds.hasNextNode()){
//添加HIT状态为前缀匹配
searchHit.setPrefix();
//记录当前位置的DictSegment
searchHit.setMatchedDictSegment(ds);
}
return searchHit;
}
}
//STEP3 没有找到DictSegment, 将HIT设置为不匹配
return searchHit;
}
......@@ -150,8 +166,16 @@ public class DictSegment {
* 加载填充词典片段
* @param charArray
*/
void fillSegment(char[] charArray){
this.fillSegment(charArray, 0 , charArray.length , 1);
}
/**
* 屏蔽词典中的一个词
* @param charArray
*/
void disableSegment(char[] charArray){
this.fillSegment(charArray, 0 , charArray.length , 0);
}
/**
......@@ -159,86 +183,90 @@ public class DictSegment {
* @param charArray
* @param begin
* @param length
* @param enabled
*/
private synchronized void fillSegment(char[] charArray , int begin , int length , int enabled){
//获取字典表中的汉字对象
Character beginChar = new Character(charArray[begin]);
Character keyChar = charMap.get(beginChar);
//字典中没有该字,则将其添加入字典
if(keyChar == null){
charMap.put(beginChar, beginChar);
keyChar = beginChar;
}
//搜索当前节点的存储,查询对应keyChar的keyChar,如果没有则创建
DictSegment ds = lookforSegment(keyChar , enabled);
if(ds != null){
//处理keyChar对应的segment
if(length > 1){
//词元还没有完全加入词典树
ds.fillSegment(charArray, begin + 1, length - 1 , enabled);
}else if (length == 1){
//已经是词元的最后一个char,设置当前节点状态为enabled,
//enabled=1表明一个完整的词,enabled=0表示从词典中屏蔽当前词
ds.nodeState = enabled;
}
}
}
/**
* 查找本节点下对应的keyChar的segment
* @param keyChar
* @param create =1如果没有找到,则创建新的segment ; =0如果没有找到,不创建,返回null
* @return
*/
private DictSegment lookforSegment(Character keyChar){
private DictSegment lookforSegment(Character keyChar , int create){
DictSegment ds = null;
if(this.storeSize <= ARRAY_LENGTH_LIMIT){
//获取数组容器,如果数组未创建则创建数组
DictSegment[] segmentArray = getChildrenArray();
//搜寻数组
DictSegment keySegment = new DictSegment(keyChar);
int position = Arrays.binarySearch(segmentArray, 0 , this.storeSize, keySegment);
if(position >= 0){
ds = segmentArray[position];
}
//遍历数组后没有找到对应的segment
if(ds == null && create == 1){
ds = keySegment;
if(this.storeSize < ARRAY_LENGTH_LIMIT){
//数组容量未满,使用数组存储
segmentArray[this.storeSize] = ds;
//segment数目+1
this.storeSize++;
Arrays.sort(segmentArray , 0 , this.storeSize);
}else{
//数组容量已满,切换Map存储
//获取Map容器,如果Map未创建,则创建Map
Map<Character , DictSegment> segmentMap = getChildrenMap();
//将数组中的segment迁移到Map中
migrate(segmentArray , segmentMap);
//存储新的segment
segmentMap.put(keyChar, ds);
//segment数目+1 , 必须在释放数组前执行storeSize++ , 确保极端情况下,不会取到空的数组
this.storeSize++;
//释放当前的数组引用
this.childrenArray = null;
}
}
}else{
//获取Map容器,如果Map未创建,则创建Map
Map<Character , DictSegment> segmentMap = getChildrenMap();
//搜索Map
ds = (DictSegment)segmentMap.get(keyChar);
if(ds == null && create == 1){
//构造新的segment
ds = new DictSegment(keyChar);
segmentMap.put(keyChar , ds);
//当前节点存储segment数目+1
this.storeSize ++;
}
}
......@@ -288,5 +316,23 @@ public class DictSegment {
}
}
}
/**
* 实现Comparable接口
* @param o
* @return int
*/
public int compareTo(DictSegment o) {
//对当前节点存储的char进行比较
return this.nodeChar.compareTo(o.nodeChar);
}
}
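Taken together: fillSegment builds the trie and sets nodeState=1 at word ends, disableSegment re-walks the same path writing nodeState=0 to mask a word, and match walks the children (binary search while a node stores at most ARRAY_LENGTH_LIMIT children in the array, the Map afterwards) and reports full and prefix matches through a Hit. A hedged usage sketch, same package since these members are package-private; the (char)0 root is a placeholder convention assumed here:
<pre>
// Hypothetical demo, same package because match/fillSegment are package-private.
package org.wltea.analyzer.dic;

public class DictSegmentDemo {
    public static void main(String[] args) {
        DictSegment root = new DictSegment((char) 0); // placeholder root node (assumption)
        root.fillSegment("中国".toCharArray());
        root.fillSegment("中国人".toCharArray());

        Hit hit = root.match("中国".toCharArray());
        System.out.println(hit.isMatch());   // true : "中国" is a complete word
        System.out.println(hit.isPrefix());  // true : it also prefixes "中国人"

        root.disableSegment("中国".toCharArray()); // word-end nodeState back to 0
        System.out.println(root.match("中国".toCharArray()).isMatch()); // false : masked
    }
}
</pre>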
......@@ -47,15 +47,15 @@ public class Dictionary {
logger = Loggers.getLogger("ik-analyzer");
}
public Configuration getConfig(){
return configuration;
}
public void Init(Settings indexSettings){
if(!dictInited){
environment =new Environment(indexSettings);
configuration=new Configuration(indexSettings);
loadMainDict();
loadSurnameDict();
loadQuantifierDict();
......@@ -71,16 +71,6 @@ public class Dictionary {
File file= new File(environment.configFile(), Dictionary.PATH_DIC_MAIN);
// logger.info("[Main Dict Loading] {}",file.getAbsolutePath());
// logger.info("[Environment] {}",environment.homeFile());
// logger.info("[Environment] {}",environment.workFile());
// logger.info("[Environment] {}",environment.workWithClusterFile());
// logger.info("[Environment] {}",environment.dataFiles());
// logger.info("[Environment] {}",environment.dataWithClusterFiles());
// logger.info("[Environment] {}",environment.configFile());
// logger.info("[Environment] {}",environment.pluginsFile());
// logger.info("[Environment] {}",environment.logsFile());
InputStream is = null;
try {
is = new FileInputStream(file);
......@@ -142,7 +132,7 @@ public class Dictionary {
if (theWord != null && !"".equals(theWord.trim())) {
_MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
}
} while (theWord != null);
logger.info("[Dict Loading] {},MainDict Size:{}",tempFile.toString(),_MainDict.getDicNum());
......
/**
*
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.dic;
/**
* 表示一次词典匹配的命中
*/
public class Hit {
//Hit不匹配
private static final int UNMATCH = 0x00000000;
//Hit完全匹配
private static final int MATCH = 0x00000001;
//Hit前缀匹配
private static final int PREFIX = 0x00000010;
//该HIT当前状态,默认未匹配
private int hitState = UNMATCH;
//记录词典匹配过程中,当前匹配到的词典分支节点
private DictSegment matchedDictSegment;
/*
* 词段开始位置
*/
private int begin;
/*
* 词段的结束位置
*/
private int end;
/**
* 判断是否完全匹配
*/
public boolean isMatch() {
return (this.hitState & MATCH) > 0;
}
......@@ -32,6 +63,9 @@ public class Hit {
this.hitState = this.hitState | MATCH;
}
/**
* 判断是否是词的前缀
*/
public boolean isPrefix() {
return (this.hitState & PREFIX) > 0;
}
......@@ -39,7 +73,9 @@ public class Hit {
public void setPrefix() {
this.hitState = this.hitState | PREFIX;
}
/**
* 判断是否是不匹配
*/
public boolean isUnmatch() {
return this.hitState == UNMATCH ;
}
......
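hitState is a small bit field: MATCH (0x01) and PREFIX (0x10) occupy different bits, so a single Hit can report a complete word that is also the prefix of a longer word. A standalone sketch of the flag arithmetic (constants copied from the class; the demo name is hypothetical):
<pre>
// Hypothetical demo of the Hit bit flags.
public class HitStateDemo {
    public static void main(String[] args) {
        final int UNMATCH = 0x00000000;
        final int MATCH   = 0x00000001;
        final int PREFIX  = 0x00000010;

        int hitState = UNMATCH;
        hitState |= MATCH;   // what setMatch() does
        hitState |= PREFIX;  // what setPrefix() does

        System.out.println((hitState & MATCH) > 0);  // true  : isMatch()
        System.out.println((hitState & PREFIX) > 0); // true  : isPrefix()
        System.out.println(hitState == UNMATCH);     // false : isUnmatch()
    }
}
</pre>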
......@@ -13,8 +13,9 @@ import java.io.Reader;
public final class IKAnalyzer extends Analyzer {
private boolean isMaxWordLength = false;
private boolean useSmart=false;
public IKAnalyzer(){
this(false);
}
......@@ -24,14 +25,19 @@ public final class IKAnalyzer extends Analyzer {
this.setMaxWordLength(isMaxWordLength);
}
public IKAnalyzer(Settings indexSetting,Settings settings1) {
super();
Dictionary.getInstance().Init(indexSetting);
if(settings1.get("use_smart", "true").equals("true")){
useSmart=true;
}
}
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
return new IKTokenizer(reader , useSmart);
}
public void setMaxWordLength(boolean isMaxWordLength) {
......
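The new two-argument constructor splits responsibilities: the index-level Settings initialise the shared Dictionary once, while the analyzer-level Settings select the segmentation mode; note the code falls back to "true" when the use_smart key is absent. A hedged construction sketch (the ImmutableSettings builder API is an assumption for this pre-1.0 elasticsearch era):
<pre>
// Hypothetical sketch; the settings-builder API is assumed, not confirmed here.
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerModeDemo {
    public static void main(String[] args) {
        Settings indexSettings = ImmutableSettings.settingsBuilder().build();
        Settings analyzerSettings = ImmutableSettings.settingsBuilder()
                .put("use_smart", "false") // false => fine-grained, true => smart mode
                .build();
        IKAnalyzer analyzer = new IKAnalyzer(indexSettings, analyzerSettings);
    }
}
</pre>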
/**
*
*/
package org.wltea.analyzer.lucene;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanClause.Occur;
import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;
public final class IKQueryParser {
private static ThreadLocal<Map<String , TokenBranch>> keywordCacheThreadLocal
= new ThreadLocal<Map<String , TokenBranch>>();
private static boolean isMaxWordLength = false;
public static void setMaxWordLength(boolean isMaxWordLength) {
IKQueryParser.isMaxWordLength = isMaxWordLength ;
}
private static Query optimizeQueries(List<Query> queries){
if(queries.size() == 0){
return null;
}else if(queries.size() == 1){
return queries.get(0);
}else{
BooleanQuery mustQueries = new BooleanQuery();
for(Query q : queries){
mustQueries.add(q, Occur.MUST);
}
return mustQueries;
}
}
private static Map<String , TokenBranch> getTheadLocalCache(){
Map<String , TokenBranch> keywordCache = keywordCacheThreadLocal.get();
if(keywordCache == null){
keywordCache = new HashMap<String , TokenBranch>(4);
keywordCacheThreadLocal.set(keywordCache);
}
return keywordCache;
}
private static TokenBranch getCachedTokenBranch(String query){
Map<String , TokenBranch> keywordCache = getTheadLocalCache();
return keywordCache.get(query);
}
private static void cachedTokenBranch(String query , TokenBranch tb){
Map<String , TokenBranch> keywordCache = getTheadLocalCache();
keywordCache.put(query, tb);
}
private static Query _parse(String field , String query) throws IOException{
if(field == null){
throw new IllegalArgumentException("parameter \"field\" is null");
}
if(query == null || "".equals(query.trim())){
return new TermQuery(new Term(field));
}
TokenBranch root = getCachedTokenBranch(query);
if(root != null){
return optimizeQueries(root.toQueries(field));
}else{
root = new TokenBranch(null);
StringReader input = new StringReader(query.trim());
IKSegmentation ikSeg = new IKSegmentation(input , isMaxWordLength);
for(Lexeme lexeme = ikSeg.next() ; lexeme != null ; lexeme = ikSeg.next()){
root.accept(lexeme);
}
cachedTokenBranch(query , root);
return optimizeQueries(root.toQueries(field));
}
}
public static Query parse(String field , String query) throws IOException{
if(field == null){
throw new IllegalArgumentException("parameter \"field\" is null");
}
String[] qParts = query.split("\\s");
if(qParts.length > 1){
BooleanQuery resultQuery = new BooleanQuery();
for(String q : qParts){
if("".equals(q)){
continue;
}
Query partQuery = _parse(field , q);
if(partQuery != null &&
(!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
resultQuery.add(partQuery, Occur.SHOULD);
}
}
return resultQuery;
}else{
return _parse(field , query);
}
}
public static Query parseMultiField(String[] fields , String query) throws IOException{
if(fields == null){
throw new IllegalArgumentException("parameter \"fields\" is null");
}
BooleanQuery resultQuery = new BooleanQuery();
for(String field : fields){
if(field != null){
Query partQuery = parse(field , query);
if(partQuery != null &&
(!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
resultQuery.add(partQuery, Occur.SHOULD);
}
}
}
return resultQuery;
}
public static Query parseMultiField(String[] fields , String query , BooleanClause.Occur[] flags) throws IOException{
if(fields == null){
throw new IllegalArgumentException("parameter \"fields\" is null");
}
if(flags == null){
throw new IllegalArgumentException("parameter \"flags\" is null");
}
if (flags.length != fields.length){
throw new IllegalArgumentException("flags.length != fields.length");
}
BooleanQuery resultQuery = new BooleanQuery();
for(int i = 0; i < fields.length; i++){
if(fields[i] != null){
Query partQuery = parse(fields[i] , query);
if(partQuery != null &&
(!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
resultQuery.add(partQuery, flags[i]);
}
}
}
return resultQuery;
}
public static Query parseMultiField(String[] fields , String[] queries) throws IOException{
if(fields == null){
throw new IllegalArgumentException("parameter \"fields\" is null");
}
if(queries == null){
throw new IllegalArgumentException("parameter \"queries\" is null");
}
if (queries.length != fields.length){
throw new IllegalArgumentException("queries.length != fields.length");
}
BooleanQuery resultQuery = new BooleanQuery();
for(int i = 0; i < fields.length; i++){
if(fields[i] != null){
Query partQuery = parse(fields[i] , queries[i]);
if(partQuery != null &&
(!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
resultQuery.add(partQuery, Occur.SHOULD);
}
}
}
return resultQuery;
}
public static Query parseMultiField(String[] fields , String[] queries , BooleanClause.Occur[] flags) throws IOException{
if(fields == null){
throw new IllegalArgumentException("parameter \"fields\" is null");
}
if(queries == null){
throw new IllegalArgumentException("parameter \"queries\" is null");
}
if(flags == null){
throw new IllegalArgumentException("parameter \"flags\" is null");
}
if (!(queries.length == fields.length && queries.length == flags.length)){
throw new IllegalArgumentException("queries, fields, and flags array have have different length");
}
BooleanQuery resultQuery = new BooleanQuery();
for(int i = 0; i < fields.length; i++){
if(fields[i] != null){
Query partQuery = parse(fields[i] , queries[i]);
if(partQuery != null &&
(!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
resultQuery.add(partQuery, flags[i]);
}
}
}
return resultQuery;
}
private static class TokenBranch{
private static final int REFUSED = -1;
private static final int ACCEPTED = 0;
private static final int TONEXT = 1;
private int leftBorder;
private int rightBorder;
private Lexeme lexeme;
private List<TokenBranch> acceptedBranchs;
private TokenBranch nextBranch;
TokenBranch(Lexeme lexeme){
if(lexeme != null){
this.lexeme = lexeme;
this.leftBorder = lexeme.getBeginPosition();
this.rightBorder = lexeme.getEndPosition();
}
}
public int getLeftBorder() {
return leftBorder;
}
public int getRightBorder() {
return rightBorder;
}
public Lexeme getLexeme() {
return lexeme;
}
public List<TokenBranch> getAcceptedBranchs() {
return acceptedBranchs;
}
public TokenBranch getNextBranch() {
return nextBranch;
}
public int hashCode(){
if(this.lexeme == null){
return 0;
}else{
return this.lexeme.hashCode() * 37;
}
}
public boolean equals(Object o){
if(o == null){
return false;
}
if(this == o){
return true;
}
if(o instanceof TokenBranch){
TokenBranch other = (TokenBranch)o;
if(this.lexeme == null ||
other.getLexeme() == null){
return false;
}else{
return this.lexeme.equals(other.getLexeme());
}
}else{
return false;
}
}
boolean accept(Lexeme _lexeme){
/*
* 检查新的lexeme 对当前的branch 的可接受类型
* acceptType : REFUSED 不能接受
* acceptType : ACCEPTED 接受
* acceptType : TONEXT 由相邻分支接受
*/
int acceptType = checkAccept(_lexeme);
switch(acceptType){
case REFUSED:
return false;
case ACCEPTED :
if(acceptedBranchs == null){
acceptedBranchs = new ArrayList<TokenBranch>(2);
acceptedBranchs.add(new TokenBranch(_lexeme));
}else{
boolean acceptedByChild = false;
for(TokenBranch childBranch : acceptedBranchs){
acceptedByChild = childBranch.accept(_lexeme) || acceptedByChild;
}
if(!acceptedByChild){
acceptedBranchs.add(new TokenBranch(_lexeme));
}
}
if(_lexeme.getEndPosition() > this.rightBorder){
this.rightBorder = _lexeme.getEndPosition();
}
break;
case TONEXT :
if(this.nextBranch == null){
this.nextBranch = new TokenBranch(null);
}
this.nextBranch.accept(_lexeme);
break;
}
return true;
}
List<Query> toQueries(String fieldName){
List<Query> queries = new ArrayList<Query>(1);
if(lexeme != null){
queries.add(new TermQuery(new Term(fieldName , lexeme.getLexemeText())));
}
if(acceptedBranchs != null && acceptedBranchs.size() > 0){
if(acceptedBranchs.size() == 1){
Query onlyOneQuery = optimizeQueries(acceptedBranchs.get(0).toQueries(fieldName));
if(onlyOneQuery != null){
queries.add(onlyOneQuery);
}
}else{
BooleanQuery orQuery = new BooleanQuery();
for(TokenBranch childBranch : acceptedBranchs){
Query childQuery = optimizeQueries(childBranch.toQueries(fieldName));
if(childQuery != null){
orQuery.add(childQuery, Occur.SHOULD);
}
}
if(orQuery.getClauses().length > 0){
queries.add(orQuery);
}
}
}
if(nextBranch != null){
queries.addAll(nextBranch.toQueries(fieldName));
}
return queries;
}
private int checkAccept(Lexeme _lexeme){
int acceptType = 0;
if(_lexeme == null){
throw new IllegalArgumentException("parameter:lexeme is null");
}
if(null == this.lexeme){
if(this.rightBorder > 0
&& _lexeme.getBeginPosition() >= this.rightBorder){
acceptType = TONEXT;
}else{
acceptType = ACCEPTED;
}
}else{
if(_lexeme.getBeginPosition() < this.lexeme.getBeginPosition()){
acceptType = REFUSED;
}else if(_lexeme.getBeginPosition() >= this.lexeme.getBeginPosition()
&& _lexeme.getBeginPosition() < this.lexeme.getEndPosition()){
acceptType = REFUSED;
}else if(_lexeme.getBeginPosition() >= this.lexeme.getEndPosition()
&& _lexeme.getBeginPosition() < this.rightBorder){
acceptType = ACCEPTED;
}else{
acceptType= TONEXT;
}
}
return acceptType;
}
}
}
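IKQueryParser first splits the query on whitespace into SHOULD parts, then segments each part with IKSegmentation and folds overlapping lexemes into a TokenBranch tree that is rendered as nested BooleanQuerys; parsed branches are memoised per thread. A hedged usage sketch (assumes the IK dictionaries can initialise in a standalone context):
<pre>
// Hypothetical demo of IKQueryParser; dictionary initialisation is assumed.
import org.apache.lucene.search.Query;
import org.wltea.analyzer.lucene.IKQueryParser;

public class QueryParserDemo {
    public static void main(String[] args) throws Exception {
        // two whitespace-separated keywords => SHOULD of two sub-queries, each an
        // AND/OR tree over the IK segmentation of that keyword
        Query q = IKQueryParser.parse("title", "中文分词 检索");
        System.out.println(q);
    }
}
</pre>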
/**
*
*/
package org.wltea.analyzer.lucene;
import org.apache.lucene.search.DefaultSimilarity;
public class IKSimilarity extends DefaultSimilarity {
private static final long serialVersionUID = 7558565500061194774L;
public float coord(int overlap, int maxOverlap) {
float overlap2 = (float)Math.pow(2, overlap);
float maxOverlap2 = (float)Math.pow(2, maxOverlap);
return (overlap2 / maxOverlap2);
}
}
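The override replaces Lucene's linear coordination factor overlap/maxOverlap with 2^overlap / 2^maxOverlap, so missing query terms are punished exponentially: matching 2 of 4 terms scores 0.25 instead of DefaultSimilarity's 0.5. A tiny sketch (demo class name hypothetical):
<pre>
// Hypothetical demo of the exponential coord factor.
import org.wltea.analyzer.lucene.IKSimilarity;

public class CoordDemo {
    public static void main(String[] args) {
        IKSimilarity sim = new IKSimilarity();
        System.out.println(sim.coord(2, 4)); // 0.25 = 2^2 / 2^4
        System.out.println(sim.coord(4, 4)); // 1.0  : all terms matched
    }
}
</pre>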
///**
// * IK 中文分词 版本 5.0
// * IK Analyzer release 5.0
// *
// * Licensed to the Apache Software Foundation (ASF) under one or more
// * contributor license agreements. See the NOTICE file distributed with
// * this work for additional information regarding copyright ownership.
// * The ASF licenses this file to You under the Apache License, Version 2.0
// * (the "License"); you may not use this file except in compliance with
// * the License. You may obtain a copy of the License at
// *
// * http://www.apache.org/licenses/LICENSE-2.0
// *
// * Unless required by applicable law or agreed to in writing, software
// * distributed under the License is distributed on an "AS IS" BASIS,
// * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// * See the License for the specific language governing permissions and
// * limitations under the License.
// *
// * 源代码由林良益(linliangyi2005@gmail.com)提供
// * 版权声明 2012,乌龙茶工作室
// * provided by Linliangyi and copyright 2012 by Oolong studio
// *
// */
//package org.wltea.analyzer.query;
//
//import java.io.IOException;
//import java.io.StringReader;
//import java.util.ArrayList;
//import java.util.List;
//
//import org.apache.lucene.analysis.standard.StandardAnalyzer;
//import org.apache.lucene.queryparser.classic.ParseException;
//import org.apache.lucene.queryparser.classic.QueryParser;
//import org.apache.lucene.search.Query;
//import org.apache.lucene.util.Version;
//import org.wltea.analyzer.core.IKSegmenter;
//import org.wltea.analyzer.core.Lexeme;
//
///**
// * Single Word Multi Char Query Builder
// * IK分词算法专用
// * @author linliangyi
// *
// */
//public class SWMCQueryBuilder {
//
// /**
// * 生成SWMCQuery
// * @param fieldName
// * @param keywords
// * @param quickMode
// * @return Lucene Query
// */
// public static Query create(String fieldName ,String keywords , boolean quickMode){
// if(fieldName == null || keywords == null){
// throw new IllegalArgumentException("参数 fieldName 、 keywords 不能为null.");
// }
// //1.对keywords进行分词处理
// List<Lexeme> lexemes = doAnalyze(keywords);
// //2.根据分词结果,生成SWMCQuery
// Query _SWMCQuery = getSWMCQuery(fieldName , lexemes , quickMode);
// return _SWMCQuery;
// }
//
// /**
// * 分词切分,并返回结果链表
// * @param keywords
// * @return
// */
// private static List<Lexeme> doAnalyze(String keywords){
// List<Lexeme> lexemes = new ArrayList<Lexeme>();
// IKSegmenter ikSeg = new IKSegmenter(new StringReader(keywords) , true);
// try{
// Lexeme l = null;
// while( (l = ikSeg.next()) != null){
// lexemes.add(l);
// }
// }catch(IOException e){
// e.printStackTrace();
// }
// return lexemes;
// }
//
//
// /**
// * 根据分词结果生成SWMC搜索
// * @param fieldName
// * @param pathOption
// * @param quickMode
// * @return
// */
// private static Query getSWMCQuery(String fieldName , List<Lexeme> lexemes , boolean quickMode){
// //构造SWMC的查询表达式
// StringBuffer keywordBuffer = new StringBuffer();
// //精简的SWMC的查询表达式
// StringBuffer keywordBuffer_Short = new StringBuffer();
// //记录最后词元长度
// int lastLexemeLength = 0;
// //记录最后词元结束位置
// int lastLexemeEnd = -1;
//
// int shortCount = 0;
// int totalCount = 0;
// for(Lexeme l : lexemes){
// totalCount += l.getLength();
// //精简表达式
// if(l.getLength() > 1){
// keywordBuffer_Short.append(' ').append(l.getLexemeText());
// shortCount += l.getLength();
// }
//
// if(lastLexemeLength == 0){
// keywordBuffer.append(l.getLexemeText());
// }else if(lastLexemeLength == 1 && l.getLength() == 1
// && lastLexemeEnd == l.getBeginPosition()){//单字位置相邻,长度为一,合并)
// keywordBuffer.append(l.getLexemeText());
// }else{
// keywordBuffer.append(' ').append(l.getLexemeText());
//
// }
// lastLexemeLength = l.getLength();
// lastLexemeEnd = l.getEndPosition();
// }
//
// //借助lucene queryparser 生成SWMC Query
// QueryParser qp = new QueryParser(Version.LUCENE_40, fieldName, new StandardAnalyzer(Version.LUCENE_40));
// qp.setDefaultOperator(QueryParser.AND_OPERATOR);
// qp.setAutoGeneratePhraseQueries(true);
//
// if(quickMode && (shortCount * 1.0f / totalCount) > 0.5f){
// try {
// //System.out.println(keywordBuffer.toString());
// Query q = qp.parse(keywordBuffer_Short.toString());
// return q;
// } catch (ParseException e) {
// e.printStackTrace();
// }
//
// }else{
// if(keywordBuffer.length() > 0){
// try {
// //System.out.println(keywordBuffer.toString());
// Query q = qp.parse(keywordBuffer.toString());
// return q;
// } catch (ParseException e) {
// e.printStackTrace();
// }
// }
// }
// return null;
// }
//}