Assignment 3: Maximum Matching Word Segmentation

Date: 2018-09-09

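Assignment 3: Chinese Word Segmentation

Chinese word segmentation: at first I had no idea how to start; I only got it working after reading some references and asking classmates.

My approach: the method used here is maximum matching. First prepare a word list, cizu.txt, and the text to be segmented, input.txt. Scan the sentence from left to right and compare the candidate words at the current position, from longest to shortest, against the words in cizu.txt; the first candidate that matches is output as a word. Each output word is therefore the longest one possible relative to the given, fixed word list. If none of the multi-character candidates at a position matches any word in the list, the single character is output as the segmentation result.

Put the file to be segmented at d:\output\input.txt and run the program. Afterwards the folder d:\output contains a new file, output.txt, which is the segmented output for input.txt.

Algorithm: the text is first split into sentences at punctuation marks and other natural delimiters, and each sentence is then segmented by forward maximum matching with a maximum word length of MAX_CWORD_LEN=18 bytes (9 Chinese characters). When splitting into sentences, the key step is deciding where the string should be broken so that each piece is an independent sentence. Text that mixes English and Chinese is taken into account here.

Data structure: the dictionary is stored in a two-dimensional array of pointers. Each word is placed by two-dimensional indexing on its first byte and its last byte, and words that share the same first and last byte are linked together through pointers. This layout greatly speeds up dictionary lookup: when matching a string, its first and last bytes locate the entry directly, or after following a few levels of pointers. The input text and the output text are each held in a one-dimensional array, which needs more memory but avoids repeated file I/O.

Files: the test file input.txt, the output file output.txt, and a dictionary file, ziguang.txt, which lists the words that can follow each character, for example 人名, 人民, 人生, and so on.

Part of the source code is shown below:

// Words that are not indexed
const char *arrayEnglishStop[] = {
    "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
    "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
    "1", "2", "3", "4", "5", "6", "7", "8", "9", "0",
    "about", "above", "after", "again", "all", "also", "am", "an", "and",
    "any", "are", "as", "at", "back", "be", "been", "before", "behind",
    "being", "below", "but", "by", "can", "click", "do", "does", "done",
    "each", "else", "etc", "ever", "every", "few", "for", "from",
    "generally", "get", "go", "gone", "has", "have", "hello", "here",
    "how", "if", "in", "into", "is", "just", "keep", "later", "let",
    "like", "lot", "lots", "made", "make", "makes", "many", "may", "me",
    "more", "most", "much", "must", "my", "need", "no", "not", "now",
    "of", "often", "on", "only", "or", "other", "others", "our", "out",
    "over", "please", "put", "so", "some", "such", "than", "that", "the",
    "their", "them", "then", "there", "these", "they", "this", "try",
    "to", "up", "us", "very", "want", "was", "we", "well", "what",
    "when", "where", "which", "why", "will", "with", "within", "you",
    "your", "yourself"};

// Characters and words that are skipped when indexing the dictionary
const char *arrayChineseStop[] = {
    "的", "吗", "么", "啊", "说", "对", "在", "和", "是", "被", "最", "所", "那",
    "这", "有", "将", "会", "与", "於", "于", "他", "她", "它", "您", "为", "欢迎"};

// ASCII punctuation and Chinese punctuation. Note the three symbols + - ":
// searches use them to express conditions such as exclusive-or.
char arrayAsciiSymbol[] = {'!', '\\', '*', '(', ')', '-', '_', '+', '=', '{', '}',
    '[', ']', ':', ';', '\'', '"', ',', '<', '>',
(The available preview ends here, partway through this array.)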
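
The forward maximum matching procedure described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the author's program: the four-word toy_dict, the helper names in_dict, char_len and fmm_segment, the '/' separator, and the UTF-8 input encoding are all made up for the example. (The 18-byte limit for 9 characters in the text suggests the original worked with a two-byte encoding instead.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CWORD_LEN 18 /* longest candidate, in bytes, as in the text */

/* Hypothetical toy word list; a real run would load it from a file. */
static const char *toy_dict[] = {"研究", "研究生", "生命", "起源"};

/* Return 1 if the n bytes at s are a word in toy_dict (linear scan). */
static int in_dict(const char *s, size_t n) {
    size_t count = sizeof toy_dict / sizeof toy_dict[0];
    for (size_t i = 0; i < count; i++) {
        if (strlen(toy_dict[i]) == n && memcmp(toy_dict[i], s, n) == 0)
            return 1;
    }
    return 0;
}

/* Byte length of the character starting at s, assuming UTF-8 input. */
static size_t char_len(const char *s) {
    unsigned char c = (unsigned char)s[0];
    size_t n = c >= 0xF0 ? 4 : c >= 0xE0 ? 3 : c >= 0xC0 ? 2 : 1;
    for (size_t i = 1; i < n; i++) {
        if (s[i] == '\0') return i; /* malformed tail: stop at the terminator */
    }
    return n;
}

/* Segment sentence by forward maximum matching, writing the words into out
   separated by '/'. Returns the word count, or -1 if out is too small. */
int fmm_segment(const char *sentence, char *out, size_t cap) {
    assert(sentence != NULL && out != NULL);
    size_t len = strlen(sentence), pos = 0, used = 0;
    int words = 0;
    if (cap == 0) return -1;
    while (pos < len) {
        size_t single = char_len(sentence + pos);
        size_t take = len - pos;
        if (take > MAX_CWORD_LEN) take = MAX_CWORD_LEN;
        /* Try the longest candidate first and shrink until the dictionary
           matches; if nothing matches, fall back to a single character. */
        while (take > single && !in_dict(sentence + pos, take)) take--;
        if (used + take + 2 > cap) return -1; /* room for '/' and '\0' */
        if (words > 0) out[used++] = '/';
        memcpy(out + used, sentence + pos, take);
        used += take;
        pos += take;
        words++;
    }
    out[used] = '\0';
    return words;
}
```

With this toy list, fmm_segment("研究生命起源", buf, sizeof buf) fills buf with 研究生/命/起源. The greedy choice of 研究生 leaves 命 without a match, so it is emitted as a single character, which is the fallback the text describes.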
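
The sentence-splitting step mentioned in the algorithm description could look roughly like the sketch below. The delimiter sets, the function names delim_len and next_sentence, and the UTF-8 encoding are assumptions for illustration; the author's own rules for deciding where a sentence breaks are not shown in the available text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed delimiter sets; the author's actual lists are not shown in full. */
static const char ascii_delims[] = ".,;:!?\r\n";
static const char *cn_delims[] = {"。", ",", "!", "?", ";"};

/* Length in bytes of the delimiter starting at s, or 0 if there is none. */
static size_t delim_len(const char *s) {
    if (*s == '\0') return 0;
    if (strchr(ascii_delims, *s) != NULL) return 1;
    for (size_t i = 0; i < sizeof cn_delims / sizeof cn_delims[0]; i++) {
        size_t n = strlen(cn_delims[i]);
        if (strncmp(s, cn_delims[i], n) == 0) return n;
    }
    return 0;
}

/* Find the next sentence at or after *pos (which must not exceed the text
   length). On success store its start and byte length, advance *pos past
   it, and return 1; return 0 when the end of the text is reached. */
int next_sentence(const char *text, size_t *pos,
                  const char **start, size_t *len) {
    assert(text != NULL && pos != NULL && start != NULL && len != NULL);
    const char *p = text + *pos;
    size_t d;
    while ((d = delim_len(p)) > 0) p += d; /* skip leading delimiters */
    if (*p == '\0') {
        *pos = (size_t)(p - text);
        return 0;
    }
    *start = p;
    while (*p != '\0' && delim_len(p) == 0) p++; /* scan to next delimiter */
    *len = (size_t)(p - *start);
    *pos = (size_t)(p - text);
    return 1;
}
```

Calling next_sentence in a loop with pos starting at 0 walks through the pieces of a mixed string such as "今天天气好,我们去玩。OK!" one at a time; each piece can then be handed to the segmenter. Scanning byte by byte is safe here because a UTF-8 delimiter can only match at the start of a character.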
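
The dictionary layout described under "Data structure" can be illustrated with the sketch below: a two-dimensional array of list heads indexed by a word's first and last byte, with words that share both bytes chained together. The 256 x 256 table size and the names WordNode, dict_table, dict_insert and dict_contains are assumptions; the text does not give the author's actual declarations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One dictionary entry; entries with the same first and last byte are chained. */
typedef struct WordNode {
    char *word;
    struct WordNode *next;
} WordNode;

/* Assumed size: one bucket per (first byte, last byte) pair. Static storage
   is zero-initialised, so every bucket starts out empty. */
static WordNode *dict_table[256][256];

/* Add a copy of word to its bucket. Returns 1 on success, 0 on failure. */
int dict_insert(const char *word) {
    assert(word != NULL);
    size_t n = strlen(word);
    if (n == 0) return 0;
    unsigned char first = (unsigned char)word[0];
    unsigned char last = (unsigned char)word[n - 1];
    WordNode *node = malloc(sizeof *node);
    if (node == NULL) return 0;
    node->word = malloc(n + 1);
    if (node->word == NULL) {
        free(node);
        return 0;
    }
    memcpy(node->word, word, n + 1);
    node->next = dict_table[first][last]; /* push onto the front of the chain */
    dict_table[first][last] = node;
    return 1;
}

/* Return 1 if the n bytes at s form a word in the dictionary, else 0. */
int dict_contains(const char *s, size_t n) {
    assert(s != NULL);
    if (n == 0) return 0;
    const WordNode *p =
        dict_table[(unsigned char)s[0]][(unsigned char)s[n - 1]];
    for (; p != NULL; p = p->next) {
        if (strlen(p->word) == n && memcmp(p->word, s, n) == 0) return 1;
    }
    return 0;
}
```

Because dict_contains takes a pointer and a byte count, a segmenter can test a candidate directly inside the sentence buffer without copying it. Only the words in one short chain are compared, which is where the speed-up over scanning the whole word list comes from.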
