Journal of Software:2017.28(2):262-277

Unsupervised Structuralization Method of Merchandise Attributes in Chinese
HOU Bo-Yi,CHEN Qun,YANG Jing-Ying,LI Zhan-Huai
(School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710129, China)
Received:August 15, 2015    Revised:December 02, 2015
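> Chinese abstract:Extracting structured attribute information from unstructured merchandise description text is important for e-commerce functions such as product comparison, recommendation, and user demand forecasting. Most existing structuralization methods use supervised or semi-supervised classification to extract attribute values and attribute names, analyze the grammatical dependency between values and names with a parser, and match values to names according to association rules. These methods have two shortcomings: (1) they require manual labeling of some attribute values, attribute names, and the correspondences between them; (2) the accuracy of value-name matching is severely constrained by language habits, sentence logic, the corpus, and the quality of the candidate attribute-name set. This paper proposes an unsupervised method for structuralizing Chinese merchandise attributes. With the help of a search engine, the method extracts attribute values and attribute names by analyzing grammatical relations based on the small-probability-event principle. It also proposes a relative deselect conditional probability field and uses the PageRank algorithm to compute the pairing probability between attribute values and attribute names. The method incurs no manual labeling cost, and it automatically extracts attribute values and matches them to attribute names whether or not the merchandise description explicitly contains the corresponding attribute name. Experiments were conducted on Chinese descriptions of 4 categories of merchandise, using real corpus data from the Baidu search engine. For automatic generation of candidate attribute names, the proposed method (searching for the attribute value with a search engine and extracting common nouns from the search results that contain the value) improves recall by more than 20% over the method that extracts common nouns only from the description sentence. For non-quantitative attributes, the proposed value-name matching method based on the relative deselect conditional probability field improves Rank-1 accuracy by more than 30% and average MRR by more than 0.3 over the dependency-association-based method.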
Abstract:Extracting attribute names and values from textual product descriptions is important for many e-business applications, such as user demand forecasting, product comparison, and product recommendation. Existing approaches first use supervised or semi-supervised classification techniques to extract attribute names and values, and then match them by analyzing their grammatical dependency. However, these methods have the following limitations: (1) they require human intervention to label some attributes, values, and the matching relationships between them; (2) the matching accuracy may be greatly affected by language habits, semantic logic, and the quality of the corpus and the candidate sets. To address these issues, this paper proposes an unsupervised approach for extracting and matching attribute names and values in Chinese textual merchandise descriptions. Taking advantage of a search engine, it extracts the candidate set of attribute names with respect to a value by analyzing grammatical relations based on the principle of small probability events. A new algorithm for computing the matching probability between attribute names and values is also designed, based on relative conditional deselect probability and PageRank. The proposed approach can extract attribute names and values from Chinese textual merchandise descriptions and match them without any human intervention, whether or not the attribute name appears in the textual description. Finally, the performance of the proposed approach is evaluated on the textual descriptions of 4 types of merchandise using the Baidu search engine. The experimental results show that the new approach for attribute name extraction improves recall by more than 20% compared with extracting attribute names directly from the textual descriptions. Moreover, for non-quantitative attributes, the new approach achieves considerably higher matching accuracy than the existing techniques based on grammatical dependency analysis (more than 30% higher in rank-1 accuracy and more than 0.3 higher in MRR).
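The two matching metrics reported above, rank-1 accuracy and MRR (mean reciprocal rank), can be illustrated with a minimal sketch. It assumes each attribute value comes with a ranked list of candidate attribute names and one correct name; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def reciprocal_rank(ranked_names: List[str], gold_name: str) -> float:
    """Return 1/position of the correct attribute name, or 0.0 if it is absent."""
    for position, name in enumerate(ranked_names, start=1):
        if name == gold_name:
            return 1.0 / position
    return 0.0


def evaluate(rankings: Dict[str, List[str]], gold: Dict[str, str]) -> Dict[str, float]:
    """Compute rank-1 accuracy and MRR over every attribute value in `gold`."""
    if not gold:
        raise ValueError("gold must contain at least one attribute value")
    # A value with no ranking at all contributes a reciprocal rank of 0.0.
    rr = [reciprocal_rank(rankings.get(value, []), name) for value, name in gold.items()]
    return {
        "rank1_accuracy": sum(1 for r in rr if r == 1.0) / len(rr),
        "mrr": sum(rr) / len(rr),
    }


# Hypothetical toy data: attribute value -> ranked candidate attribute names.
rankings = {
    "red": ["color", "material", "style"],
    "cotton": ["style", "material", "color"],
    "slim": ["size", "color", "style", "fit"],
    "Nike": ["style", "brand"],
}
# Hypothetical ground truth: attribute value -> correct attribute name.
gold = {"red": "color", "cotton": "material", "slim": "fit", "Nike": "brand"}

scores = evaluate(rankings, gold)
print(scores)
```

In this toy run only "red" has its correct name ranked first, so rank-1 accuracy is 1/4, while MRR still credits the correct names that appear lower in the other lists.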
Foundation items:National Program on Key Basic Research Project of China (973) (2012CB316203); National Natural Science Foundation of China (61332006, 61472321); Northwestern Polytechnical University Foundation for Fundamental Research (3102014JSJ0013, 3102014JSJ0005)
Reference text:
HOU Bo-Yi,CHEN Qun,YANG Jing-Ying,LI Zhan-Huai.Unsupervised Structuralization Method of Merchandise Attributes in Chinese.Journal of Software,2017,28(2):262-277