Title: Automatic Extraction of Information Blocks Using PAT Trees
Authors: Chang, Chia-Hui; Hsu, Chun-Nan
Journal/Conference: 1999 NCS Conference
Abstract: Information extraction from semi-structured Web documents is a critical issue for software agents on the Internet. Previous work on wrapper induction aims to solve this problem by applying machine learning to generate extractors automatically, but this approach still requires human intervention to provide training examples. In this paper, we present a novel approach that extracts information blocks without training examples, using a data structure called a PAT tree. PAT trees allow the system to efficiently recognize repeated patterns in a semi-structured Web page. From these repeated patterns, information blocks can be located easily using domain-independent selection criteria. The entire system runs automatically, without any human intervention. Experimental results show that our approach performs well, with a recall rate near 90 percent on a wide range of output pages from popular search engines.
Date: 2006-11-13
Category: 1999 NCS (National Computer Symposium)
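
Illustration (not part of the paper): the abstract states that repeated patterns in a page are recognized with a PAT tree, but gives no implementation details. The minimal sketch below shows the general idea under stated assumptions. It substitutes a plainly sorted list of suffixes for the PAT tree, which finds the same repeats but far less efficiently, and it assumes the page is first reduced to a sequence of tag tokens. The function names `tokenize`, `repeats_of_length`, and `longest_repeats`, and the `min_count` parameter, are hypothetical and do not come from the paper.

```python
import re
from itertools import groupby


def tokenize(html):
    """Reduce a page to tag-name tokens; every text run becomes 'TEXT'.

    This encoding is an assumption made for the sketch.
    """
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            m = re.match(r"<\s*(/?\w+)", tag)
            if m:
                tokens.append("<" + m.group(1).upper() + ">")
        elif text.strip():
            tokens.append("TEXT")
    return tokens


def repeats_of_length(tokens, order, k, min_count):
    """Return {pattern: start positions} for length-k patterns seen often enough.

    `order` lists suffix start positions in sorted order, so suffixes that
    share their first k tokens are adjacent and can be grouped directly.
    """
    long_enough = [i for i in order if i + k <= len(tokens)]
    found = {}
    for pattern, group in groupby(long_enough, key=lambda i: tuple(tokens[i:i + k])):
        starts = sorted(group)
        if len(starts) >= min_count:
            found[pattern] = starts
    return found


def longest_repeats(tokens, min_count=2):
    """Grow the pattern length until nothing repeats min_count times."""
    # Stand-in for the PAT tree: sort all suffixes (simple but slow).
    order = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    best, k = {}, 1
    while True:
        found = repeats_of_length(tokens, order, k, min_count)
        if not found:
            return best
        best, k = found, k + 1


if __name__ == "__main__":
    page = ("<ul><li><a href='x'>A</a></li><li><a href='y'>B</a></li>"
            "<li><a href='z'>C</a></li></ul>")
    for pattern, starts in longest_repeats(tokenize(page), min_count=3).items():
        print(" ".join(pattern), "at token positions", starts)
```

Occurrences may overlap in this sketch, and no selection criteria are applied to the candidates; the domain-independent criteria the abstract mentions are not specified there and are not reproduced here.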

Files in this item:
File                      Description  Size       Format
ce07ncs001999000117.pdf                852.95 kB  Adobe PDF

