Method of Web Page Text Extraction Based on Text Feature and Page Structure

Lulu HU; Xiaoqin LIU; Kai SUN

doi:10.3969/j.issn.1673-6141.2017.03.009

[1] Liu L, Pu C. XWRAP: an XML-enabled wrapper construction system for Web information sources [C]//Proceedings of the 16th IEEE International Conference on Data Engineering, 2000: 611-620.

[2] Ma Ling, Goharian N, Chowdhury A, et al. Extracting unstructured data from template generated Web documents [C]//Proceedings of the 12th International Conference on Information and Knowledge Management, 2003: 512-515.

[3] Mei Xue, Cheng Xueqi, Guo Yan, et al. Fully automatic wrapper generation for web information extraction [J]. Journal of Chinese Information Processing, 2008, 22(1): 22-29 (in Chinese).

[4] Sun Chengjie, Guan Yi. A statistical approach for content extraction from web page [J]. Journal of Chinese Information Processing, 2004, 18(5): 17-22 (in Chinese).

[5] Sun Hao, Dong Shoubin. Adaptive approach for content extraction based on tag density [J]. Journal of Zhengzhou University, 2009, 41(1): 44-47 (in Chinese).

[6] An Zengwen, Wang Chao, Xu Jiefeng. An approach based on machine learning for information extraction method [J]. Microcomputer & Its Applications, 2010(12): 4-6 (in Chinese).

[7] You Guirong, Lu Yuchang. Extraction of topical information from Chinese web page based on the statistic and machine learning [J]. Journal of Fujian Commercial College, 2009, 4(2): 68-72 (in Chinese).
