N-gram: a language independent approach to IR and NLP
P Majumder, M Mitra, B.B. Chaudhuri
Computer Vision and Pattern Recognition Unit
Indian Statistical Institute, Kolkata
mandar@isical.ac.in
Abstract
With the increasingly widespread use of computers and the Internet in India, large amounts of information in Indian languages are becoming available on the web. Automatic information processing and retrieval is therefore becoming an urgent need in the Indian context. Moreover, since India is a multilingual country, any effective approach to IR in the Indian context needs to be capable of handling a multilingual collection of documents. In this paper, we discuss the N-gram approach to developing some basic tools in the area of IR and NLP. This approach is statistical and language independent in nature, and therefore eminently suited to the multilingual Indian context. We first present a brief survey of some language-processing applications in which N-grams have been successfully used. We also present the results of some preliminary experiments on using N-grams for identifying the language of an Indian language document, based on a method proposed by Cavnar et al. [1].
http://www.unl.fi.upm.es/consorcio/archivos/publicaciones/goa/paper15.pdf
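For readers unfamiliar with N-gram language identification, here is a minimal sketch in the spirit of the Cavnar-style method the abstract mentions: build a ranked profile of the most frequent character N-grams for each language, then pick the language whose profile has the smallest "out-of-place" rank distance from the document's profile. The function names, the underscore padding, and the parameters `n_max` and `top_k` are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..n_max), ranked."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # assumed padding to mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify_language(text, training_texts):
    """Return the key of training_texts whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in training_texts.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (hypothetical); real profiles need much larger samples.
samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso",
}
print(identify_language("the quick brown fox jumps over the lazy dog", samples))
```

Because the sketch works on characters rather than words, it needs no tokenizer, stemmer, or dictionary for any particular language, which is the property that makes the approach attractive for a multilingual collection.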
2. Introducing the Sequence Model for Text Retrieval
Yih-Kuen Tsay and Yu-Fang Chen
Department of Information Management, National Taiwan University
Abstract.
We propose and explore a novel approach, called the sequence model, to text retrieval. The model differs from classical ones in the extent to which positional information of term occurrences is used for relevance judgment. In the sequence model, documents and queries are viewed as sequences of term-position pairs and the relevance of a document to a query is judged by the similarity between their respective representative sequences. We suggest three primitive measures of sequence similarity, each capturing a distinct aspect of resemblance between two sequences. These similarity measures can be combined in various ways to suit different information needs. We have developed a prototype system with the sequence model as its core. Experimental results show that our sequence-based approach is often more effective than appearance-based approaches.
http://www.csie.ntu.edu.tw/~ciet/form/paper/2.pdf
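To make the idea of term-position pairs concrete, here is a small sketch. The abstract does not name the paper's three similarity measures, so `order_similarity` below is a purely hypothetical stand-in (a longest-common-subsequence ratio over terms) and should not be read as one of the authors' measures. Whitespace tokenization and both function names are also assumptions.

```python
def term_positions(text):
    """Represent a text as a sequence of (term, position) pairs."""
    return [(term, pos) for pos, term in enumerate(text.lower().split())]


def order_similarity(query_seq, doc_seq):
    """Hypothetical measure: fraction of query terms that occur in the
    document in the same relative order (longest common subsequence)."""
    q = [term for term, _ in query_seq]
    d = [term for term, _ in doc_seq]
    if not q:
        return 0.0
    prev = [0] * (len(d) + 1)  # LCS lengths for the previous query term
    for q_term in q:
        cur = [0]
        for j, d_term in enumerate(d, start=1):
            if q_term == d_term:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(q)


query = term_positions("text retrieval model")
doc = term_positions("a sequence model for text retrieval")
print(order_similarity(query, doc))
```

An appearance-based approach would treat this query and document as a perfect match on all three terms, whereas an order-sensitive measure like this one gives a lower score because "model" precedes "text retrieval" in the document.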
Wednesday, May 13, 2009