Wednesday, May 13, 2009

Getting to the passage level

N-gram: a language independent approach to IR and NLP
P Majumder, M Mitra, B.B. Chaudhuri
Computer vision and pattern recognition Unit
Indian Statistical Institute, Kolkata
mandar@isical.ac.in

Abstract
With the increasingly widespread use of computers and the Internet in India, large amounts of information in Indian languages are becoming available on the web. Automatic information processing and retrieval is therefore becoming an urgent need in the Indian context. Moreover, since India is a multilingual country, any effective approach to IR in the Indian context needs to be capable of handling a multilingual collection of documents. In this paper, we discuss the N-gram approach to developing some basic tools in the area of IR and NLP. This approach is statistical and language independent in nature, and therefore eminently suited to the multilingual Indian context. We first present a brief survey of some language-processing applications in which N-grams have been successfully used. We also present the results of some preliminary experiments on using N-grams for identifying the language of an Indian language document, based on a method proposed by Cavnar et al. [1].

http://www.unl.fi.upm.es/consorcio/archivos/publicaciones/goa/paper15.pdf
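
A note to self on how the language identification mentioned in the abstract might work. This is only a rough sketch of the rank-order ("out-of-place") comparison of character N-gram profiles usually attributed to Cavnar and Trenkle, not the code used in the paper. The function names, the underscore padding, the n_max and top_k defaults, and the toy training strings are all my own assumptions.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k character n-grams (lengths 1..n_max), most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # assumed padding marker for word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by decreasing count; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify_language(text, lang_profiles):
    """Pick the language whose profile is closest to the profile of text."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training data (hypothetical); real profiles need far more text per language.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el zorro salta sobre el perro perezoso y el gato duerme",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(identify_language(samples["english"], profiles))
```

Nothing here depends on a tokenizer, stemmer, or dictionary for a particular language, which is presumably why the approach appeals for a multilingual collection.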

2. Introducing the Sequence Model for Text Retrieval

Yih-Kuen Tsay and Yu-Fang Chen
Department of Information Management, National Taiwan University

Abstract.
We propose and explore a novel approach, called the sequence model, to text retrieval. The model differs from classical ones in the extent of how positional information of term occurrences is used for relevance judgment. In the sequence model, documents and queries are viewed as sequences of term-position pairs and the relevance of a document to a query is judged by the similarity between their respective representative sequences. We suggest three primitive measures of sequence similarity, each capturing a distinct aspect of resemblance between two sequences. These similarity measures can be combined in various ways to suit different information needs. We have developed a prototype system with the sequence model as its core. Experimental results show that our sequence-based approach is often more effective than appearance-based approaches.

http://www.csie.ntu.edu.tw/~ciet/form/paper/2.pdf
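
To make the "term-position pairs" idea concrete for myself, here is a small sketch. The abstract does not say what the three similarity measures are, so the order-sensitive score below (longest common subsequence of terms, divided by query length) is a stand-in of my own and not one of the paper's measures. All function names are hypothetical.

```python
def term_positions(text):
    """View a text as a sequence of (term, position) pairs."""
    return [(term, pos) for pos, term in enumerate(text.lower().split())]


def lcs_length(a, b):
    """Length of the longest common subsequence of two term lists (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_similarity(query, document):
    """Stand-in measure: fraction of query terms found in the document in the same order."""
    q_terms = [term for term, _ in term_positions(query)]
    d_terms = [term for term, _ in term_positions(document)]
    if not q_terms:
        return 0.0
    return lcs_length(q_terms, d_terms) / len(q_terms)


doc = "the sequence model for text retrieval"
print(term_positions("sequence model"))
print(order_similarity("sequence model", doc))
print(order_similarity("model sequence", doc))
```

The two queries contain the same terms, so a pure bag-of-words ("appearance-based") match cannot tell them apart, while a position-aware score can.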

Tuesday, April 21, 2009

Adding new documents to database

http://trec.nist.gov/pubs/trec1/papers/11.txt

Friday, March 20, 2009

DOCUMENT CLUSTERING

A comparison of document clustering techniques (2000) by Michael Steinbach, George Karypis, Vipin Kumar, in KDD Workshop on Text Mining http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.9225

Projections for efficient document clustering (1997) by Hinrich Schutze, Craig Silverstein http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.8901

Latent semantic indexing (LSI): TREC-3 report (1995) by Susan T. Dumais http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.2241

Maximum likelihood from incomplete data via the EM algorithm (1977) by A. P. Dempster, N. M. Laird, D. B. Rubin, Journal of the Royal Statistical Society, Series B http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.133.4884

Indexing by latent semantic analysis (1990) by Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8490

An algorithm to cluster documents based on relevance (2005) by Monica Desai, Amanda Spink, Information Processing and Management http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.84.3463

A modified fuzzy ART for soft document clustering (2002) by Ravikumar Kondadadi, Robert Kozma http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90.6003

Friday, February 27, 2009

Read

A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese (1994)
by Makoto Nagao, Shinsuke Mori
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.8416

Class-Based n-Gram Models of Natural Language (1992)
by Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Jenifer C. Lai, Computational Linguistics
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9919

Beyond word n-grams (1995)
by Fernando C. Pereira, Yoram Singer, Naftali Tishby, in Proceedings of the Third Workshop on Very Large Corpora
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.7

Friday, February 20, 2009

next in line

N-GRAM AND LOCAL CONTEXT ANALYSIS FOR PERSIAN TEXT RETRIEVAL
by Farhad Oroumchian, Abolfazl Aleahmad, Parsia Hakimian, Farzad Mahdikhani http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.131.8268

Modeling documents for structure recognition using generalized n-grams (1997) by R. Brugger, A. Zramdini, R. Ingold, in Proceedings of ICDAR
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.2338

Discovering characteristic expressions from literary works: A new text analysis method beyond n-gram and kwic (2001)
by Masayuki Takeda, Tetsuya Matsumoto, Tomoko Fukuda, Ichirō Nanri
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.5337

currently reading

Information Retrieval

http://www.cc.gatech.edu/~isbell/tutorials/TextRetrieval.fm.pdf

Understanding Inverse Document Frequency: On theoretical arguments for IDF

Stephen Robertson, Microsoft Research, 7 JJ Thomson Avenue, Cambridge CB3 0FB, UK (and City University, London, UK)

http://www.soi.city.ac.uk/~ser/idfpapers/Robertson_idf_JDoc.pdf#search=

Using TF-IDF to Determine Word Relevance in Document Queries

Juan Ramos, Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855

http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf#search=

TEXT MINING

Ian H. Witten, Computer Science, University of Waikato, Hamilton, New Zealand