Student Seminar Report & Project Report With Presentation (PPT,PDF,DOC,ZIP)

Full Version: INFORMATION RETRIEVAL ppt.




BY
Sudheer reddy . B


Agenda

Definition
History
Overview
Performance Measures
What IR Systems Do, and How
Traditional View of IR

History:

The idea of using computers to search for relevant pieces of information was popularized by the article “As We May Think” by Vannevar Bush in 1945.
The first automated information retrieval systems were introduced in the 1950s and 1960s.
In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) program.
Web search engines.

Overview:

An information retrieval process begins when a user enters a query into the system.
The process may then be iterated if the user wishes to refine the query.

What IR Systems Try to Do

Predict, on the basis of some information about the user, and information about the knowledge resource, what information objects are likely to be the most appropriate for the user to interact with, at any particular time.

How IR Systems Try to Do This

Represent the user’s information problem (the query)
Represent (surrogate) and organize (classify) the contents of the knowledge resource
Compare query to surrogates (predict relevance)
Present results to the user for interaction/judgment
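
The four steps above can be sketched in a few lines. This is only an illustration under simple assumptions: a surrogate is taken to be the set of lowercase words in a text, and relevance is predicted by counting shared words. The function names `represent` and `retrieve` and the sample documents are hypothetical.

```python
def represent(text):
    """Build a surrogate: the set of lowercase words in the text."""
    return set(text.lower().split())


def retrieve(query, documents):
    """Return document ids ordered by predicted relevance to the query."""
    query_terms = represent(query)                       # represent the query
    surrogates = {doc_id: represent(text)                # represent the contents
                  for doc_id, text in documents.items()}
    scores = {doc_id: len(query_terms & terms)           # compare query to surrogates
              for doc_id, terms in surrogates.items()}
    # Present only documents sharing at least one term, best match first.
    matching = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(matching, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    "d1": "database systems store structured data",
    "d2": "information retrieval systems rank documents",
    "d3": "cooking with fresh herbs",
}
results = retrieve("information retrieval systems", documents)
print(results)
```

The user can then judge the presented results and refine the query, which repeats the same steps.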

Performance measures:

The traditional goal of IR is to retrieve all and only the relevant information objects (IOs) in response to a query.
“All” is measured by recall: the proportion of relevant IOs in the collection that are retrieved.
“Only” is measured by precision: the proportion of retrieved IOs that are relevant.
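
As a small worked example of the two measures, the sketch below computes precision and recall from a set of retrieved IOs and a set of relevant IOs. The function name `precision_recall` and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant IOs that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 IOs retrieved, 5 relevant IOs in the collection, 2 in common.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7", "d8", "d9"],
)
print(precision, recall)  # 0.5 0.4
```

Here half of what was retrieved is relevant (precision 2/4), while only two of the five relevant IOs were found (recall 2/5).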



Information Retrieval Systems
- Information retrieval (IR) systems use a simpler data model than database systems
  - Information organized as a collection of documents
  - Documents are unstructured, no schema
- Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents
  - e.g., find documents containing the words “database systems”
- Can be used even on textual descriptions provided with non-textual data such as images
- Web search engines are the most familiar example of IR systems
- Differences from database systems
  - IR systems don’t deal with transactional updates (including concurrency control and recovery)
  - Database systems deal with structured data, with schemas that define the data organization
  - IR systems deal with some querying issues not generally addressed by database systems
    - Approximate searching by keywords
    - Ranking of retrieved answers by estimated degree of relevance
Keyword Search
- In full text retrieval, all the words in each document are considered to be keywords.
  - We use the word term to refer to the words in a document
- Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not
  - Ands are implicit, even if not explicitly specified
- Ranking of documents on the basis of estimated relevance to a query is critical
  - Relevance ranking is based on factors such as
    - Term frequency
      - Frequency of occurrence of query keyword in document
    - Inverse document frequency
      - How many documents the query keyword occurs in
        - Fewer → give more importance to keyword
    - Hyperlinks to documents
      - More links to a document → document is more important
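
Keyword queries with the connectives and, or, and not are commonly answered from an inverted index, which maps each term to the set of documents containing it. The sketch below assumes that approach; the slides do not name a data structure, and `build_index`, `search`, and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, all_ids, must=(), any_of=(), must_not=()):
    """Evaluate a query: every `must` term, at least one `any_of` term
    (if any are given), and no `must_not` term."""
    result = set(all_ids)
    for term in must:                      # and: intersect posting sets
        result &= index.get(term, set())
    if any_of:                             # or: union, then intersect
        union = set()
        for term in any_of:
            union |= index.get(term, set())
        result &= union
    for term in must_not:                  # not: set difference
        result -= index.get(term, set())
    return sorted(result)


documents = {
    "d1": "database systems concepts",
    "d2": "operating systems concepts",
    "d3": "database tuning guide",
}
index = build_index(documents)
print(search(index, documents, must=["database", "systems"]))       # ['d1']
print(search(index, documents, must=["systems"], must_not=["database"]))  # ['d2']
```

Passing several keywords in `must` mirrors the implicit and: a document has to contain all of them to be returned.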
Relevance Ranking Using Terms
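- TF-IDF (Term frequency/Inverse Document frequency) ranking:
  - Let n(d) = number of terms in the document d
  - n(d, t) = number of occurrences of term t in the document d.
  - Relevance of a document d to a term t:
      TF(d, t) = log(1 + n(d, t) / n(d))
    - The log factor is to avoid excessive weight to frequent terms
  - Relevance of document d to query Q, where n(t) is the number of documents that contain term t:
      r(d, Q) = sum over all terms t in Q of TF(d, t) / n(t)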
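
A minimal sketch of this ranking follows. It assumes the usual textbook form of the two scores, TF(d, t) = log(1 + n(d, t) / n(d)) and r(d, Q) = sum over t in Q of TF(d, t) / n(t), with n(t) taken to be the number of documents containing t. The function names and sample documents are hypothetical.

```python
import math


def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)) for a document given as a term list."""
    if not doc_terms:
        return 0.0
    return math.log(1 + doc_terms.count(term) / len(doc_terms))


def relevance(doc_id, query_terms, docs):
    """r(d, Q): sum of TF(d, t) / n(t) over the query terms."""
    score = 0.0
    for term in query_terms:
        # n(t): number of documents that contain the term (assumed definition)
        n_t = sum(1 for terms in docs.values() if term in terms)
        if n_t:
            score += tf(docs[doc_id], term) / n_t
    return score


docs = {
    "d1": "database systems use database schemas".split(),
    "d2": "retrieval systems rank documents".split(),
    "d3": "gardening tips".split(),
}
query = ["database", "systems"]
ranking = sorted(docs, key=lambda d: relevance(d, query, docs), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

The term "database" occurs in only one document, so it contributes more to d1's score than "systems", which occurs in two.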
- Most systems add to the above model
  - Words that occur in title, author list, section headings, etc. are given greater importance
  - Words whose first occurrence is late in the document are given lower importance
  - Very common words such as “a”, “an”, “the”, “it” etc. are eliminated
    - Called stop words
  - Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart
- Documents are returned in decreasing order of relevance score
  - Usually only top few documents are returned, not all
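
Two of these refinements are easy to illustrate: dropping stop words before indexing, and returning only the top few documents by score. The sketch below is an assumption-laden example; the stop-word set contains only the four words quoted above (real lists are longer), and `index_terms`, `top_k`, and the scores are made up.

```python
import heapq

# Only the examples quoted in the text; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "it"}


def index_terms(text):
    """Split text into lowercase terms, eliminating stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest relevance score,
    in decreasing order of score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


print(index_terms("The cat sat on a mat"))             # ['cat', 'sat', 'on', 'mat']
print(top_k({"d1": 0.2, "d2": 0.9, "d3": 0.5}, k=2))   # [('d2', 0.9), ('d3', 0.5)]
```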
Similarity Based Retrieval
- Similarity based retrieval: retrieve documents similar to a given document
  - Similarity may be defined on the basis of common words
    - E.g. find the k terms in a given document A with the highest TF(A, t) / n(t) and use these terms to find the relevance of other documents.
- Relevance feedback: similarity can be used to refine the answer set of a keyword query
  - User selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these
- Vector space model: define an n-dimensional space, where n is the number of words in the document set.
  - Vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t) / n(t), where t is the i-th word
  - The cosine of the angle between the vectors of two documents is used as a measure of their similarity.
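
The vector space model can be sketched with sparse vectors stored as dictionaries. The example assumes TF(d, t) = log(1 + n(d, t) / n(d)) and takes n(t) to be the number of documents containing t; `vectorize`, `cosine`, and the sample documents are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def vectorize(docs):
    """Map each document id to a sparse vector {term: TF(d, t) / n(t)}."""
    # n(t): in how many documents each term appears (assumed definition)
    n_t = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        vectors[doc_id] = {
            term: math.log(1 + count / len(terms)) / n_t[term]
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = {
    "a": "information retrieval ranking".split(),
    "b": "information retrieval systems".split(),
    "c": "fresh garden herbs".split(),
}
vectors = vectorize(docs)
print(cosine(vectors["a"], vectors["b"]))  # positive: two shared terms
print(cosine(vectors["a"], vectors["c"]))  # 0.0: no shared terms
```

A cosine near 1 means the two documents point in almost the same direction in term space; 0 means they share no weighted terms.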