08-02-2012, 11:52 AM
Text Mining
[attachment=17268]
Text Databases
.Consist of large collections of documents from various sources, e.g. articles, books, research papers and digital libraries.
.The data is semistructured.
.A document contains a few structured fields, such as title and authors, and unstructured text components, such as abstract and contents.
.Information retrieval techniques, such as indexing methods, have been developed to handle unstructured documents.
Information Retrieval (IR)
.It is a field that has been developing in parallel with database systems.
.Database systems have focused on query and transaction processing of structured data.
.Information retrieval has focused on the organization and retrieval of information from a large number of text-based documents.
F-score
An IR system often has to trade off recall for precision, or vice versa; the F-score is a commonly used way to balance the two.
It is the harmonic mean of precision and recall.
It discourages a system that sacrifices one measure for the other.
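The harmonic-mean definition above can be sketched in a few lines of Python. This is only an illustration: the function name f_score and the sample precision and recall values are assumptions made here, not part of the original notes.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    # Guard against division by zero when both measures are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical systems with the same arithmetic mean (0.5):
balanced = f_score(0.5, 0.5)   # neither measure is sacrificed
lopsided = f_score(0.9, 0.1)   # recall sacrificed for precision

print(balanced, lopsided)
```

The lopsided system scores far lower than the balanced one even though the plain average of its two measures is the same, which is how the harmonic mean penalizes sacrificing one measure for the other.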
Document Selection
.A query specifies constraints for selecting relevant documents.
.Boolean Model
.A document is represented as a set of keywords, and the user provides a Boolean expression of keywords.
E.g. "tea or coffee", "database systems but not DB2".
.The retrieval system takes such a Boolean query and returns the documents that satisfy it.
.Works well when the user knows a lot about the document collection.
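As a rough sketch of the Boolean model described above, each document can be held as a Python set of keywords and each query written as a Boolean test on that set. The document names, keyword sets and the select helper are all hypothetical, chosen only to mirror the two example queries.

```python
# Hypothetical toy collection: each document is a set of keywords.
documents = {
    "doc1": {"tea", "milk"},
    "doc2": {"coffee", "database systems", "DB2"},
    "doc3": {"database systems", "indexing"},
    "doc4": {"tea", "coffee"},
}


def select(docs, predicate):
    """Return the names of documents whose keyword set satisfies the query."""
    return sorted(name for name, keywords in docs.items() if predicate(keywords))


# Query: tea or coffee
tea_or_coffee = select(documents, lambda kw: "tea" in kw or "coffee" in kw)

# Query: database systems but not DB2
db_not_db2 = select(
    documents, lambda kw: "database systems" in kw and "DB2" not in kw
)

print(tea_or_coffee)
print(db_not_db2)
```

A document is either selected or not; there is no ranking, which is one reason the model suits users who already know the collection well enough to write precise queries.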