Tool Support for Topic Modeling and Interactive Visualization for Large Document Corpus
Advisor Information
Myoungkyu Song
Presentation Type
Poster
Start Date
26-3-2021 12:00 AM
End Date
26-3-2021 12:00 AM
Abstract
A vast number of document collections are becoming available in large repositories with the rapid growth of hardware platforms and software technology for the World Wide Web. The National Center for Biotechnology Information (NCBI) repository provides millions of bibliographic documents. Researchers often need to inspect these large document collections to understand datasets and make critical decisions. For example, when bioengineering researchers explore the underlying biological mechanisms of biofilms, they frequently need to identify and understand a comprehensive body of published literature, studying and identifying relations between material features of interest.
However, these researchers typically have great difficulty exploring such ever-growing datasets when they survey and evaluate new information in the existing bioengineering literature. For example, questions about a text corpus such as “What is the useful information, and which information is repeatedly used in the context of different kinds of documents?” and “What are the relationships between topics X, Y, and Z across different documents?” often cannot be answered, even when advanced ranking techniques are used.
The increasing amount of text data creates a need for advanced approaches that can learn interesting and important patterns from the data. Structured data can be managed by a database; for unstructured text data, however, approximate keyword searching and random browsing are usually used to manage and find useful information in a collection. Many research topics in the field of information retrieval (IR) have been studied, such as text clustering, text categorization, summarization, and recommendation. However, IR has usually focused on facilitating information access rather than identifying common patterns. The primary goal of text mining is to help users analyze data sets, understand semantic relations between characteristics, and make decisions.