Title: ISOLATING CONTENT AND METADATA FROM WEBLOGS USING CLASSIFICATION AND RULE-BASED APPROACHES

Conference · OSTI ID: 1118123

The emergence and increasing prevalence of social media, such as internet forums, weblogs (blogs), and wikis, have created a new opportunity to measure public opinion, attitudes, and social structures. A major challenge in leveraging this information is isolating the content and metadata in weblogs, as there is no standard, universally supported, machine-readable format for presenting this information. We present two algorithms for isolating this information. The first uses web block classification, in which each node in the Document Object Model (DOM) of a page is classified as one of several predefined attributes from a common blog schema. The second uses a set of heuristics to select web blocks. Both algorithms perform at a level suitable for initial use, which validates the approach for isolating content and metadata from blogs. The resulting data serves as a starting point for analytical work on the content and substance of collections of weblog pages.
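
To make the rule-based idea concrete, the following is a minimal sketch of heuristic web block selection. It is not the authors' implementation: the abstract names neither the schema attributes nor the heuristics, so the attribute names (`title`, `author`, `date`, `content`), the class-name hints in `CLASS_HINTS`, and the longest/shortest-text rule are all assumptions chosen for illustration. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical schema attributes and class-name hints (assumptions; the
# abstract does not list the blog schema or the heuristics actually used).
CLASS_HINTS = {
    "title": ("title", "headline"),
    "author": ("author", "byline"),
    "date": ("date", "time", "published"),
    "content": ("content", "entry", "post-body"),
}

# Elements that never have a closing tag, so they are not tracked as blocks.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockCollector(HTMLParser):
    """Collect the tag, class string, and text of every element on a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, innermost last
        self.blocks = []  # closed elements, in the order they were closed

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class") or ""
        self.stack.append({"tag": tag, "cls": cls, "text": []})

    def handle_data(self, data):
        # Text counts toward every element that currently encloses it.
        for node in self.stack:
            node["text"].append(data)

    def handle_endtag(self, tag):
        if tag not in (node["tag"] for node in self.stack):
            return  # stray closing tag; ignore it
        while self.stack:
            node = self.stack.pop()
            # Normalize whitespace across the collected text fragments.
            node["text"] = " ".join(" ".join(node["text"]).split())
            self.blocks.append(node)
            if node["tag"] == tag:
                break


def extract_fields(html):
    """Pick one web block per schema attribute using class-name heuristics."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    fields = {}
    for name, hints in CLASS_HINTS.items():
        candidates = [
            block for block in parser.blocks
            if block["text"] and any(h in block["cls"].lower() for h in hints)
        ]
        if candidates:
            # Assumed rule: the post body is the longest matching block,
            # while metadata fields are the shortest matching block.
            pick = max if name == "content" else min
            fields[name] = pick(candidates, key=lambda b: len(b["text"]))["text"]
    return fields
```

Calling `extract_fields` on the HTML of a blog post returns a dictionary with an entry for each attribute whose hints matched some block. The classification-based algorithm would replace the hand-written rules with a trained classifier applied to features of each DOM node; that variant is not sketched here because the abstract does not describe its features or model.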

Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization: USDOE
DOE Contract Number: AC05-76RL01830
OSTI ID: 1118123
Report Number(s): PNNL-SA-79748
Resource Relation: Conference: Proceedings of the IADIS International Conferences: Web Based Communities and Social Media 2011, Collaborative Technologies 2011 and Internet Applications and Research 2011, July 22-24, 2011, Rome, Italy, 187-191
Country of Publication: United States
Language: English