Title:
Omini: A Fully Automated Object Extraction System for the World Wide Web

dc.contributor.author Buttler, David John en_US
dc.contributor.author Liu, Ling
dc.contributor.author Pu, Calton
dc.date.accessioned 2005-06-17T17:45:34Z
dc.date.available 2005-06-17T17:45:34Z
dc.date.issued 2000 en_US
dc.description.abstract This paper presents Omini, a fully automated object extraction system. A distinct feature of Omini is its suite of algorithms and automatically learned information extraction rules for discovering and extracting objects from dynamic or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web pages over 40 sites. It achieves 100% precision (returns only correct objects) and excellent recall (between 93% and 98%, with very few significant objects left out). The object boundary identification algorithms are fast, about 0.1 second per page with a simple optimization. en_US
dc.format.extent 531766 bytes
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/6590
dc.language.iso en_US
dc.publisher Georgia Institute of Technology en_US
dc.relation.ispartofseries CC Technical Report; GIT-CC-00-22 en_US
dc.subject Object extraction system
dc.subject Web page discovery
dc.title Omini: A Fully Automated Object Extraction System for the World Wide Web en_US
dc.type Text
dc.type.genre Technical Report
dspace.entity.type Publication
local.contributor.author Liu, Ling
local.contributor.author Pu, Calton
local.contributor.corporatename College of Computing
local.relation.ispartofseries College of Computing Technical Report Series
relation.isAuthorOfPublication 96391b98-ac42-4e2c-93ee-79a5e16c2dfb
relation.isAuthorOfPublication fc48a3de-da43-4d32-af59-414047eb7cd7
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isSeriesOfPublication 35c9e8fc-dd67-4201-b1d5-016381ef65b8
Files
Original bundle
Name: GIT-CC-00-22.pdf
Size: 519.3 KB
Format: Adobe Portable Document Format