
Kizomba

What is Kizomba?

The Web is an ever-growing repository of valuable information. However, that information lacks explicit semantics, since it is buried in web documents that are represented using HTML. Information extractors are software components that help software engineers extract structured information from web documents.

There are many proposals in the literature. They can be classified as rule-based or heuristic-based. Rule-based information extractors require the user to provide extraction rules, which can be handcrafted or learnt in a supervised or unsupervised fashion. Heuristic-based information extractors have built-in heuristics that allow them to extract information without any user intervention. We think that the most appropriate approaches nowadays are rule-based proposals whose rules are learnt without supervision, or heuristic-based proposals, since they require little or no user intervention, which makes them well suited to extracting information from the Web. Current proposals in this category include RoadRunner [1], ExAlg [2], FiVaTech [3], and Trinity [4].

RoadRunner compares a collection of documents to infer union-free regular expressions that describe the input template. ExAlg works in two stages: it first computes large and frequently occurring equivalence classes of tokens and then learns a regular expression and a data schema from them. FiVaTech first identifies nodes in the input DOM trees that have a similar structure, then aligns their children and mines repetitive and optional patterns to create the extraction rule. Trinity partitions the input documents into a trinary tree, which is traversed to build a regular expression with capturing groups.

The previous proposals have some problems regarding the extraction rules that they learn. None of them is able to deal with all of the particularities of typical web documents, namely nested attributes, multi-valued attributes, attribute permutations, and unique data records. Furthermore, their computational complexity has not been analysed in most cases.

Further information can be found in our Doctoral Consortium article at the PAAMS conference.

Current process

Kizomba is a new heuristic-based web information extractor that is built on a pipeline of heuristics, that is, rules that can modify both the documents and the extracted information. The heuristics fall into three categories: pre-processing, extraction, and post-processing heuristics. Pre-processing heuristics modify the content of the documents. Extraction heuristics identify common web structures, extract the underlying information, and remove those structures from the documents. Post-processing heuristics apply corrections to the extracted information.
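
To make this architecture concrete, here is a minimal sketch of how such a pipeline could be organised in Java with jsoup (the library suggested by the ownText references below). The Heuristic interface, the KizombaPipeline class, and the record representation are illustrative assumptions rather than the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.jsoup.nodes.Document;

// Hypothetical contract for a heuristic: it may rewrite the documents
// and/or contribute to the extracted records.
interface Heuristic {
    void apply(List<Document> documents, List<Map<String, String>> records);
}

// Hypothetical three-phase pipeline: pre-processing, extraction, post-processing.
class KizombaPipeline {
    private final List<Heuristic> preProcessing;
    private final List<Heuristic> extraction;
    private final List<Heuristic> postProcessing;

    KizombaPipeline(List<Heuristic> preProcessing,
                    List<Heuristic> extraction,
                    List<Heuristic> postProcessing) {
        this.preProcessing = preProcessing;
        this.extraction = extraction;
        this.postProcessing = postProcessing;
    }

    // Runs every phase in order over the same set of documents and records.
    List<Map<String, String>> run(List<Document> documents) {
        List<Map<String, String>> records = new ArrayList<>();
        for (Heuristic h : preProcessing) {
            h.apply(documents, records);
        }
        for (Heuristic h : extraction) {
            h.apply(documents, records);
        }
        for (Heuristic h : postProcessing) {
            h.apply(documents, records);
        }
        return records;
    }
}
```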

Currently, the following heuristics are applied:

Pre-processing heuristics

  1. UnwrapStyleHeuristic: Unwrap styling tags (abbr, b, big, br, center, dbo, del, dfn, em, font, hr, i, ins, kbd, mark, nobr, p, q, s, samp, small, strike, strong, sub, sup, super, tt, u, var, wbr); see the jsoup sketch after this list.
  2. AttributesIntoNodesHeuristic: Transform some attributes (src, href and content) into HTML nodes to apply the next heuristic successfully.
  3. TransformBackgroundHeuristic: Transform background-image attributes into img, if the content changes across the documents.
  4. ModifiedHeuristic: Iterate over the documents to identify which nodes change across the documents and label them as modified. A changed node is one whose ownText is not the same in all the documents or that does not appear in some other document. Nodes are labelled using the data-changed and data-child-changed attributes (see the sketch after this list).
  5. RemoveUselessHeuristic: Remove meta elements with no content change across the documents, as well as script, style, and form elements with the same text across the documents.
  6. NormalizeTableHeuristic: Expand rowspan and colspan attributes.
  7. RemoveRepeatingColumns: Remove columns whose content is the same (or absent) across the documents.
  8. AdBlockHeuristic: Apply AdBlock filter lists.
  9. RenderToTableHeuristic: Analyse the rendering of divs and spans to identify tabular structures and transform them into tables.
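
As an illustration of the first heuristic, the following is a minimal jsoup-based sketch of what UnwrapStyleHeuristic could look like. The class name and the reduced tag list are assumptions; the real heuristic covers the full tag list above.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch of UnwrapStyleHeuristic: removes styling tags but keeps their contents.
class UnwrapStyleSketch {
    // Reduced selector for illustration only.
    private static final String STYLE_TAGS = "b, big, em, font, i, small, strong, sub, sup, u";

    static void unwrapStyleTags(Document document) {
        for (Element element : document.select(STYLE_TAGS)) {
            // unwrap() replaces the element with its children,
            // so "<b>price</b>" becomes the bare text node "price".
            element.unwrap();
        }
    }
}
```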
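
The following is a sketch of the comparison behind ModifiedHeuristic: a node is marked when its ownText differs in some document or when it is missing there. Aligning nodes across documents via cssSelector() is an assumption made for illustration; the actual alignment strategy may differ, and the data-child-changed labelling is omitted.

```java
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch of ModifiedHeuristic: marks nodes whose own text changes across documents.
class ModifiedSketch {
    static void markChangedNodes(List<Document> documents) {
        Document reference = documents.get(0);
        for (Element element : reference.getAllElements()) {
            String selector = element.cssSelector();
            for (Document other : documents.subList(1, documents.size())) {
                Element counterpart = other.select(selector).first();
                // A node changes when it is absent in some document
                // or when its ownText differs from the reference document.
                if (counterpart == null || !counterpart.ownText().equals(element.ownText())) {
                    element.attr("data-changed", "true");
                    break;
                }
            }
        }
    }
}
```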

Extraction heuristics

  1. KeyValueTableHeuristic: Analyse the repetition ratio of HTML tables to extract information from tables with two columns, keys and values (see the sketch after this list).
  2. ComparisonTableHeuristic: Analyse the repetition ratio of HTML tables to extract information from comparison tables.
  3. GlobalInformationHeuristic: Extract information from meta descriptions, meta titles, meta keywords, headers and titles if the information changes across the documents, and remove the common parts.
  4. ImageHeuristic: Extract images if the information changes across the documents.
  5. BreadcrumbsHeuristic: Extract breadcrumbs using a classifier.
  6. DescriptionHeuristic: Extract description using a classifier.
  7. CommentHeuristic: Extract comment lists using a classifier.
  8. ListHeuristic: Extract lists with different content across the documents.
  9. DescriptionTableHeuristic: Extract information from description tables.
  10. MicrodataHeuristic: Extract microdata information (see the sketch after this list).
  11. RDFaHeuristic: Extract RDFa information.
  12. GeneralHeuristic: Extract the remaining information from the nodes marked by ModifiedHeuristic.
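
As an illustration of the table-oriented extraction heuristics, the following is a minimal sketch of the core of KeyValueTableHeuristic: key-value pairs are read from tables whose rows contain exactly two cells. The repetition-ratio analysis mentioned above is omitted, and the class name is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Sketch of KeyValueTableHeuristic: reads two-column tables as key-value pairs.
class KeyValueTableSketch {
    static Map<String, String> extractKeyValuePairs(Document document) {
        Map<String, String> record = new LinkedHashMap<>();
        for (Element table : document.select("table")) {
            for (Element row : table.select("tr")) {
                Elements cells = row.select("th, td");
                // Only rows with exactly two cells are treated as key-value rows.
                if (cells.size() == 2) {
                    record.put(cells.get(0).text().trim(), cells.get(1).text().trim());
                }
            }
        }
        return record;
    }
}
```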
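
Similarly, the following sketch illustrates MicrodataHeuristic: items are located through the itemscope attribute and their properties through itemprop. The property-value rules are simplified (content for meta, href for links, src for images, text otherwise), and the class name is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch of MicrodataHeuristic: collects itemprop values grouped by itemscope.
class MicrodataSketch {
    static List<Map<String, String>> extractItems(Document document) {
        List<Map<String, String>> items = new ArrayList<>();
        for (Element scope : document.select("[itemscope]")) {
            Map<String, String> item = new LinkedHashMap<>();
            item.put("itemtype", scope.attr("itemtype"));
            for (Element property : scope.select("[itemprop]")) {
                item.put(property.attr("itemprop"), propertyValue(property));
            }
            items.add(item);
        }
        return items;
    }

    // Simplified microdata value rules.
    private static String propertyValue(Element property) {
        switch (property.tagName()) {
            case "meta": return property.attr("content");
            case "a":
            case "link": return property.attr("href");
            case "img": return property.attr("src");
            default: return property.text().trim();
        }
    }
}
```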

Post-processing heuristics

  1. TransformKeyValueHeuristic: Transform key-value lists into key: value pairs.
  2. RemoveTrailingCharactersHeuristic: Remove characters that repeat across the records at the start and the end of the keys and values (see the sketch after this list).
  3. HierarchiseHeuristic: Hierarchise anonymous records using their CSS selector.
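
The following is a minimal sketch of the string manipulation behind RemoveTrailingCharactersHeuristic: the longest prefix and suffix shared by all values of a field across the records are computed and stripped. The class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of RemoveTrailingCharactersHeuristic: strips characters that repeat
// across all records at the start and at the end of a field's values.
class RemoveTrailingCharactersSketch {
    static List<String> stripCommonAffixes(List<String> values) {
        if (values.isEmpty()) {
            return new ArrayList<>();
        }
        int prefix = commonPrefixLength(values);
        int suffix = commonSuffixLength(values);
        List<String> stripped = new ArrayList<>();
        for (String value : values) {
            // Guard against the shared prefix and suffix overlapping in short values.
            int end = Math.max(prefix, value.length() - suffix);
            stripped.add(value.substring(prefix, end));
        }
        return stripped;
    }

    // Longest prefix shared by every value.
    private static int commonPrefixLength(List<String> values) {
        String candidate = values.get(0);
        for (String value : values) {
            int i = 0;
            while (i < candidate.length() && i < value.length()
                    && candidate.charAt(i) == value.charAt(i)) {
                i++;
            }
            candidate = candidate.substring(0, i);
        }
        return candidate.length();
    }

    // Longest suffix shared by every value.
    private static int commonSuffixLength(List<String> values) {
        String candidate = values.get(0);
        for (String value : values) {
            int i = 0;
            while (i < candidate.length() && i < value.length()
                    && candidate.charAt(candidate.length() - 1 - i) == value.charAt(value.length() - 1 - i)) {
                i++;
            }
            candidate = candidate.substring(candidate.length() - i);
        }
        return candidate.length();
    }
}
```

For instance, applied to the values "Price: 10 €" and "Price: 25 €", this sketch would strip the shared "Price: " and " €" and return "10" and "25".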

References

[1] Automatic information extraction from large websites. Valter Crescenzi, Giansalvatore Mecca. J. ACM, 51(5):731-779, 2004

[2] Extracting structured data from web pages. Arvind Arasu, Hector Garcia-Molina. SIGMOD Conference, 337-348, 2003

[3] FiVaTech: Page-Level Web Data Extraction from Template Pages. Mohammed Kayed, Chia-Hui Chang. IEEE Trans. Knowl. Data Eng., 22(2):249-263, 2010

[4] Trinity: On Using Trinary Trees for Unsupervised Web Data Extraction. Hassan A. Sleiman, Rafael Corchuelo. IEEE Trans. Knowl. Data Eng., 26(6):1544-1556, 2014

 

Juan C. Roldán, 15/06/2016

  1. TextFixHeuristic: Dirty fix over text
  2. UnwrapStyleHeuristic: Unwrap styling tags (abbr, b, big, br, center, dbo, del, dfn, em, font, hr, i, ins, kbd, mark, nobr, p, q, s, samp, small, strike, strong, sub, sup, super, tt, u, var, wbr)
  3. AttributesIntoNodesHeuristic: Transform some attributes (src, href and content) into HTML nodes to apply the next heuristic successfully.
  4. TransformBackgroundHeuristic: Transform background-image attributes into img, if the content changes across the documents.
  5. ModifiedHeuristic: Iterate over the documents to identify which nodes change across the documents and label them as modified. A changed node is one whose ownText is not the same in all the documents or that does not appear in some other document. Nodes are labelled using the data-changed and data-child-changed attributes.
    • Problem: Duplicated IDs, such as postAdButton in backpage.
    • Problem: Namespaces, such as g:plusone in floridagunclassifieds.
    • Problem: Slow extractions, such as elpasoguntrader and Waterstones. This seems to be related to anonymised selectors; add a visited list and a pre-processing heuristic.
    • Problem: IDs with invalid characters (#location_id[]).
  6. RemoveUselessHeuristic: Remove meta elements with no content change across the documents, as well as script, style, and form elements with the same text across the documents.
  7. NormalizeTableHeuristic: Expand rowspan and colspan attributes.
  8. RemoveRepeatingColumns: Remove columns whose content is the same (or absent) across the documents.
  9. AdBlockHeuristic: Apply AdBlock filter lists.
    • Test with: Cars/www.carzone.ie
  10. RenderToTableHeuristic: Analyse the rendering of divs and spans to identify tabular structures and transform them into tables.
    • Test with: www.manybooks.net