Cobra: Java HTML Parser

The all-Java, open source Cobra HTML Toolkit includes an HTML parser that can be used independently of the Cobra rendering engine. Its features include the following:
  • It implements W3C HTML DOM Level 2 interfaces.
  • It parses "street HTML" as would be expected of a web browser.
  • It can be used in headless mode.
  • It provides incremental notifications of DOM modifications as the document is parsed.
  • It provides routines to incrementally modify the DOM, e.g. by setting the innerHTML property of an element.
  • It is Javascript-aware. DOM modifications made by scripts during parsing are reflected in the resulting DOM. Javascript can also be disabled.
  • It is CSS-aware.

Cobra Version

The information on this page applies to Cobra 0.98.1 and later. Cobra may be downloaded from the SourceForge download area for this project.

API Documentation

See the Cobra API Documentation.

Basic Usage

The recommended way to use the Cobra HTML parser is via the DocumentBuilderImpl class, roughly as follows:

import org.lobobrowser.html.parser.*;
import org.lobobrowser.html.test.*;
import org.lobobrowser.html.*;
import org.w3c.dom.*;
...
UserAgentContext context = new SimpleUserAgentContext();
DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
// A document URI and a charset should be provided.
Document document = dbi.parse(new InputSourceImpl(inputStream, documentURI, charset));

The HtmlParser class can be used directly as well. In particular, it can be used to parse an HTML document into a third-party DOM implementation, or to parse HTML below a particular DOM node (which is how the innerHTML property is implemented).

import org.lobobrowser.html.parser.*;
import org.lobobrowser.html.test.*;
import org.lobobrowser.html.*;
import org.w3c.dom.*;
import org.w3c.dom.html2.*;
...
UserAgentContext context = new SimpleUserAgentContext();
DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
HTMLDocument document = (HTMLDocument) dbi.createDocument(inputSource);
...
HtmlParser parser = new HtmlParser(context, document);
// Parse the markup read from myReader and place the resulting nodes below someParentNode.
parser.parse(myReader, someParentNode);

Incremental Notifications

A document notification listener can be added to an HTMLDocumentImpl instance by calling addDocumentNotificationListener(). The DocumentNotificationListener implementation is notified of several types of document modifications as the document is parsed. Notifications (intended to allow incremental rendering) can also occur when styles are modified or when the document is modified programmatically with Javascript.

To receive notifications, you need a document instance before parsing starts, so that the listener can be registered first.

Performance Tips

Parser performance is typically dominated by the loading of remote scripts and CSS documents. There are generally two ways to deal with this: (1) disable Javascript and/or CSS, or (2) implement some sort of caching mechanism.

All Cobra requests are processed through UserAgentContext.createHttpRequest(), so the way Cobra processes requests can be changed by either implementing the UserAgentContext and HttpRequest interfaces, or by extending simple implementations of these interfaces provided with Cobra.
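The caching option could be sketched roughly as follows. This uses only the JDK: CachingFetcher and its loader function are hypothetical names, not part of Cobra. The assumption is that a custom HttpRequest implementation returned from createHttpRequest() would consult such a cache before going to the network; that wiring is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper (not a Cobra class): keeps resource bodies in memory, keyed by URL. */
class CachingFetcher {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> loader;

    /** The loader performs the real (slow) retrieval, e.g. an HTTP GET. */
    CachingFetcher(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    /** Returns the cached body for the URL; the loader runs only on a cache miss. */
    byte[] fetch(String url) {
        return cache.computeIfAbsent(url, loader);
    }

    /** Number of distinct URLs currently cached. */
    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        // Stand-in loader: a real one would download the script or stylesheet.
        CachingFetcher fetcher = new CachingFetcher(
                url -> ("body of " + url).getBytes(StandardCharsets.UTF_8));
        fetcher.fetch("http://example.com/style.css");
        fetcher.fetch("http://example.com/style.css"); // served from the cache
        System.out.println("cached entries: " + fetcher.size());
    }
}
```

A production cache would also need expiry and a size limit; this sketch omits both to keep the idea visible.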

Whether Javascript is enabled is determined by the UserAgentContext.isScriptingEnabled() method, so Javascript can be disabled by calling the corresponding setter in SimpleUserAgentContext. Similarly, loading of external CSS documents is controlled by the UserAgentContext.isExternalCSSEnabled() method.

Customizing or Disabling Arbitrary Elements

Before Cobra explicitly supported disabling CSS or Javascript, a general-purpose technique could be used to achieve the same result, and it remains available for customizing or disabling other elements. Essentially, the HTMLDocumentImpl class can be extended and its createElement, createTextNode and other such methods can be overridden to provide custom node instances.

Examples

  1. Find images in a page.
    This example illustrates how to use the getImages() method of HTMLDocumentImpl to get a list of image elements in a page. This is equivalent to using the document.images property in Javascript.



  2. Using XPath and an XML DOM.
    HTMLDocumentImpl is a class that implements the standard W3C Document interface, so utilities that operate on Document, e.g. XPath, work with documents parsed by Cobra. It is also possible to use a standard XML Document instance in conjunction with the Cobra HTML parser, as this example illustrates. We will use XPath to retrieve all the "A" links from a page. (Note that Javascript won't work with a standard XML document.)



  3. Simulation of form submission.
    In Cobra, form submission is considered a functionality of the renderer, i.e. something you would expect to have in a browser. It is possible, nevertheless, to provide a simple implementation of the rendering context that does not do any rendering, in order to allow form submissions to occur. We will do the equivalent of invoking form.submit() in Javascript. This example illustrates how to retrieve the main page from the MetaCrawler search engine, populate its search form, submit it, and list the first-page results of the query. It also disables Javascript and external CSS.
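As a rough illustration of the XPath approach from the second example, the sketch below runs an XPath query against a W3C Document using only JDK classes. LinkFinder and its methods are hypothetical names. The sample document is built by the JDK XML parser from well-formed markup as a stand-in for a parsed page; the assumption is that a Document produced by Cobra could be passed to findLinks() in its place, possibly with the element name in the expression adjusted to match the case used in that DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: collects link targets from any W3C Document. */
class LinkFinder {

    /** Returns the href value of every "a" element that has an href attribute. */
    static List<String> findLinks(Document document) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//a[@href]", document, XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    /** Stand-in for a parsed page: well-formed markup read by the JDK XML parser. */
    static Document sampleDocument() throws Exception {
        String markup = "<html><body>"
                + "<a href=\"/one\">first</a>"
                + "<p><a href=\"/two\">second</a></p>"
                + "<a name=\"top\">no href here</a>"
                + "</body></html>";
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        for (String href : findLinks(sampleDocument())) {
            System.out.println(href);
        }
    }
}
```

The query only depends on the Document interface, which is why the same code should apply regardless of which parser populated the DOM.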


