Cobra: Java HTML Parser
The all-Java, open source Cobra HTML Toolkit includes an
HTML parser that can be used independently of the Cobra rendering engine. Its
features include the following:
- It implements W3C HTML DOM Level 2 interfaces.
- It parses "street HTML" (real-world markup that is often malformed) the way a web browser would.
- It can be used in headless mode.
- It provides incremental notifications of DOM modifications as the document is parsed.
- It provides routines to incrementally modify the DOM, e.g. by setting the innerHTML property of an element.
- It is JavaScript-aware: DOM modifications made by scripts during parsing are
reflected in the resulting DOM. JavaScript can be disabled, however.
- It is CSS-aware.
Cobra Version
The information on this page applies to Cobra 0.98.1 and later.
Cobra may be downloaded from the SourceForge download area for this project.
API Documentation
See the Cobra API Documentation.
Basic Usage
The recommended way to use the Cobra HTML parser is via the DocumentBuilderImpl class,
roughly as follows:
import org.lobobrowser.html.parser.*;  // DocumentBuilderImpl, InputSourceImpl
import org.lobobrowser.html.test.*;    // SimpleUserAgentContext
import org.lobobrowser.html.*;         // UserAgentContext
import org.w3c.dom.*;                  // Document
...
// The context controls how the parser issues requests and whether scripting is enabled.
UserAgentContext context = new SimpleUserAgentContext();
DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
// A document URI and a charset should be provided.
Document document = dbi.parse(new InputSourceImpl(inputStream, documentURI, charset));
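The charset passed to InputSourceImpl has to come from somewhere, and for a page fetched over HTTP a common source is the Content-Type header of the response. The following is a minimal sketch of extracting it, using only the JDK. The CharsetSniffer class, its method, and the fallback parameter are hypothetical helpers introduced for illustration; they are not part of Cobra.

```java
import java.util.Locale;

// Hypothetical helper, not part of Cobra.
class CharsetSniffer {
    /**
     * Returns the charset parameter of a Content-Type header value,
     * or the given fallback when the header is null or has no charset.
     */
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }
}
```

One way to use it would be to open a java.net.URLConnection, pass the result of getContentType() to this method, and hand the returned name to InputSourceImpl together with the connection's input stream and the document URI.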
The HtmlParser class can be used directly as well. In particular, it can be
used to parse an HTML document into a third-party DOM implementation, or to
parse HTML below a particular DOM node (which is how the innerHTML property
is implemented).
import org.lobobrowser.html.parser.*;  // DocumentBuilderImpl, HtmlParser
import org.lobobrowser.html.test.*;    // SimpleUserAgentContext
import org.lobobrowser.html.*;         // UserAgentContext
import org.w3c.dom.*;
import org.w3c.dom.html2.*;            // HTMLDocument
...
UserAgentContext context = new SimpleUserAgentContext();
DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
// Obtain a document instance before parsing starts.
HTMLDocument document = (HTMLDocument) dbi.createDocument(inputSource);
...
// Parse the HTML read from myReader below someParentNode.
HtmlParser parser = new HtmlParser(context, document);
parser.parse(myReader, someParentNode);
Incremental Notifications
A document notification listener can be added to an HTMLDocumentImpl
instance by calling addDocumentNotificationListener().
The DocumentNotificationListener implementation is then notified of
several types of document modifications as the document is parsed.
Notifications, which are intended to allow incremental rendering, can also occur
as styles are modified or as the document is modified programmatically with
JavaScript.
To receive notifications, you need a document instance before parsing
starts, so that the listener can be registered first.
Performance Tips
Loading remote scripts and CSS documents is typically what slows
the parser down. There are generally two ways to deal with this:
(1) disable JavaScript and/or CSS, or
(2) implement some sort of caching mechanism.
All Cobra requests are processed through
UserAgentContext.createHttpRequest(), so the way
Cobra processes requests can be changed either by implementing the UserAgentContext
and HttpRequest interfaces or by extending the simple
implementations of these interfaces provided with Cobra.
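The caching idea can be sketched independently of Cobra. The ResourceCache class below is a hypothetical, JDK-only in-memory cache keyed by URL; it is an illustration under simplifying assumptions (no expiry, no size limit, no HTTP cache headers), not Cobra's own mechanism. The loader function stands in for whatever code actually fetches the bytes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical helper, not part of Cobra.
class ResourceCache {
    private final ConcurrentMap<String, byte[]> entries = new ConcurrentHashMap<>();
    private final Function<String, byte[]> loader;

    /** The loader fetches the content for a URL that is not cached yet. */
    ResourceCache(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    /** Returns the cached content for the URL, invoking the loader on a miss. */
    byte[] get(String url) {
        return entries.computeIfAbsent(url, loader);
    }

    /** Number of URLs currently cached. */
    int size() {
        return entries.size();
    }
}
```

A custom HttpRequest implementation returned from createHttpRequest() could consult such a cache before going to the network. How that wiring looks depends on the HttpRequest interface and is not shown here.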
Whether JavaScript is enabled is determined by the
UserAgentContext.isScriptingEnabled() method, so
JavaScript can be disabled by calling the corresponding
setter in SimpleUserAgentContext. Similarly,
loading of external CSS documents is controlled by the
UserAgentContext.isExternalCSSEnabled() method.
Customizing or Disabling Arbitrary Elements
Before Cobra explicitly supported disabling CSS or JavaScript,
a general-purpose technique could be used to achieve the
same result, and it still applies to other elements. Essentially,
the HTMLDocumentImpl class can be extended
and its createElement, createTextNode
and other such methods can be overridden to provide
custom node instances.
Examples
- Find images in a page.
This example illustrates how to use the getImages()
method of HTMLDocumentImpl to get a list of the image
elements in a page. This is equivalent to using the
document.images property in JavaScript.
- Using XPath and an XML DOM.
HTMLDocumentImpl is a class that implements the
standard W3C Document interface, so utilities
that use Document, e.g. XPath, work with documents
parsed by Cobra. It is also possible to use a standard XML
Document instance in conjunction with the Cobra
HTML parser, as this example illustrates. It uses XPath
to retrieve all the "A" links from a page. (Note that JavaScript
won't work with a standard XML document.)
- Simulation of form submission.
In Cobra, form submission is considered a function
of the renderer, i.e. something you would expect to have
in a browser. It is nevertheless possible to provide a simple
implementation of the rendering context that does not do
any rendering, in order to allow form submissions to occur.
The example does the equivalent of invoking form.submit() in JavaScript.
It illustrates how to retrieve the main page
from the MetaCrawler search engine, populate its search
form, submit it, and list the first-page results of the
query. It also disables JavaScript and external CSS.
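The XPath item above relies on the fact that XPath utilities accept any W3C Document. A minimal sketch of that step, using only the JDK, might look as follows. The LinkFinder class and its methods are hypothetical names introduced here. The parseWellFormed method is only a stand-in that builds a Document from well-formed markup with the JDK's XML parser; with Cobra, the Document would be the one populated by the HTML parser instead. The lowercase "a" in the expression is also an assumption: XPath name tests are case-sensitive, so the expression has to match the element names the parser actually created.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Cobra.
class LinkFinder {
    /** Returns the href attribute of every "a" element, in document order. */
    static List<String> findLinks(Document document) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//a", document, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for the parsing step: builds a Document from well-formed markup. */
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }
}
```

Because findLinks depends only on the Document interface, it does not care which parser produced the tree.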