Java HTML parsers

etc tools 2005. 10. 20. 19:11

Alternative HTML Parsers

This package was originally written in the latter half of 2002. At that time I evaluated 6 other parsers, none of which were capable of achieving my aims. Most couldn't reproduce a typical HTML document without change, none could reproduce a source document containing badly formatted or non-HTML components without change, and none provided a means to track the positions of nodes in the source text. A list of these parsers and a brief description follows, but please note that I have not revised this analysis since before this package was written. Please let me know if there are any errors.

  • JavaCC HTML Parser by Quiotix Corporation (http://www.quiotix.com/downloads/html-parser/)
    GNU GPL licence, with an expensive licence fee for use in commercial applications. Does not support document structure (parses into a flat node stream).
  • Demonstrational HTML 3.2 parser bundled with JavaCC. Virtually useless.
  • JTidy (http://jtidy.sourceforge.net/)
    Supports document structure, but by its very nature it "tidies" up anything it doesn't like in the source document. At first glance it looks like the positions of nodes in the source are accessible, at least in protected start and end fields in the Node class, but these are pointers into a different buffer and are of no use.
  • javax.swing.text.html.parser.Parser
    Comes standard in the JDK. Supports document structure. Does not track the positions of nodes in the source text, but can be easily modified to do so (although I am not sure of the legal implications of such modifications). Requires a DTD to function, but only comes with an HTML 3.2 DTD, which is unsuitable. Even if an HTML 4.01 DTD were found, the parser itself might need tweaking to cater for the new element types. The DTD needs to be in the format of a "bdtd" file, which is a binary format used only by Sun in this parser implementation. I have found many requests for a 4.01 bdtd file in newsgroups etc. on the web, but they all remain unanswered. Building it from scratch is not so easy.
  • Kizna HTML Parser v1.1 (http://htmlparser.sourceforge.net/)
    GNU LGPL licence. Version 1.1 was very simple, without support for document structure. I have since revisited this project at SourceForge (early 2004), where version 1.4 is now available. There are now two separate libraries, one with and one without document structure support. It now also claims to be capable of reproducing source text verbatim.
  • CyberNeko HTML Parser (http://www.apache.org/~andyc/neko/doc/html/index.html)
    Apache-style licence. Supports document structure. Based on the very popular Xerces XML parser. At the time of evaluation this parser didn't regenerate the source accurately enough.
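
The two requirements that run through the list above, reproducing the source without change and tracking node positions, can be illustrated with a small sketch. It is not how any of the listed parsers works: TagSpanScanner and Segment are hypothetical names, and the "tag" rule is deliberately naive (anything from '<' to the next '>'). Each segment stores only begin and end offsets into the original text, so joining the segments gives back the source verbatim, badly formed markup included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from any parser listed above.
class TagSpanScanner {

    /** One slice of the source: a tag, or the text between two tags. */
    static final class Segment {
        final boolean isTag;
        final int begin; // inclusive offset into the source
        final int end;   // exclusive offset into the source

        Segment(boolean isTag, int begin, int end) {
            this.isTag = isTag;
            this.begin = begin;
            this.end = end;
        }

        String text(String source) {
            return source.substring(begin, end);
        }
    }

    // Naive assumption: a tag is anything from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Splits the source into tag and text segments without altering it. */
    static List<Segment> scan(String source) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = TAG.matcher(source);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                segments.add(new Segment(false, last, m.start())); // text before the tag
            }
            segments.add(new Segment(true, m.start(), m.end()));
            last = m.end();
        }
        if (last < source.length()) {
            segments.add(new Segment(false, last, source.length())); // trailing text
        }
        return segments;
    }

    /** Joins the segments again; the result equals the original source. */
    static String rebuild(String source, List<Segment> segments) {
        StringBuilder sb = new StringBuilder();
        for (Segment s : segments) {
            sb.append(s.text(source));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<P>Unclosed <b>bold<p>next";
        for (Segment s : scan(html)) {
            System.out.println((s.isTag ? "tag  " : "text ") + s.begin + "-" + s.end
                    + " " + s.text(html));
        }
        System.out.println(rebuild(html, scan(html)).equals(html));
    }
}
```

Because nothing is normalised or "tidied", the unclosed elements in the sample input survive the round trip untouched, which is the behaviour the evaluation above was looking for.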

 

Source: http://jerichohtml.sourceforge.net/

 

Open Source HTML Parsers in Java

NekoHTML

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
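
As a rough illustration of what "tag balancing" means, the sketch below closes elements whose end tags are missing, using a stack of open element names. It is a toy under stated assumptions, not NekoHTML's algorithm: ToyTagBalancer is a hypothetical name, only p and li are treated as having optional end tags, and void elements such as br are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy balancer; real tag balancers follow far more detailed rules.
class ToyTagBalancer {

    // Either a start/end tag (group 1 = "/" for end tags, group 2 = name) or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+");

    // Assumption for this sketch: a new <p> or <li> closes an open one of the same name.
    private static final Set<String> IMPLIED_END = Set.of("p", "li");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost open element on top
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) == null) { // plain text: copy through
                out.append(m.group());
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) { // start tag
                if (IMPLIED_END.contains(name) && name.equals(open.peek())) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.push(name);
                out.append(m.group());
            } else if (open.contains(name)) { // end tag matching an open element
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>'); // close inner elements first
                }
                open.pop();
                out.append(m.group());
            }
            // An end tag with no matching open element is dropped.
        }
        while (!open.isEmpty()) { // close whatever is still open at end of input
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<ul><li>one<li>two<b>bold</ul>"));
        // prints <ul><li>one</li><li>two<b>bold</b></li></ul>
    }
}
```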

HTML Parser

A fast real-time parser for real-world HTML.

Java HTML Parser

HTML Parser that produces a stream of tag objects, which can be further parsed into a searchable tree structure.

Jericho HTML Parser

A simple but powerful Java library for parsing and modifying HTML documents, including analysis of arbitrary HTML forms to determine the structure of submitted data.

JTidy

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.
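
To show what working through a DOM interface looks like, here is a minimal sketch that lists the text of every h1 element using the standard org.w3c.dom types. It assumes well-formed XHTML input and uses the JDK's built-in XML parser as a stand-in; JTidy's own entry point is not shown, and DomHeadingLister is a hypothetical name. The traversal code is the part that should carry over to any parser that hands back a DOM Document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example; the JDK XML parser stands in for an HTML-tolerant DOM parser.
class DomHeadingLister {

    /** Returns the trimmed text content of every h1 element in the markup. */
    static List<String> headings(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = doc.getElementsByTagName("h1");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element heading = (Element) nodes.item(i);
                // getTextContent() joins the text of all descendant nodes.
                result.add(heading.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // The stand-in parser rejects markup that is not well-formed.
            throw new IllegalStateException("markup could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1> Parsers </h1><p>x</p>"
                + "<h1>More <em>text</em></h1></body></html>";
        System.out.println(headings(xhtml));
    }
}
```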

TagSoup

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
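
The value of a SAX interface is that application code only has to implement a handler for parse events. The sketch below collects the href attribute of every a element. It is an assumption-laden stand-in: it runs on the JDK's built-in SAX parser, which accepts only well-formed input, and SaxLinkCollector is a hypothetical name. With a SAX-compliant HTML parser, the same handler would presumably be attached to that parser's XMLReader instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example; the JDK SAX parser stands in for an HTML-tolerant one.
class SaxLinkCollector extends DefaultHandler {

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so the element name is in qName.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    /** Parses the markup and returns the href values in document order. */
    static List<String> collect(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            SaxLinkCollector handler = new SaxLinkCollector();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return handler.links;
        } catch (Exception e) {
            // The stand-in parser rejects markup that is not well-formed.
            throw new IllegalStateException("markup could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><a href=\"http://jtidy.sourceforge.net/\">JTidy</a>"
                + "<p>text</p></body></html>";
        System.out.println(collect(xhtml));
    }
}
```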

HotSAX

HotSAX is a fast, small footprint, non-validating SAX2 parser for HTML/XML/XHTML. It can be used in simple web agents, page scrapers, and spiders. It is similar to the Apache Xerces parser, except that it can generate SAX events for badly formatted HTML as well.

 

Source: http://java-source.net/open-source/html-parsers

Posted by '김용환'