A Java library for parsing, validating and manipulating XML documents. The latest version released, 2.9.1, provides support for the following standards and APIs:
* XML 1.0 (4th Edition)
* Namespaces in XML 1.0 (2nd Edition)
* XML 1.1 (2nd Edition)
* Namespaces in XML 1.1 (2nd Edition)
* W3C XML Schema 1.0 (2nd Edition)
* XInclude 1.0 (2nd Edition)
* OASIS XML Catalogs 1.1
* SAX 2.0.2
...
Libxml2 is the XML C parser and toolkit developed for the Gnome project (but usable outside of the Gnome platform).
Includes the xmllint tool for checking documents for well-formedness, validating documents against a DTD or XML Schema, and pretty printing XML input.
dom4j is an easy to use, open source library for working with XML, XPath, and XSLT on the Java platform, using the Java Collections Framework, and with full support for DOM, SAX, and JAXP.
Hpricot is a very flexible HTML parser, based on Tanaka Akira's HTree and John Resig's JQuery, but with the scanner recoded in C (using Ragel for scanning.) I've borrowed what I believe to be the best ideas from these wares to make Hpricot heaps of fun to use.
The SAXON package is a collection of tools for processing XML documents.
Xerces-C++ is a validating XML parser written in a portable subset of C++. A shared library is provided for parsing, generating, manipulating, and validating XML documents. Xerces-C++ is faithful to the XML 1.0 recommendation and many associated standards. Source code, samples and API documentation are provided with the parser. For portability, care has been taken to make minimal use of templates, no RTTI, and minimal use of #ifdefs.
ANother Tool for Language Recognition (ANTLR) is the name of a parser generator that uses LL(k) parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), first developed in 1989, and is under active development. Its maintainer is professor Terence Parr of the University of San Francisco.
Rome is a set of Atom/RSS Java utilities that make it easy to work in Java with most syndication formats. Today it accepts all flavors of RSS (0.90, 0.91, 0.92, 0.93, 0.94, 1.0 and 2.0) and Atom 0.3 feeds. Rome includes a set of parsers and generators for the various flavors of feeds, as well as converters to convert from one format to another. The parsers can give you back Java objects that are either specific for the format you want to work with, or a generic normalized SyndFeed object that le...
Expat is a fast, non-validating, stream-oriented XML parsing library.
Parse, Analyze and Manipulate Perl (without perl)
The ability to read, and manipulate Perl (the language) programmatically other than with perl (the application) was one that caused difficulty for a long time.
The cause of this problem was Perl's complex and dynamic grammar. Although there is typically not a huge diversity in the grammar of most Perl code, certain issues cause large problems when it comes to parsing.
Indeed, quite early in Perl's history Tom Christenson introdu...
Java Compiler Compiler is the most popular parser generator for use with Java applications. A parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar. In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation such as tree building (via a tool called JJTree included with JavaCC), actions, debugging, etc.
Sparse, the semantic parser, provides a compiler frontend capable of parsing most of ANSI C as well as many GCC extensions, and a collection of sample compiler backends, including a static analyzer also called "sparse". Sparse provides a set of annotations designed to convey semantic information about types, such as what address space pointers point to, or what locks a function acquires or releases.
Parse RSS and Atom feeds in Python
NLTK — the Natural Language Toolkit — is a suite of open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OSX and Linux.
Parsec is designed from scratch as an industrial-strength parser library. It is simple, safe, well documented (on the package homepage), has extensive
libraries and good error messages, and is also fast.
Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).
SimplePie puts the 'simple' back into 'really simple syndication'. Flexible enough to suit newbies and veterans alike, SimplePie's focus has been two-fold: speed and ease of use. By thinking about the most useful ways to handle blogs, news sites, and podcasts, we've come up with an API that makes it easy to do cool things with your feeds.
Spirit is an object-oriented, recursive descent parser generator framework implemented using template meta-programming techniques. Expression templates allow Spirit to approximate the syntax of Extended Backus Normal Form (EBNF) completely in C++. The Spirit framework enables a target grammar to be written exclusively in C++. EBNF grammar specifications can mix freely with other C++ code and, thanks to the generative power of C++ templates, are immediately executable.
Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities.
* PDF to text extraction
* Merge PDF Documents
* PDF Document Encryption/Decryption
* Lucene Search Engine Integration
* Fill in form data FDF and XFDF
...
Texy is one of the most complex lightweight markup language. It allows adding of images, links, nested lists, tables and has full support for typography and CSS.
Texy allows you to enter content using an easy to read Texy syntax which is filtered into structurally valid XHTML. No knowledge of HTML is required.