CORSIS (formerly Tenka Text) is a performance-oriented, open-source library for corpus analysis. It uses typed assembly, task-specific compilers, and parallelization to deliver high performance with an elegant design. The project's demonstration GUI includes Wordlister, an advanced, extremely fast graphical wordlist tool, and a regex concordance tool. CORSIS is the open-source answer to WordSmith Tools.
Raingrams is a flexible, general-purpose n-gram library written in Ruby. It supports n-gram sizes greater than 1, text and non-text grams, multiple parsing styles, and open and closed vocabulary models.
Computes the syntactic similarity of texts. A Java program that compares two files and returns, as a percentage, how similar they are.

For example: java -jar ss.jar c:/tmp/a.txt c:/tmp/b.txt
The output would be: Similarity is 89.60159%
Some texts are nearly identical to each other, for example near-duplicate news articles. The only difference may be a different advertisement in the middle of the text or a slightly modified headline.
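The description does not say how the program measures similarity, so the following is only a minimal sketch of one common approach to near-duplicate detection: Jaccard similarity over word shingles (overlapping runs of consecutive words). The function names `word_shingles` and `similarity_percent` and the shingle size of 3 are assumptions made for illustration; they are not taken from ss.jar.

```python
def word_shingles(text, size=3):
    """Return the set of overlapping word tuples of length `size` (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity_percent(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets, expressed as a percentage.

    This is an assumed stand-in for the tool's metric, not its actual algorithm.
    """
    a = word_shingles(text_a, size)
    b = word_shingles(text_b, size)
    if not a and not b:
        # Two empty texts are considered identical.
        return 100.0
    return 100.0 * len(a & b) / len(a | b)
```

With this measure, two articles that differ only in an inserted advertisement or a reworded headline still share most of their shingles and score high, while unrelated texts score near zero. A single changed word affects every shingle that contains it, so short texts are penalized heavily: "the quick brown fox jumps over the lazy dog" versus the same sentence with "leaps" scores 40%.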