Open Source projects, categorized.
[1 users on Ohloh]
CORSIS (formerly Tenka Text) is a performance-oriented, open-source library for corpus analysis. It uses typed assembly, task-specific compilers, and parallelization to combine high performance with an elegant design. The project's demonstration GUI includes Wordlister, an advanced, extremely fast graphical wordlist tool, and a regex concordance tool. CORSIS is the open-source answer to WordSmith Tools.
[1 users on Ohloh]
Raingrams is a flexible and general-purpose ngrams library written in Ruby. Raingrams supports ngram sizes greater than 1, text/non-text grams, multiple parsing styles and open/closed vocabulary models.
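As a rough illustration of what an n-gram library computes, the following Python sketch counts word n-grams and optionally applies a closed vocabulary. It is not the Raingrams API (Raingrams is a Ruby library); the function names, the whitespace tokenizer, and the "<unk>" placeholder are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n=2, vocabulary=None):
    """Count word n-grams in text.

    With vocabulary=None the model is open: every token is kept.
    With a vocabulary set the model is closed: unknown tokens are
    replaced by the placeholder "<unk>" (a convention assumed here).
    """
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    if vocabulary is not None:
        tokens = [t if t in vocabulary else "<unk>" for t in tokens]
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    counts = ngram_counts("to be or not to be", n=2)
    for gram, count in counts.most_common():
        print(" ".join(gram), count)
```

A real library would also handle punctuation, sentence boundaries, and non-text grams, which this sketch leaves out.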
[0 users on Ohloh]
Tags: compare content analysis text similarity syntactical
Computes the syntactic similarity of two texts. A Java program that compares two files and returns, as a percentage, how similar they are.

For example: java -jar ss.jar c:/tmp/a.txt c:/tmp/b.txt
The output would be: Similarity is 89.60159%
Some texts are nearly identical to each other, for example almost-duplicated news articles. The only difference might be a different advertisement in the middle of the text or a slightly modified headline.
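The listing does not say which algorithm the tool uses. One common way to score near-duplicate texts is the Jaccard similarity of word shingles, sketched below in Python. The function names and the shingle size of 3 are assumptions for illustration, not the behavior of ss.jar.

```python
def shingles(text, n=3):
    """Return the set of n-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < n:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity_percent(a, b, n=3):
    """Jaccard similarity of the two shingle sets, as a percentage."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 100.0  # two empty texts are treated as identical (assumption)
    return 100.0 * len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox leaps over the lazy dog"
    print(f"Similarity is {similarity_percent(a, b):.2f}%")
```

Because a single changed word alters up to n shingles, this measure is sensitive to small local edits such as a modified headline, while a swapped advertisement block only affects the shingles that overlap it.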