list of natural language processing resources and tools


Post by mcthulhu » Sat May 06, 2017 12:12 pm

I thought I would offer a few additions to reineke's list of language tools at viewtopic.php?f=19&t=2900. I have my own collection of bookmarks and notes on Natural Language Processing (NLP) tools; this includes some Dutch support per zhuzilu's recent question.

"Tool" is a broad concept; these do a wide variety of useful things. Some of these are ready to use or even have Web interfaces; some are software libraries meant to support other tools, in a variety of programming languages. The range of natural languages covered out-of-the-box by various tools varies wildly. Some of these links I probably found here. I note them as I come across them and have not attempted to organize them. This list probably barely scratches the surface; there is a lot out there, with more coming all the time. I'm still exploring, and will probably end up integrating some of the tools below with mine (which is the main reason I've been exploring - I don't want to recreate components that are already available).
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
Frog's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, will assign a dependency graph to each sentence, will identify the base phrase chunks in the sentence, and will attempt to find and label all named entities.

See also Dutch tools available in a VirtualBox image, etc.
Google Cloud Natural Language API provides analysis of entities, sentiment, and syntax, including POS, lemma, and morphology for several languages; nice sentence diagrams! Supports Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Spanish. Requires a Google Cloud account.
High-quality TTS voices for 24 languages; requires an Amazon Web Services account.

Amazon Lex - for building voice and text chatbots.
Machine translation for about 100 languages, I think (maybe too well known to list).
A straightforward & modular NLP, machine learning & fuzzy matching library for JavaScript. - English language only
Voyant Tools is a web-based reading and analysis environment for digital texts; supports 10 languages, including Japanese.
links to a number of software tools for corpus processing
Gensim is a Python library for topic modeling, etc.
NoSketch Engine, an open-source project combining Manatee and Bonito into a powerful and free corpus management system. NoSketch Engine is a limited version of the software powering the famous Sketch Engine service, a commercial variant offering word sketches, thesaurus, keyword computation, user-friendly corpus creation and many other excellent features.
a corpus management tool including corpus building and indexing, fast querying and basic statistical measures. It uses a fast indexing library called Finlib.
"industrial-strength natural language processing"; Python library, features include tokenization, sentence segmentation, word vectors, POS tagging, named entity recognition; statistical models for English and German
German morphological analysis
Berkeley word aligner
Simple, fast unsupervised word aligner
platform dedicated to ANLP (Arabic Natural Language Processing)
... -learners/ - MS blog post on Engkoo software aids for Chinese learners of English; “It unifies human translation mined from the web, machine translation, and a language-learning experience into one user-friendly search-and-explore interface.... The technology itself is language independent and can be extended to other language pairs in the future.” Something to keep an eye on.
... -to-speech
... synthesis/
dsl2mobi (inflection word lists)
sentence pairs linked to TED etc.

Pattern (Python 2.7 NLP library) supports Dutch, English, Spanish, German, French and Italian.
NLTK, Natural Language Toolkit, the main Python library for NLP
... 2prxyeds5k - German NLTK; GermaNLTK is an integration of GermaNet and Projekt Deutscher Wortschatz into NLTK. GermaNet is a semantically-oriented dictionary of German, similar to WordNet.
NLTK's WordNet interface
links to WordNets for many languages, some restricted access; there is one for Dutch with 117914 synsets.
GATE open source NLP framework, very powerful
FreeLing open source language analysis tool suite; a C++ library providing language analysis functionalities (morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense Disambiguation, Semantic Role Labelling, etc.) for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others).
EU project; OpeNER aims to detect and disambiguate entity mentions and perform sentiment analysis and opinion detection on texts, for example to extract the sentiment and opinions of customers; 6 languages supported
Sketch Engine contains 400 ready-to-use corpora in 90+ languages, each having a size of up to 20 billion words; lots of tools - alignment, bilingual term extraction, thesaurus, concordance, etc. A word sketch is a one-page summary of the word’s grammatical and collocational behaviour. It shows the word’s collocates categorised by grammatical relations such as words that serve as an object of the verb, words that serve as a subject of the verb, words that modify the word etc.
SkELL (Sketch Engine for Language Learning) is a simple tool for students and teachers of English to easily check whether or how a particular phrase or a word is used by real speakers of English. Also for Russian
... esReleases - Moses is the main open source machine translation system
... ng%29+Home
Joshua is another machine translation system, now an Apache project. For prebuilt language packs for a number of language pairs, see ... uage+Packs.
Apertium open-source machine translation; stable releases for 43 language pairs.
simple interface for non-developers, for Windows and Mac OS X
HFST, Helsinki Finite-State Transducer toolkit is intended for processing natural language morphologies. The toolkit is demonstrated by wide-coverage implementations of a number of languages of varying morphological complexity. Python API
"Natural" is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, string similarity, and some inflections are currently supported. Mostly for English, but Russian and Spanish stemming supported
retext is an ecosystem of plug-ins for processing natural language.
NLP for English
open source search engine, supports multilingual corpora
modern C++ data sciences toolkit, includes language model support, UTF8 support for analysis on various languages; has an overview of basic text analysis.
... jects.html
POS Tagger and Lemmatizer for English, Dutch, French, German and Italian
POS Tagger and Lemmatizer for Spanish
... _web.ipynb - document clustering with Python
used by Lilt to build neural morphology engines
toolkit for corpus linguistics

lttoolbox, associated with the Apertium machine translation project, is a fast (~58000 words/second) Free and Open Source finite-state toolkit.
lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words.

Apy Apertium server in Python

Paper Machines, an add-on for the bibliographic management software Zotero, incorporates a range of text analysis tools into your web browser; for the digital humanities.
French WordNet
lexique morphologique et syntaxique (morphological and syntactic lexicon)
Morfette is a tool for supervised learning of inflectional morphology. Given a corpus of sentences annotated with lemmas and morphological labels, and optionally a lexicon, Morfette learns how to morphologically analyse new sentences.
, which includes ... alysis-176
This tool produces a full morphological analysis of the submitted text for Czech, English, French, German, Spanish, Hungarian, Italian, Polish and Russian.
Part-of-speech tagging is also available; this tool assigns a part-of-speech (POS) tag to each word of the input text. ... API%20Docs has instructions on how to call the API, but there is a Web interface at the fst-nlp-tools link.
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.

Apache Stanbol does text analysis and enrichment, using Apache OpenNLP - ... ancer/nlp/. There's a list of supported languages at ... -languages.
a PowerShell script to let Windows 10 users use system TTS voices from the command line - can be integrated with other tools
... dprofiler/
A freeware tool for profiling the vocabulary level and complexity of texts; best support for English, based on tools from Paul Nation's site.
... b-programs - vocabulary analysis programs for English
Japanese NLP library
overview of Japanese text analysis tools
for Russian morphological analysis.
Frequency distribution of words in texts. Tokenize, remove stopwords, stem words, count words.
Supported languages: da, de, en, es, fi, fr, hu, it, nl, no, pt, ro, ru, se, tr.
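
As a rough illustration of the tokenize / remove stopwords / stem / count pipeline described above, here is a minimal Python sketch using only the standard library. It is not this tool's code: the tiny stopword set and the crude `naive_stem` helper are placeholders of my own, and a real stemmer (Snowball, for instance) would do much better.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())

def naive_stem(word):
    """Crude English suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_frequencies(text):
    """Tokenize, drop stopwords, stem, and count the remaining words."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(naive_stem(t) for t in tokens)

freqs = word_frequencies("The cats are chasing the cat and the dogs chased a dog.")
print(freqs.most_common(3))
```

Swapping in a per-language stopword list and stemmer is what would make something like this usable beyond English.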
An efficient Chinese text segmentation tool
German lemmatizer
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

Overview is a tool for analyzing large sets of documents. It includes a sophisticated search engine, word clouds, entity detection, and topic-based document clustering. If that’s not good enough, you can write your own plugins using the API.

Textable is an open source program for text analysis. It offers a set of basic text-analytic components (e.g. import text from files, segment into words, measure segment diversity, etc.), which the user combines using a visual interface to build custom analytic workflows.

Website: for Portuguese and etc.?

Italian NLP: ... y/releases

For Russian morphological analysis, lemmatization, etc. see phpmorphy, which also supports English and German.

Lemmatization (obtaining the normal form of a word);
Obtaining all forms of a word;
Obtaining grammatical information for a word (part of speech, case, conjugation, etc.);
Changing the form of a word according to specified grammatical characteristics;
Changing the form of a word following a given pattern.

On GitHub (I mostly look for Node.js modules there), Rakuten MA - morphological analyzer (word segmentation + PoS Tagger) for Chinese and Japanese written purely in JavaScript.

Tokenize, POS Tagger, lemmatizer and stemmer - I use this one. The author is very responsive - a bug I asked about was fixed within hours.

stopwords-ja, etc.
stopwords-iso - comprehensive collection! probably easier than using all the others individually


kuroshiro.js is a Japanese language utility, mainly for converting sentences that mix kanji and kana to Hiragana, Katakana or Romaji, with furigana and okurigana modes supported.

nihongo.js has methods for analysing characters and sentences and parsing sentences.


Fuzzlogia is a simple Japanese-kanji-reading-aware fuzzy search library written in JavaScript.

A command line sidekick app for reading Japanese with the dictionary
A Japanese language parser producing NLCST nodes.
romaji - JavaScript utility that makes conversion between Japanese romaji and kana. Currently supports Hepburn system only.

A micro-library for declining words in Russian.

A JS library for morphological analysis, tokenization, and other NLP tasks for Russian.

verbix (scraper module)

german - command-line verb conjugator, tester

npm install germansynonyms


Detect the ease of reading a text according to the German variation of the Flesch Reading Ease formula.
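
For reference, the German adaptation of Flesch Reading Ease that is usually cited is Amstad's formula, 180 - ASL - 58.5 * ASW, where ASL is the average number of words per sentence and ASW the average number of syllables per word. The Python sketch below is my own illustration of that formula, not this module's code; the vowel-group syllable counter in particular is a rough assumption and will miscount some words.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (rough heuristic, an assumption)."""
    groups = re.findall(r"[aeiouyäöü]+", word.lower())
    return max(1, len(groups))

def flesch_german(text):
    """Amstad's German Flesch Reading Ease: 180 - ASL - 58.5 * ASW."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, umlauts included
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                          # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 180 - asl - 58.5 * asw

score = flesch_german("Das ist ein Haus. Es ist sehr alt.")
print(round(score, 1))
```

Higher scores mean easier text; short sentences made of one-syllable words, as in the sample, land at the very top of the scale.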

Conjugation of irregular verbs in German.

German dictionary in terminal, powered by Wiktionary.


npm install parse-japanese
Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation.
Create gloss for Japanese texts (based on Kuromoji tokenizer)
A Japanese verb conjugator and unconjugator
Hackbright Project: OCR for Japanese

Library for automatically rendering Furigana for input Japanese text.

Japanese text difficulty analyzer

Japanese morphological analysis app built with Electron
Japanese language support for retext.
Natural language processor powered by plugins
Yet another Japanese morphological analyzer
KoNLPy is a Python package for natural language processing of the Korean language.
Korean morphological analyzer - JavaScript

R package for Korean NLP

Korean verb conjugator implemented in JavaScript with native Android and iOS UIs
... mecab.html

A simple dictionary for Korean, powered by National Institute of the Korean Language
A free Japanese dictionary and learning assistant

The Stanford Word Segmenter currently supports Arabic and Chinese. (The Stanford Tokenizer can be used for English, French, and Spanish.)
The Stanford Word Segmenter is incorporated into nltk's tokenize package.

Stanford CoreNLP - Arabic, Chinese, English, French, German, Spanish
Other people have developed models using or compatible with CoreNLP for several further languages. They may or may not be compatible with the most recent release of CoreNLP that we provide.

Italian: Tint by Alessio Palmero Aprosio and Giovanni Moretti (Fondazione Bruno Kessler) largely builds on CoreNLP, but adds some other components, to provide a quite complete processing pipeline for Italian.
Portuguese (European): LX parser by Patricia Gonçalves and João Silva (University of Lisbon) provides a constituency parser. It was built with a now quite old version of Stanford NLP.
Swedish: Andreas Klintberg has built an NER model and a POS tagger.
Stanford Log-linear Part-Of-Speech Tagger
The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.
... s-for-nlp/
English lemmatizer etc.
language codes as JSON
Python script to get JSON for codes
HanNanum Korean Morphological Analyzer & POS Tagger (Java version)
CST's lemmatizer uses affix rules (affix: prefix, infix, suffix, circumfix) and has been trained for a number of languages. Trained affix rules are available for the following languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, Farsi, French, German, Greek, Hungarian, Icelandic, Italian, Latin, Macedonian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, and Ukrainian.
French lemmatizer
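
To give a feel for what affix-rule lemmatization (as in CST's lemmatizer) amounts to, here is a toy Python sketch. It is only an illustration under my own assumptions: the handful of English suffix rules is hypothetical and hand-written, whereas the real tool learns its rules from training data and also handles prefixes, infixes, and circumfixes.

```python
# Hypothetical suffix rules (suffix -> replacement), checked in order,
# longer suffixes first. Illustration only; not CST's actual rule format.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def lemmatize(word, rules=SUFFIX_RULES):
    """Apply the first matching suffix rule; otherwise return the lowercased word."""
    lower = word.lower()
    for suffix, replacement in rules:
        stem = lower[: len(lower) - len(suffix)]
        # Require at least two remaining letters so short words like "is" survive.
        if lower.endswith(suffix) and len(stem) >= 2:
            return stem + replacement
    return lower

for w in ["studies", "classes", "walking", "jumped", "cats", "is"]:
    print(w, "->", lemmatize(w))
```

Irregular forms ("went", "mice") are exactly where simple rules like these fail, which is why trained rule sets or lexicon lookups matter.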

TreeTagger (recommended): supports fr, en, es, de, ru, da
wrapper for Russian morphological analyzer

pymorphy2 is a morphological analyzer and generator for Russian and Ukrainian languages.
free morphological analyzer for Turkish

Omorfi–Open source morphological analyzer of Finnish

AraMorph is a Java port of the Perl tool of the same name, developed by Tim Buckwalter on behalf of the Linguistic Data Consortium (LDC).

Perstem - a stemmer and light morphological analyzer for Persian
Chinese NLP toolkit

textsum: Text summarization with TensorFlow

blog: ... rflow.html
github: ... er/textsum
How to Run Text Summarization with TensorFlow

blog: ... .mll1rqgjg

There are a ton of ready-made Docker images to host various language tools, which might be the easiest way to install some of them. These include:
KNP, a Japanese Dependency and Case Structure Analyzer.
It also contains JUMAN, a User-Extensible Morphological Analyzer for Japanese.
DIT4C is a platform for hosting data analysis tools "in the cloud" using containers.

You can probably find more on Docker Hub.
LF Aligner creates translation memories from parallel texts.
OmegaT Computer-Assisted Translation software for translators (fast and robust)
formerly commercial Computer-Assisted Translation software, now open-source
formerly commercial Computer-Assisted Translation software, now free
set of tools to support localization and translation processes
TMLookup is an open-source tool to search very large translation memories (bilingual/multilingual databases), without a CAT tool. TMLookup can handle any number of languages and return search results in well under a second even if the database contains tens of millions of entries. By the author of LF Aligner.
terminology management tool that can handle lots of formats; current Unicode version is commercial, older non-Unicode one is free

(There are a lot of other resources out there for the CAT category, and I haven't listed any of the main commercial tool vendors; this is probably enough CAT resources for now, though I think CAT software and translation memories in general are very useful resources for language learners - the easiest way by far to use bilingual corpora.)
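
For anyone who hasn't used a translation memory, the core idea is just fuzzy lookup over aligned sentence pairs. A minimal Python sketch, assuming a made-up three-entry English-Dutch memory and using the standard library's `difflib` for matching; real CAT tools use their own indexing and scoring, so treat this only as an illustration of the concept:

```python
import difflib

# Hypothetical mini translation memory: source sentence -> target sentence.
TM = {
    "Where is the train station?": "Waar is het station?",
    "I would like a cup of coffee.": "Ik wil graag een kopje koffie.",
    "How much does this cost?": "Hoeveel kost dit?",
}

def tm_lookup(query, memory=TM, cutoff=0.6):
    """Return (source, target) pairs whose source resembles the query, best first."""
    matches = difflib.get_close_matches(query, list(memory), n=3, cutoff=cutoff)
    return [(src, memory[src]) for src in matches]

results = tm_lookup("Where is the bus station?")
for src, tgt in results:
    print(src, "=>", tgt)
```

The `cutoff` parameter plays the role of the "fuzzy match threshold" that CAT tools expose; lowering it returns looser matches.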

And of course there is, which includes AntConc, a free corpus analysis toolkit and concordancer (monolingual); AntPConc, a freeware **parallel** corpus analysis toolkit for concordancing and text analysis using UTF-8 encoded text files; AntWordProfiler, which I might have mentioned above; ProtAnt, TagAnt, and many other cool tools, all very easy to install and use.
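
In case "concordancer" is unfamiliar: it shows every occurrence of a word with a few words of context on each side (keyword in context, or KWIC). Here is a bare-bones Python sketch of the idea; it is my own illustration with a hypothetical `kwic` helper, not how any of the tools above are implemented.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each keyword hit."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            hits.append((left, tok, right))
    return hits

sample = "The cat sat on the mat. Then the cat saw another cat outside."
for left, match, right in kwic(sample, "cat", width=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>15} [{match}] {right}")
```

Lining the keyword up in a column is what makes collocation patterns easy to spot by eye.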

Have I left out anything obvious?