list of natural language processing resources and tools

Post by mcthulhu » Sat May 06, 2017 12:12 pm

I thought I would offer a few additions to reineke's list of language tools at viewtopic.php?f=19&t=2900. I have my own collection of bookmarks and notes on Natural Language Processing (NLP) tools; this includes some Dutch support per zhuzilu's recent question.

"Tool" is a broad concept; these do a wide variety of useful things. Some of these are ready to use or even have Web interfaces; some are software libraries meant to support other tools, in a variety of programming languages. The range of natural languages covered out-of-the-box by various tools varies wildly. Some of these links I probably found here. I note them as I come across them and have not attempted to organize them. This list probably barely scratches the surface; there is a lot out there, with more coming all the time. I'm still exploring, and will probably end up integrating some of the tools below with mine (which is the main reason I've been exploring - I don't want to recreate components that are already available).

https://languagemachines.github.io/frog/
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
Frog's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, will assign a dependency graph to each sentence, will identify the base phrase chunks in the sentence, and will attempt to find and label all named entities.

See also https://proycon.github.io/LaMachine/. Dutch tools available in a VirtualBox image, etc.

https://cloud.google.com/natural-language/ - Google Cloud Natural Language API provides analysis of entities, sentiment, and syntax, including POS, lemma, and morphology for several languages; nice sentence diagrams! Supports Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Spanish. Requires a Google Cloud account.

https://aws.amazon.com/polly/ - high-quality TTS voices for 24 languages; requires Amazon Web Services account.

Amazon Lex - for building voice and text chatbots.

https://translate.google.com/ - machine translation for about 100 languages, I think (maybe too well known to list)

https://yomguithereal.github.io/talisman/
A straightforward & modular NLP, machine learning & fuzzy matching library for JavaScript.

https://www.wordsapi.com/ - English language only
https://github.com/wordnik/swagger-js
https://github.com/neopunisher/pos-js

http://voyant-tools.org/
Voyant Tools is a web-based reading and analysis environment for digital texts; supports 10 languages, including Japanese.

https://github.com/nikolamilosevic86/SerbianStemmer

http://corpus.tools/ - links to a number of software tools for corpus processing

http://radimrehurek.com/gensim/ - Gensim is a Python library for topic modeling, etc.

https://nlp.fi.muni.cz/trac/noske - NoSketch Engine, an open-source project combining Manatee and Bonito into a powerful and free corpus management system. NoSketch Engine is a limited version of the software powering the well-known Sketch Engine service, a commercial variant offering word sketches, thesaurus, keyword computation, user-friendly corpus creation and many other excellent features. It is a corpus management tool that covers corpus building and indexing, fast querying, and basic statistical measures. It uses a fast indexing library called Finlib.

https://spacy.io/ - "industrial-strength natural language processing"; Python library, features include tokenization, sentence segmentation, word vectors, POS tagging, named entity recognition; statistical models for English and German

http://www.danielnaber.de/morphologie/ - German morphological analysis

https://code.google.com/archive/p/berkeleyaligner/ Berkeley word aligner

https://github.com/clab/fast_align - Simple, fast unsupervised word aligner

http://arabic.emi.ac.ma/safar/ - platform dedicated to ANLP (Arabic Natural Language Processing)

https://www.microsoft.com/en-us/researc ... -learners/ - MS blog post on Engkoo software aids for Chinese learners of English; “It unifies human translation mined from the web, machine translation, and a language-learning experience into one user-friendly search-and-explore interface.... The technology itself is language independent and can be extended to other language pairs in the future.” Something to keep an eye on.

https://textgrid.de/download

https://console.ng.bluemix.net/catalog/ ... -to-speech
http://mdn.github.io/web-speech-api/spe ... synthesis/
http://peterjc.com/wordmultisearch/#/search
http://www.lexiconista.com/datasets/lemmatization/
dsl2mobi (inflection word lists)
http://responsivevoice.org/text-to-speech-languages/
http://www.wordscope.com - sentence pairs linked to TED etc.

Pattern (Python 2.7 NLP library) supports Dutch, English, Spanish, German, French and Italian.
NLTK, Natural Language Toolkit, the main Python library for NLP - http://www.nltk.org/

https://docs.google.com/document/d/1rdn ... 2prxyeds5k - German NLTK; GermaNLTK is an integration of GermaNet and Projekt Deutscher Wortschatz into NLTK. GermaNet is a semantically-oriented dictionary of German, similar to WordNet.

http://www.nltk.org/howto/wordnet.html - NLTK's WordNet interface

http://globalwordnet.org/wordnets-in-the-world/ - links to WordNets for many languages, some restricted access; http://wordpress.let.vupr.nl/odwn/ is one for Dutch with 117914 synsets.

http://www.online-utility.org/text/analyzer.jsp

https://gate.ac.uk/ - GATE open source NLP framework, very powerful

http://nlp.lsi.upc.edu/freeling/node/1 - FreeLing open source language analysis tool suite; a C++ library providing language analysis functionalities (morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense Disambiguation, Semantic Role Labelling, etc.) for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others).

http://www.opener-project.eu/ - EU project, OpeNER aims to be able to detect and disambiguate entity mentions and perform sentiment analysis and opinion detection on the texts, to be able for example, to extract the sentiment and the opinion of customers; 6 languages supported

https://www.sketchengine.co.uk/ - Sketch Engine contains 400 ready-to-use corpora in 90+ languages, each having a size of up to 20 billion words; lots of tools - alignment, bilingual term extraction, thesaurus, concordance, etc. A word sketch is a one-page summary of the word’s grammatical and collocational behaviour. It shows the word’s collocates categorised by grammatical relations such as words that serve as an object of the verb, words that serve as a subject of the verb, words that modify the word etc.
https://www.sketchengine.co.uk/skell/ - SkELL (Sketch Engine for Language Learning) is a simple tool for students and teachers of English to easily check whether or how a particular phrase or a word is used by real speakers of English. Also for Russian

http://www.statmt.org/mosescore/index.p ... esReleases - Moses is the main open source machine translation system

https://cwiki.apache.org/confluence/dis ... ng%29+Home
Joshua is another machine translation system, now an Apache project. For prebuilt language packs for a number of language pairs, see https://cwiki.apache.org/confluence/dis ... uage+Packs.

http://wiki.apertium.org/wiki/Main_Page - Apertium open-source machine translation; stable releases for 43 language pairs. http://wiki.apertium.org/wiki/Apertium_Simpleton_UI - simple interface for non-developers, for Windows and Mac OS X

https://hfst.github.io/ - HFST, Helsinki Finite-State Transducer toolkit is intended for processing natural language morphologies. The toolkit is demonstrated by wide-coverage implementations of a number of languages of varying morphological complexity. Python API

https://github.com/NaturalNode/natural
"Natural" is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, string similarity, and some inflections are currently supported. Mostly for English, but Russian and Spanish stemming supported

https://github.com/wooorm/retext
retext is an ecosystem of plug-ins for processing natural language.

https://github.com/nlp-compromise/compromise - NLP for English

http://terrier.org/ - open source search engine, supports multilingual corpora
https://meta-toolkit.org/ - modern C++ data sciences toolkit, includes language model support, UTF8 support for analysis on various languages; https://meta-toolkit.org/profile-tutorial.html has an overview of basic text analysis.



http://staffwww.dcs.shef.ac.uk/people/A ... jects.html
POS Tagger and Lemmatizer for English, Dutch, French, German and Italian
POS Tagger and Lemmatizer for Spanish

https://nbviewer.jupyter.org/github/bra ... _web.ipynb - document clustering with Python

https://github.com/oscii-lab/lex - used by Lilt to build neural morphology engines
https://dev.panlex.org/api/

https://github.com/interrogator/corpkit - toolkit for corpus linguistics

lttoolbox, associated with the Apertium machine translation project, is a fast (~58000 words/second) Free and Open Source finite-state toolkit.
lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words.

APy - an Apertium server written in Python

Paper Machines, an add-on for the bibliographic management software Zotero, incorporates a range of text analysis tools into your web browser; for the digital humanities. https://github.com/papermachines/papermachines

http://alpage.inria.fr/~sagot/wolf.html French WordNet
http://alpage.inria.fr/~sagot/lefff.html lexique morphologique et syntaxique
https://sites.google.com/site/morfetteweb/
Morfette is a tool for supervised learning of inflectional morphology. Given a corpus of sentences annotated with lemmas and morphological labels, and optionally a lexicon, morfette learns how to morphologically analyse new sentences.

https://open.xerox.com/Services/arabic-morphology

https://open.xerox.com/Services/fst-nlp-tools, which includes
https://open.xerox.com/Services/fst-nlp ... alysis-176
This tool produces a full morphological analysis of the submitted text for Czech, English, French, German, Spanish, Hungarian, Italian, Polish and Russian.
Part-of-speech tagging is also available; this tool assigns a part-of-speech (POS) tag to each word of the input text.
https://open.xerox.com/Services/fst-nlp ... API%20Docs has instructions on how to call the API, but there is a Web interface at the fst-nlp-tools link.

https://opennlp.apache.org/
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.

Apache Stanbol does text analysis and enrichment, using Apache OpenNLP - https://stanbol.apache.org/docs/trunk/c ... ancer/nlp/. There's a list of supported languages at https://stanbol.apache.org/docs/trunk/c ... -languages.

https://github.com/exactmike/OutSpeech - a PowerShell script to let Windows 10 users use system TTS voices from the command line - can be integrated with other tools

http://www.laurenceanthony.net/software ... dprofiler/
A freeware tool for profiling the vocabulary level and complexity of texts; best support for English, based on tools from Paul Nation's site.

http://www.victoria.ac.nz/lals/about/st ... b-programs - vocabulary analysis programs for English
https://jprocessing.readthedocs.io/en/latest/ - Japanese NLP library
http://guides.library.upenn.edu/japanesetext - overview of Japanese text analysis tools

https://www.branah.com/unicode-converter

https://nlpub.ru/MaltParser for Russian morphological analysis.


https://github.com/Amberlamps/nlp-toolkit
Frequency distribution of words in texts. Tokenize, remove stopwords, stem words, count words.
Supported languages: da, de, en, es, fi, fr, hu, it, nl, no, pt, ro, ru, se, tr.
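The four steps that description names (tokenize, remove stopwords, stem, count) can be sketched in a few lines of plain Python. This is only an illustration of the general pipeline, not nlp-toolkit's own code or API: the stopword list and the naive_stem function are hypothetical stand-ins for the per-language stopword lists and real stemmers such a tool would use.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools ship far larger lists).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "is", "are", "to"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_frequencies(text):
    """Tokenize, drop stopwords, stem, and count what is left."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(naive_stem(t) for t in tokens)

freqs = word_frequencies("The cats are chasing the cat, and the dogs chased a cat.")
print(freqs.most_common(3))
```

Note that the crude stemmer produces non-words such as "chas" for both "chasing" and "chased"; that is normal for stemming, which only needs to map related forms to the same key.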


https://pypi.python.org/pypi/thulac/0.1.1
An efficient Chinese text segmentation tool

https://wiki.de.dariah.eu/display/TextGrid/Lemmatizer
German lemmatizer

http://mallet.cs.umass.edu/
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

Overview is a tool for analyzing large sets of documents. It includes a sophisticated search engine, word clouds, entity detection, and topic-based document clustering. If that’s not good enough, you can write your own plugins using the API. https://www.overviewdocs.com/

Textable is an open source program for text analysis. It offers a set of basic text-analytic components (e.g. import text from files, segment into words, measure segment diversity, etc.), which the user combines using a visual interface to build custom analytic workflows.

Website: http://textable.io

https://pypi.python.org/pypi/unitexlemmatizer/1.0.0 - for Portuguese, and possibly other languages?

Italian NLP: https://github.com/jacopofar/italian-nl ... y/releases

For Russian morphological analysis, lemmatization, etc. see phpmorphy, https://sourceforge.net/projects/phpmorphy/. Also supports English and German.

Lemmatization (obtaining the normal form of a word);
Obtaining all forms of a word;
Obtaining grammatical information for a word (part of speech, case, conjugation, etc.);
Changing the form of a word according to given grammatical characteristics;
Changing the form of a word following a given pattern.

On GitHub (I mostly look for Node.js modules there), Rakuten MA - morphological analyzer (word segmentation + PoS Tagger) for Chinese and Japanese written purely in JavaScript.

nlp-js-tools-french
Tokenize, POS Tagger, lemmatizer and stemmer - I use this one. The author is very responsive - a bug I asked about was fixed within hours.

stopwords-fr
stopwords-eo
stopwords-it
stopwords-ja, etc.
stopwords-iso - comprehensive collection! probably easier than using all the others individually

korean-text-analytics
https://insikk.github.io/awesome-korean-nlp/

kuroshiro.js is a Japanese language utility mainly for converting Kanji-mixed sentences to Hiragana, Katakana or Romaji, with furigana and okurigana modes supported.

nihongo.js has methods for analysing characters and sentences and parsing sentences.

cjk-tokenizer


Fuzzlogia is a simple Japanese-kanji-reading-aware fuzzy search library written in JavaScript.

jish-sidekick
A command line sidekick app for reading Japanese with the Jisho.org dictionary
parse-japanese
A Japanese language parser producing NLCST nodes.
romaji - JavaScript utility that converts between Japanese romaji and kana. Currently supports Hepburn system only.

proschet
Micro-library for declining words in Russian

az
JS library for morphological analysis, tokenization, and other NLP tasks for Russian.

verbix (scraper module)

german - command-line verb conjugator, tester

npm install germansynonyms

fleschDednlvgl

Detect the ease of reading a text according to the German variation of the Flesch Reading Ease formula
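For reference, the German variation of Flesch Reading Ease is usually given as Amstad's formula, 180 - ASL - 58.5 * ASW, where ASL is the average sentence length in words and ASW is the average number of syllables per word. Below is a rough sketch of that calculation. It is not the npm module's code; the function names are made up for illustration, and the syllable counter is a simple vowel-group heuristic, so scores will differ somewhat from tools that use proper hyphenation.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (rough heuristic, not a hyphenator)."""
    groups = re.findall(r"[aeiouyäöü]+", word.lower())
    return max(1, len(groups))

def flesch_de(text):
    """Amstad's German adaptation of Flesch Reading Ease: 180 - ASL - 58.5 * ASW."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)  # average sentence length in words
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 180 - asl - 58.5 * asw

easy = flesch_de("Das ist ein Haus. Es ist alt.")
hard = flesch_de("Die Wahrscheinlichkeitsrechnung erfordert außerordentliche Aufmerksamkeit.")
print(round(easy, 1), round(hard, 1))
```

Higher scores mean easier text, so the short sentences score well above the one built from long compound words.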

irregular-verbs-de
Conjugation of irregular verbs in German.

dict-de
German dictionary in terminal, powered by Wiktionary.

https://emorynlp.github.io/nlp4j/

https://github.com/muraken720/parse-japanese

npm install parse-japanese

https://github.com/lovell/hepburn
Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation.

https://github.com/katspaugh/kuromoji-gloss
Create gloss for Japanese texts (based on Kuromoji tokenizer) https://kuromoji.fluentcards.com

https://github.com/Pomax/node-jp-conjugations
A Japanese verb conjugator and unconjugator

https://github.com/sienatime/ocr-jpn
Hackbright Project: OCR for Japanese

https://github.com/spect88/muzukashii

autokana
Library for automatically rendering Furigana for input Japanese text.

Japanese text difficulty analyzer
https://github.com/asayamakk/shirakaba

Japanese morphological analysis app built with Electron

https://github.com/muraken720/retext-japanese
Japanese language support for retext.
https://github.com/wooorm/retext
Natural language processor powered by plugins

https://github.com/siikamiika/mecab-translate
https://github.com/taku910/mecab
Yet another Japanese morphological analyzer

https://github.com/konlpy/konlpy
KoNLPy is a Python package for natural language processing of the Korean language.

https://github.com/shineware/komoran-2.0
Korean morphological analyzer - Java

https://github.com/haven-jeon/KoNLP

R package for Korean NLP

https://github.com/dbravender/korean_conjugation

Korean verb conjugator implemented in JavaScript with native Android and iOS UIs http://dongsa.net
https://bitbucket.org/eunjeon/
https://bitbucket.org/eunjeon/mecab-ko-dic
https://dev.mysql.com/doc/refman/5.7/en ... mecab.html

https://github.com/ChalkPE/KoreanDictionary
A simple dictionary for Korean, powered by National Institute of the Korean Language

https://github.com/Gnurou/tagainijisho
A free Japanese dictionary and learning assistant http://www.tagaini.net

The Stanford Word Segmenter currently supports Arabic and Chinese. (The Stanford Tokenizer can be used for English, French, and Spanish.)
The Stanford Word Segmenter is incorporated into nltk's tokenize package.

Stanford CoreNLP - Arabic, Chinese, English, French, German, Spanish
Other people have developed models using or compatible with CoreNLP for several further languages. They may or may not be compatible with the most recent release of CoreNLP that we provide.

Italian: Tint by Alessio Palmero Aprosio and Giovanni Moretti (Fondazione Bruno Kessler) largely builds on CoreNLP, but adds some other components, to provide a quite complete processing pipeline for Italian.
Portuguese (European): LX parser by Patricia Gonçalves and João Silva (University of Lisbon) provides a constituency parser. It was built with a now quite old version of Stanford NLP.
Swedish: Andreas Klintberg has built an NER model and a POS tagger.
Stanford Log-linear Part-Of-Speech Tagger
The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.

http://johnlaudun.org/20170228-open-sou ... s-for-nlp/

https://github.com/Planeshifter/node-wordnet-magic English lemmatizer etc.

https://github.com/maikudou/iso639-js language codes as JSON
http://kaapstorm.com/post/html-in-json-out/ Python script to get JSON for codes

https://sourceforge.net/projects/hannanum/files/ HanNanum Korean Morphological Analyzer & POS Tagger (Java version)


http://cst.dk/online/lemmatiser/uk/
CST's lemmatizer uses affix rules (affix: prefix, infix, suffix, circumfix) and has been trained for a number of languages. Trained affix rules are available for the following languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, Farsi, French, German, Greek, Hungarian, Icelandic, Italian, Latin, Macedonian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, and Ukrainian.
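To give a feel for what rule-based lemmatization with affix rules looks like, here is a minimal sketch that handles suffixes only. The rules below are hand-written and hypothetical, chosen purely for illustration; CST's lemmatizer uses rules trained from data and also covers prefixes, infixes and circumfixes, none of which is attempted here.

```python
# Hypothetical suffix rules (suffix -> replacement), ordered longest first.
# Hand-written for illustration only; a trained lemmatizer derives its rules from data.
RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def lemmatize(word, rules=RULES, min_stem=3):
    """Apply the first suffix rule that matches and leaves a long enough stem."""
    lower = word.lower()
    for suffix, replacement in rules:
        stem = lower[: len(lower) - len(suffix)]
        if lower.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return lower  # no rule applied; return the word unchanged

print([lemmatize(w) for w in ["studies", "walked", "talking", "books", "is"]])
```

The min_stem guard is what keeps short words like "is" from being mangled; a handful of rules like these obviously misses irregular forms, which is why trained rule sets per language are worth having.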

https://github.com/ecirtap/docker-flemm French lemmatizer

TreeTagger (recommended): supports fr, en, es, de, ru, da

https://github.com/koorchik/node-mystem3
wrapper for Russian morphological analyzer

pymorphy2 is a morphological analyzer and generator for Russian and Ukrainian languages.

http://coltekin.net/cagri/trmorph/
free morphological analyzer for Turkish

Omorfi - open-source morphological analyzer of Finnish

AraMorph is a Java port of the Perl product of the same name, developed by Tim Buckwalter on behalf of the Linguistic Data Consortium (LDC)

http://xixona.dlsi.ua.es/~fran/persian/

Perstem - a stemmer and light morphological analyzer for Persian
https://github.com/jonsafari/perstem

https://github.com/FudanNLP/fnlp Chinese NLP toolkit

textsum: Text summarization with TensorFlow

blog: https://research.googleblog.com/2016/08 ... rflow.html
github: https://github.com/tensorflow/models/tr ... er/textsum
How to Run Text Summarization with TensorFlow

blog: https://medium.com/@surmenok/how-to-run ... .mll1rqgjg
github: https://github.com/surmenok/TextSum

There are a ton of ready-made Docker images to host various language tools, which might be the easiest way to install some of them. These include:

https://hub.docker.com/r/nkmry/knp/
KNP, a Japanese Dependency and Case Structure Analyzer.
It also contains JUMAN, a User-Extensible Morphological Analyzer for Japanese.

https://hub.docker.com/r/yasuyuky/nhocr/

https://hub.docker.com/r/heryandi/python3-nltk-gensim/

https://hub.docker.com/r/hltcoe/nltk/
https://hub.docker.com/r/ejwhite/flask-nltk/
https://hub.docker.com/r/wiseio/datascience-docker/
DIT4C is a platform for hosting data analysis tools "in the cloud" using containers. https://dit4c.github.io/
https://github.com/dit4c/dit4c
https://hub.docker.com/r/interrogator/corpkit/

https://hub.docker.com/r/kisad/corenlp-german/

https://hub.docker.com/r/vlall/moses-api/

https://hub.docker.com/r/amake/moses-smt/

https://hub.docker.com/r/hltcoe/stanford/

You can probably find more on Docker Hub.

https://sourceforge.net/projects/aligner/ - LF Aligner creates translation memories from parallel texts.

http://www.omegat.org/en/omegat.html - OmegaT Computer-Assisted Translation software for translators (fast and robust)

https://github.com/heartsome/translationstudio8 - formerly commercial Computer-Assisted Translation software, now open-source

http://felix-cat.com/ - formerly commercial Computer-Assisted Translation software, now free

http://okapiframework.org/ - set of tools to support localization and translation processes

http://www.farkastranslations.com/tmlookup.php - TMLookup is an open-source tool to search very large translation memories (bilingual/multilingual databases), without a CAT tool. TMLookup can handle any number of languages and return search results in well under a second even if the database contains tens of millions of entries. It is by the author of LF Aligner.

https://www.xbench.net/ - terminology management tool that can handle lots of formats; current Unicode version is commercial, older non-Unicode one is free

(There are a lot of other resources out there for the CAT category, and I haven't listed any of the main commercial tool vendors; this is probably enough CAT resources for now, though I think CAT software and translation memories in general are very useful resources for language learners - the easiest way by far to use bilingual corpora.)

And of course there is http://www.laurenceanthony.net/software.html, which includes AntConc, a free corpus analysis toolkit and concordancer (monolingual); AntPConc, a freeware **parallel** corpus analysis toolkit for concordancing and text analysis using UTF-8 encoded text files; AntWordProfiler, which I mentioned above; ProtAnt, TagAnt, and many other cool tools, all very easy to install and use.
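For anyone who hasn't used a concordancer, the core idea is a keyword-in-context (KWIC) listing: every occurrence of a search word shown with a few words on either side. A bare-bones sketch of that idea follows. It has nothing to do with how AntConc itself is implemented; the kwic function and its width parameter are made up for illustration.

```python
import re

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) tuples for each hit."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            hits.append((left, tok, right))
    return hits

sample = "The cat sat on the mat. A cat is a small animal."
for left, kw, right in kwic(sample, "cat"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12} [{kw}] {right}")
```

Real concordancers add sorting by left or right context, regex search, and frequency statistics on top of this.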

https://github.com/tesseract-ocr

Have I left out anything obvious?