I thought I would offer a few additions to reineke's list of language tools at viewtopic.php?f=19&t=2900. I have my own collection of bookmarks and notes on Natural Language Processing (NLP) tools; this includes some Dutch support per zhuzilu's recent question.
"Tool" is a broad concept; these do a wide variety of useful things. Some of these are ready to use or even have Web interfaces; some are software libraries meant to support other tools, in a variety of programming languages. The range of natural languages covered out-of-the-box by various tools varies wildly. Some of these links I probably found here. I note them as I come across them and have not attempted to organize them. This list probably barely scratches the surface; there is a lot out there, with more coming all the time. I'm still exploring, and will probably end up integrating some of the tools below with mine (which is the main reason I've been exploring - I don't want to recreate components that are already available).
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
Frog's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, will assign a dependency graph to each sentence, will identify the base phrase chunks in the sentence, and will attempt to find and label all named entities.
See also https://proycon.github.io/LaMachine/. Dutch tools available in a VirtualBox image, etc.
https://cloud.google.com/natural-language/ - Google Cloud Natural Language API provides analysis of entities, sentiment, and syntax, including POS, lemma, and morphology for several languages; nice sentence diagrams! Supports Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Spanish. Requires a Google Cloud account.
https://aws.amazon.com/polly/ - high-quality TTS voices for 24 languages; requires Amazon Web Services account.
Amazon Lex - for building voice and text chatbots.
https://translate.google.com/ - machine translation for about 100 languages, I think (maybe too well known to list)
https://www.wordsapi.com/ - English language only
Voyant Tools is a web-based reading and analysis environment for digital texts; supports 10 languages, including Japanese.
http://corpus.tools/ - links to a number of software tools for corpus processing
http://radimrehurek.com/gensim/ - Gensim is a Python library for topic modeling, etc.
https://nlp.fi.muni.cz/trac/noske - NoSketch Engine, an open-source project combining Manatee and Bonito into a powerful and free corpus management system. NoSketch Engine is a limited version of the software that powers the well-known Sketch Engine service, a commercial variant offering word sketches, thesaurus, keyword computation, user-friendly corpus creation and many other excellent features. It is a corpus management tool that covers corpus building, indexing, and fast querying, and provides basic statistical measures. It uses a fast indexing library called Finlib.
https://spacy.io/ - "industrial-strength natural language processing"; Python library; features include tokenization, sentence segmentation, word vectors, POS tagging, named entity recognition; statistical models for English and German
http://www.danielnaber.de/morphologie/ - German morphological analysis
https://code.google.com/archive/p/berkeleyaligner/ Berkeley word aligner
https://github.com/clab/fast_align - Simple, fast unsupervised word aligner
http://arabic.emi.ac.ma/safar/ - platform dedicated to ANLP (Arabic Natural Language Processing)
https://www.microsoft.com/en-us/researc ... -learners/ - MS blog post on Engkoo software aids for Chinese learners of English; “It unifies human translation mined from the web, machine translation, and a language-learning experience into one user-friendly search-and-explore interface.... The technology itself is language independent and can be extended to other language pairs in the future.” Something to keep an eye on.
https://console.ng.bluemix.net/catalog/ ... -to-speech
http://mdn.github.io/web-speech-api/spe ... synthesis/
dsl2mobi (inflection word lists)
http://www.wordscope.com - sentence pairs linked to TED etc.
Pattern (Python 2.7 NLP library) supports Dutch, English, Spanish, German, French and Italian.
NLTK, Natural Language Toolkit, the main Python library for NLP - http://www.nltk.org/
https://docs.google.com/document/d/1rdn ... 2prxyeds5k - German NLTK; GermaNLTK is an integration of GermaNet and Projekt Deutscher Wortschatz into NLTK. GermaNet is a semantically-oriented dictionary of German, similar to WordNet.
http://www.nltk.org/howto/wordnet.html - NLTK's WordNet interface
http://globalwordnet.org/wordnets-in-the-world/ - links to WordNets for many languages, some restricted access; http://wordpress.let.vupr.nl/odwn/ is one for Dutch with 117914 synsets.
https://gate.ac.uk/ - GATE open source NLP framework, very powerful
http://nlp.lsi.upc.edu/freeling/node/1 - FreeLing open source language analysis tool suite; a C++ library providing language analysis functionalities (morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense Disambiguation, Semantic Role Labelling, etc.) for a variety of languages (English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, among others).
http://www.opener-project.eu/ - EU project, OpeNER aims to be able to detect and disambiguate entity mentions and perform sentiment analysis and opinion detection on the texts, to be able for example, to extract the sentiment and the opinion of customers; 6 languages supported
https://www.sketchengine.co.uk/ - Sketch Engine contains 400 ready-to-use corpora in 90+ languages, each having a size of up to 20 billion words; lots of tools - alignment, bilingual term extraction, thesaurus, concordance, etc. A word sketch is a one-page summary of the word’s grammatical and collocational behaviour. It shows the word’s collocates categorised by grammatical relations such as words that serve as an object of the verb, words that serve as a subject of the verb, words that modify the word etc.
https://www.sketchengine.co.uk/skell/ - SkELL (Sketch Engine for Language Learning) is a simple tool for students and teachers of English to easily check whether or how a particular phrase or a word is used by real speakers of English. Also for Russian
http://www.statmt.org/mosescore/index.p ... esReleases - Moses is the main open source machine translation system
https://cwiki.apache.org/confluence/dis ... ng%29+Home
Joshua is another machine translation system, now an Apache project. For prebuilt language packs for a number of language pairs, see https://cwiki.apache.org/confluence/dis ... uage+Packs.
http://wiki.apertium.org/wiki/Main_Page - Apertium open-source machine translation; stable releases for 43 language pairs. http://wiki.apertium.org/wiki/Apertium_Simpleton_UI - simple interface for non-developers, for Windows and Mac OS X
https://hfst.github.io/ - HFST, Helsinki Finite-State Transducer toolkit is intended for processing natural language morphologies. The toolkit is demonstrated by wide-coverage implementations of a number of languages of varying morphological complexity. Python API
"Natural" is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, string similarity, and some inflections are currently supported. Mostly for English, but Russian and Spanish stemming supported
retext is an ecosystem of plug-ins for processing natural language.
https://github.com/nlp-compromise/compromise - NLP for English
http://terrier.org/ - open source search engine, supports multilingual corpora
https://meta-toolkit.org/ - modern C++ data sciences toolkit, includes language model support, UTF8 support for analysis on various languages; https://meta-toolkit.org/profile-tutorial.html has an overview of basic text analysis.
http://staffwww.dcs.shef.ac.uk/people/A ... jects.html
POS Tagger and Lemmatizer for English, Dutch, French, German and Italian
POS Tagger and Lemmatizer for Spanish
https://nbviewer.jupyter.org/github/bra ... _web.ipynb - document clustering with Python
https://github.com/oscii-lab/lex - used by Lilt to build neural morphology engines
https://github.com/interrogator/corpkit - toolkit for corpus linguistics
lttoolbox, associated with the Apertium machine translation project, is a fast (~58000 words/second) Free and Open Source finite-state toolkit.
lttoolbox is a toolbox for lexical processing, morphological analysis and generation of words.
Apy Apertium server in Python
Paper Machines, an add-on for the bibliographic management software Zotero, incorporates a range of text analysis tools into your web browser; for the digital humanities. https://github.com/papermachines/papermachines
http://alpage.inria.fr/~sagot/wolf.html French WordNet
http://alpage.inria.fr/~sagot/lefff.html lexique morphologique et syntaxique
Morfette is a tool for supervised learning of inflectional morphology. Given a corpus of sentences annotated with lemmas and morphological labels, and optionally a lexicon, morfette learns how to morphologically analyse new sentences.
https://open.xerox.com/Services/fst-nlp-tools, which includes
https://open.xerox.com/Services/fst-nlp ... alysis-176
This tool produces a full morphological analysis of the submitted text for Czech, English, French, German, Spanish, Hungarian, Italian, Polish and Russian.
Part of speech tagging is also available; this tool assigns a part of speech (POS) tags to each word of the input text.
https://open.xerox.com/Services/fst-nlp ... API%20Docs has instructions on how to call the API, but there is a Web interface at the fst-nlp-tools link.
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
Apache Stanbol does text analysis and enrichment, using Apache OpenNLP - https://stanbol.apache.org/docs/trunk/c ... ancer/nlp/. There's a list of supported languages at https://stanbol.apache.org/docs/trunk/c ... -languages.
https://github.com/exactmike/OutSpeech - a PowerShell script to let Windows 10 users use system TTS voices from the command line - can be integrated with other tools
http://www.laurenceanthony.net/software ... dprofiler/
A freeware tool for profiling the vocabulary level and complexity of texts; best support for English, based on tools from Paul Nation's site.
http://www.victoria.ac.nz/lals/about/st ... b-programs - vocabulary analysis programs for English
https://jprocessing.readthedocs.io/en/latest/ - Japanese NLP library
http://guides.library.upenn.edu/japanesetext - overview of Japanese text analysis tools
https://nlpub.ru/MaltParser for Russian morphological analysis.
Frequency distribution of words in texts. Tokenize, remove stopwords, stem words, count words.
Supported languages: da, de, en, es, fi, fr, hu, it, nl, no, pt, ro, ru, se, tr.
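For anyone curious what that tokenize / remove stopwords / stem / count pipeline amounts to, here is a minimal Python sketch using only the standard library. It is not the listed tool's code: the stopword list and the `crude_stem` helper are hypothetical stand-ins, and a real tool would ship proper stopword lists and stemmers for each supported language.

```python
import re
from collections import Counter

# Hypothetical, tiny English stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_frequencies(text):
    """Tokenize, drop stopwords, stem, and count the resulting stems."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    stems = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(stems)

freqs = word_frequencies("The cats chased the cat and the dogs are chasing cats.")
print(freqs.most_common(3))
```

Swapping in a language-specific stopword list and stemmer is the part the real tools do for you.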
An efficient Chinese text segmentation tool
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
Overview is a tool for analyzing large sets of documents. It includes a sophisticated search engine, word clouds, entity detection, and topic-based document clustering. If that’s not good enough, you can write your own plugins using the API. https://www.overviewdocs.com/
Textable is an open source program for text analysis. It offers a set of basic text-analytic components (e.g. import text from files, segment into words, measure segment diversity, etc.), which the user combines using a visual interface to build custom analytic workflows.
https://pypi.python.org/pypi/unitexlemmatizer/1.0.0 for Portuguese and etc.?
Italian NLP: https://github.com/jacopofar/italian-nl ... y/releases
For Russian morphological analysis, lemmatization, etc. see phpmorphy, https://sourceforge.net/projects/phpmorphy/. Also supports English and German.
Lemmatization (obtaining the normal form of a word);
Obtaining all forms of a word;
Obtaining grammatical information for a word (part of speech, case, conjugation, etc.);
Changing the form of a word according to specified grammatical characteristics;
Changing the form of a word following a given pattern.
Tokenize, POS Tagger, lemmatizer and stemmer - I use this one. The author is very responsive - a bug I asked about was fixed within hours.
stopwords-iso - comprehensive collection! probably easier than using all the others individually
kuroshiro.js is a Japanese language utility mainly for converting Kanji-mixed sentences to Hiragana, Katakana or Romaji, with furigana and okurigana modes supported.
nihongo.js has methods for analysing characters and sentences and parsing sentences.
A command line sidekick app for reading Japanese with the Jisho.org dictionary
A Japanese language parser producing NLCST nodes.
A micro-library for declining words in Russian
A JS library for morphological analysis, tokenization, and other NLP tasks for Russian.
verbix (scraper module)
german - command-line verb conjugator, tester
npm install germansynonyms
Detect the ease of reading a text according to the German variation of the Flesch Reading Ease Formula
Conjugation of irregular verbs in German.
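As a rough illustration of a German Flesch score, here is a Python sketch of the commonly cited German adaptation (Amstad's): 180 minus the average sentence length in words, minus 58.5 times the average syllables per word. Whether the listed package uses exactly this variant is an assumption on my part, and the syllable counter (one syllable per run of vowels) is a crude approximation.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels; crude, but dependency-free."""
    groups = re.findall(r"[aeiouyäöü]+", word.lower())
    return max(1, len(groups))

def flesch_german(text):
    """Amstad's German adaptation: 180 - ASL - 58.5 * ASW."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                          # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 180 - asl - 58.5 * asw

print(flesch_german("Das ist gut. Wir gehen heim."))
```

Higher scores mean easier text; very short, simple sentences like the sample can score above 100.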
German dictionary in terminal, powered by Wiktionary.
npm install parse-japanese
Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation.
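To give a feel for what kana-to-Romaji conversion involves, here is a deliberately incomplete Python sketch (not the listed Node.js module's API). It covers only the basic unvoiced kana, with the Hepburn spellings shi, chi, tsu and fu; voiced kana, small-kana digraphs, the sokuon and long vowels are left out, and unknown characters pass through unchanged.

```python
VOWELS = "aiueo"
# Basic unvoiced rows only; each row lists the kana for a, i, u, e, o.
ROWS = {
    "": "あいうえお",
    "k": "かきくけこ",
    "s": "さしすせそ",
    "t": "たちつてと",
    "n": "なにぬねの",
    "h": "はひふへほ",
    "m": "まみむめも",
    "r": "らりるれろ",
}

HEPBURN = {}
for consonant, kana_row in ROWS.items():
    for kana, vowel in zip(kana_row, VOWELS):
        HEPBURN[kana] = consonant + vowel

# Hepburn irregular spellings, plus the y-row, wa and n.
HEPBURN.update({"し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu",
                "や": "ya", "ゆ": "yu", "よ": "yo", "わ": "wa", "ん": "n"})

def katakana_to_hiragana(text):
    """Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts."""
    return "".join(chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text)

def to_romaji(text):
    """Romanize basic kana; anything not in the table is passed through."""
    return "".join(HEPBURN.get(ch, ch) for ch in katakana_to_hiragana(text))

print(to_romaji("さくら"), to_romaji("サクラ"), to_romaji("つなみ"))
```

The real libraries handle the many special cases this skips, which is most of the work.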
Create gloss for Japanese texts (based on Kuromoji tokenizer) https://kuromoji.fluentcards.com
A Japanese verb conjugator and unconjugator
Hackbright Project: OCR for Japanese
Library for automatically rendering Furigana for input Japanese text.
Japanese text difficulty analyzer
Japanese morphological analysis app built with Electron
Japanese language support for retext.
Natural language processor powered by plugins
Yet another Japanese morphological analyzer
KoNLPy is a Python package for natural language processing of the Korean language.
R package for Korean NLP
https://dev.mysql.com/doc/refman/5.7/en ... mecab.html
A simple dictionary for Korean, powered by National Institute of the Korean Language
A free Japanese dictionary and learning assistant http://www.tagaini.net
The Stanford Word Segmenter currently supports Arabic and Chinese. (The Stanford Tokenizer can be used for English, French, and Spanish.)
The Stanford Word Segmenter is incorporated into nltk's tokenize package.
Stanford CoreNLP - Arabic, Chinese, English, French, German, Spanish
Other people have developed models using or compatible with CoreNLP for several further languages. They may or may not be compatible with the most recent release of CoreNLP that we provide.
Italian: Tint by Alessio Palmero Aprosio and Giovanni Moretti (Fondazione Bruno Kessler) largely builds on CoreNLP, but adds some other components, to provide a quite complete processing pipeline for Italian.
Portuguese (European): LX parser by Patricia Gonçalves and João Silva (University of Lisbon) provides a constituency parser. It was built with a now quite old version of Stanford NLP.
Swedish: Andreas Klintberg has built an NER model and a POS tagger.
Stanford Log-linear Part-Of-Speech Tagger
The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.
http://johnlaudun.org/20170228-open-sou ... s-for-nlp/
https://github.com/Planeshifter/node-wordnet-magic English lemmatizer etc.
https://github.com/maikudou/iso639-js language codes as JSON
http://kaapstorm.com/post/html-in-json-out/ Python script to get JSON for codes
https://sourceforge.net/projects/hannanum/files/ HanNanum Korean Morphological Analyzer & POS Tagger (Java version)
CST's lemmatizer uses affix rules (affix: prefix, infix, suffix, circumfix) and has been trained for a number of languages. Trained affix rules are available for the following languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, Farsi, French, German, Greek, Hungarian, Icelandic, Italian, Latin, Macedonian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, and Ukrainian.
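The idea of rule-based lemmatization can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles suffixes only (not prefixes, infixes or circumfixes), the rules in `SUFFIX_RULES` are hypothetical hand-written examples rather than CST's trained rules, and the longest matching suffix wins.

```python
# Hypothetical suffix rules (suffix -> replacement) for English, for illustration only.
SUFFIX_RULES = [
    ("ies", "y"),
    ("sses", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def lemmatize(word, rules=SUFFIX_RULES):
    """Apply the longest matching suffix rule; return the word unchanged if none match."""
    word = word.lower()
    best = None
    for suffix, replacement in rules:
        # Require at least two characters of stem so very short words are untouched.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            if best is None or len(suffix) > len(best[0]):
                best = (suffix, replacement)
    if best is None:
        return word
    suffix, replacement = best
    return word[: len(word) - len(suffix)] + replacement

for w in ("studies", "classes", "walked", "cats", "is"):
    print(w, "->", lemmatize(w))
```

A trained rule set replaces the hand-written table with thousands of rules learned from a lexicon, which is what makes the approach work across so many languages.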
https://github.com/ecirtap/docker-flemm French lemmatizer
TreeTagger (recommended): supports fr, en, es, de, ru, da
wrapper for Russian morphological analyzer
pymorphy2 is a morphological analyzer and generator for Russian and Ukrainian languages.
free morphological analyzer for Turkish
Omorfi–Open source morphological analyzer of Finnish
AraMorph is a Java port of the homonym product developed in Perl by Tim Buckwalter on behalf of the Linguistic Data Consortium (LDC)
Perstem - a stemmer and light morphological analyzer for Persian
https://github.com/FudanNLP/fnlp Chinese NLP toolkit
textsum: Text summarization with TensorFlow
blog: https://research.googleblog.com/2016/08 ... rflow.html
github: https://github.com/tensorflow/models/tr ... er/textsum
How to Run Text Summarization with TensorFlow
blog: https://medium.com/@surmenok/how-to-run ... .mll1rqgjg
There are a ton of ready-made Docker images to host various language tools, which might be the easiest way to install some of them. These include:
KNP, a Japanese Dependency and Case Structure Analyzer.
It also contains JUMAN, a User-Extensible Morphological Analyzer for Japanese.
DIT4C is a platform for hosting data analysis tools "in the cloud" using containers. https://dit4c.github.io/
You can probably find more on Docker Hub.
https://sourceforge.net/projects/aligner/ - LF Aligner creates translation memories from parallel texts.
http://www.omegat.org/en/omegat.html - OmegaT Computer-Assisted Translation software for translators (fast and robust)
https://github.com/heartsome/translationstudio8 - formerly commercial Computer-Assisted Translation software, now open-source
http://felix-cat.com/ - formerly commercial Computer-Assisted Translation software, now free
http://okapiframework.org/ - set of tools to support localization and translation processes
http://www.farkastranslations.com/tmlookup.php - TMLookup is an open-source tool to search very large translation memories (bilingual/multilingual databases), without a CAT tool. TMLookup can handle any number of languages and return search results in well under a second even if the database contains tens of millions of entries. by the author of LF Aligner
https://www.xbench.net/ - terminology management tool that can handle lots of formats; current Unicode version is commercial, older non-Unicode one is free
(There are a lot of other resources out there for the CAT category, and I haven't listed any of the main commercial tool vendors; this is probably enough CAT resources for now, though I think CAT software and translation memories in general are very useful resources for language learners - the easiest way by far to use bilingual corpora.)
And of course there is http://www.laurenceanthony.net/software.html, which includes Antconc, a free corpus analysis toolkit and concordancer (monolingual); AntPConc, a freeware **parallel** corpus analysis toolkit for concordancing and text analysis using UTF-8 encoded text files; AntWordProfiler, which I might have mentioned above; ProtAnt, TagAnt, and many other cool tools, all very easy to install and use.
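For readers who have not used a concordancer, the core keyword-in-context (KWIC) idea can be sketched in Python as below. This is not how AntConc is implemented; the `kwic` function and its `width` parameter are names made up for the illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each match of keyword."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            hits.append((left, token, right))
    return hits

sample = "The cat sat on the mat. Then the cat slept near the door."
for left, kw, right in kwic(sample, "cat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>15} [{kw}] {right}")
```

Lining up the keyword in a column is what makes collocation patterns easy to spot by eye.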
Have I left out anything obvious?