Spoken Language Corpora

Post by **Ruan** » Mon Sep 28, 2015 2:47 pm

As this is a resource that is rarely if ever mentioned, I thought it'd be worthy of a thread.

Linguists all over the world are compiling spoken language corpora and making them available on the internet; some of them can also be downloaded. Although these corpora were compiled for research purposes, they're obviously very useful for language learners, as they provide exposure to everyday speech together with accurate annotated transcriptions.

I'll edit this post to add any language corpus that somebody posts in this thread. If a significant number of links get posted, I'll create a Wikia article on the subject. For the sake of convenience, classify your links as "Free", "Partially free" or "Paid" when posting, and also by language.

Free

American English
Santa Barbara Corpus of Spoken American English. About 23 hours of spontaneous conversation recordings.

Australian English
PYB007 is a class held by Australian universities for psychology students, in which students must record a conversation with somebody else to improve their interpersonal communication skills. The good news is that students are posting these conversations on YouTube. There are no transcriptions. Click here to google for these videos.

British English
Saarbrücken Corpus of Spoken English

Danish
SamtaleBank inside TalkBank (see Multilingual).

French
Projet PFC: includes readings of set texts and word lists + free-flowing conversation with transcripts + demographic data for participants, e.g. age, education, occupation, regions lived in throughout life, parents' occupations. Detail of the demographic data varies between countries. Covers a large swathe of the Francophone world.

Multilingual
OPUS, the open parallel corpus
Also worth checking out: OPUS, the open parallel corpus. This tends to lean heavily towards written sources, but it covers gigabytes of text in many major languages, all pre-aligned and often tagged. If you're specifically looking for speech data, the OPUS OpenSubtitles corpus contains an enormous number of movies and TV shows. Again, this will work better for larger languages, and the speech is almost all professional actors. But for a language like French, with a huge media presence and a tradition of high-quality dubbing, it's a potentially useful resource. Note that some programming is generally required; OPUS tends to format the corpus for use by machines and not by humans.

TalkBank. CABank has lots of languages, western and eastern.

Partially free

American English
Charlotte Narratives - New South Voices - J. Murrey Atkins Library Special Collections

Paid

Brazilian Portuguese
C-ORAL BRASIL

Corpora lists

English
Spoken Corpora of English
Corpora4Learning.net

Post by **AlexTG** » Mon Sep 28, 2015 2:59 pm

Free
French
Projet PFC: includes readings of set texts and word lists + free-flowing conversation with transcripts + demographic data for participants, e.g. age, education, occupation, regions lived in throughout life, parents' occupations. Detail of the demographic data varies between countries. Covers a large swathe of the Francophone world.

Post by **emk** » Mon Sep 28, 2015 3:07 pm

Also worth checking out: OPUS, the open parallel corpus. This tends to lean heavily towards written sources, but it covers gigabytes of text in many major languages, all pre-aligned and often tagged.

If you're specifically looking for speech data, the OPUS OpenSubtitles corpus contains an enormous number of movies and TV shows. Again, this will work better for larger languages, and the speech is almost all professional actors. But for a language like French, with a huge media presence and a tradition of high-quality dubbing, it's a potentially useful resource. Note that some programming is generally required; OPUS tends to format the corpus for use by machines and not by humans.
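To give an idea of what "some programming" can look like, here is a minimal sketch, not an official OPUS tool. It assumes you have downloaded a language pair as two plain-text files that are aligned line by line (line N of one file translates line N of the other). The file names in the comment and the helper names `aligned_pairs`, `find_examples` and `search_files` are made up for illustration; check the actual download for how the files are named and laid out.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def aligned_pairs(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> Iterator[Pair]:
    """Yield (source, target) pairs from two line-aligned sequences.

    Works on lists of strings or on open file objects. Pairs where either
    side is blank are skipped; zip() keeps the remaining lines in step.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt


def find_examples(pairs: Iterable[Pair], word: str, limit: int = 5) -> List[Pair]:
    """Return up to `limit` pairs whose source side contains `word` as a whole token."""
    word = word.lower()
    # Crude tokenisation: runs of letters/digits, so punctuation is ignored.
    hits = (p for p in pairs if word in re.findall(r"\w+", p[0].lower()))
    return list(islice(hits, limit))


def search_files(src_path: str, tgt_path: str, word: str, limit: int = 5) -> List[Pair]:
    """Search two aligned files on disk without loading them fully into memory."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return find_examples(aligned_pairs(src, tgt), word, limit)


# Hypothetical usage (file names are assumptions, adjust to your download):
# for fr, en in search_files("subtitles.fr", "subtitles.en", "viens"):
#     print(fr, "|", en)
```

Because the files are read lazily and the search stops after `limit` hits, this stays usable even on multi-gigabyte subtitle dumps. The whole-token match is deliberately crude; it will not find inflected forms of the word.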
