Spoken Language Corpora

Ruan
Yellow Belt
Posts: 99
Joined: Thu Aug 27, 2015 1:05 pm
Languages: Brazilian Portuguese (N), English
Language Log: viewtopic.php?f=15&t=1370

Spoken Language Corpora

Postby Ruan » Mon Sep 28, 2015 2:47 pm

As this is a resource that is rarely if ever mentioned, I thought it'd be worthy of a thread.

Linguists throughout the world are compiling spoken language corpora and making them available on the internet; some of them are available for download. Although they have compiled these corpora for their own research purposes, they're obviously very useful for language learners, as they provide exposure to everyday speech together with accurate annotated transcriptions.

I'll edit this post to include any language corpus somebody posts in this thread. If a significant number of links get posted, I'll create a wikia article on the subject. For the sake of convenience, classify your links as "Free", "Partially free" or "Paid" when posting, and also by language.

Free

American English
Santa Barbara Corpus of Spoken American English. About 23 hours of spontaneous conversation recordings.

Australian English
PYB007 is a class held by Australian universities for psychology students, in which students must record a conversation with somebody else to improve their interpersonal communication skills. The good news is that students are posting these conversations on YouTube. There are no transcriptions. The videos can be found with a Google search.

British English
Saarbrücken Corpus of Spoken English

Danish
SamTale Bank inside TalkBank (see Multilingual).

French
Projet PFC: includes readings of set texts and word lists + free-flowing conversation with transcripts + demographic data for participants, e.g. age, education, occupation, regions lived in throughout life, parents' occupations. Detail of the demographic data varies between countries. Covers a large swathe of the Francophone world.

Multilingual
OPUS, the open parallel corpus. This tends to lean heavily towards written sources, but it covers gigabytes of text in many major languages, all pre-aligned and often tagged. If you're specifically looking for speech data, the OPUS OpenSubtitles corpus contains an enormous number of movies and TV shows. Again, this will work better for larger languages, and the speech is almost all professional actors. But for a language like French, with a huge media presence and a tradition of high-quality dubbing, it's a potentially useful resource. Note that some programming is generally required; OPUS tends to format the corpus for use by machines and not by humans.

TalkBank. CABank has lots of languages, western and eastern.

Partially free

American English
Charlotte Narratives - New South Voices - J. Murrey Atkins Library Special Collections

Paid

Brazilian Portuguese
C-ORAL BRASIL

Corpora lists

English
Spoken Corpora of English
Corpora4Learning.net
Last edited by Ruan on Thu Oct 01, 2015 10:07 am, edited 5 times in total.

AlexTG
Green Belt
Posts: 299
Joined: Sat Jul 18, 2015 12:14 pm
Location: Tasmania, Australia
Languages: Easy to Read: English(N), French, Spanish
Able to Read: German, Latin
Learning to Read: Japanese, Hindi/Urdu

Re: Spoken Language Corpora

Postby AlexTG » Mon Sep 28, 2015 2:59 pm

Free
French
Projet PFC: includes readings of set texts and word lists + free-flowing conversation with transcripts + demographic data for participants, e.g. age, education, occupation, regions lived in throughout life, parents' occupations. Detail of the demographic data varies between countries. Covers a large swathe of the Francophone world.

emk
Black Belt - 1st Dan
Posts: 1708
Joined: Sat Jul 18, 2015 12:07 pm
Location: Vermont, USA
Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
Language Log: viewtopic.php?f=15&t=723

Re: Spoken Language Corpora

Postby emk » Mon Sep 28, 2015 3:07 pm

Also worth checking out: OPUS, the open parallel corpus. This tends to lean heavily towards written sources, but it covers gigabytes of text in many major languages, all pre-aligned and often tagged.

If you're specifically looking for speech data, the OPUS OpenSubtitles corpus contains an enormous number of movies and TV shows. Again, this will work better for larger languages, and the speech is almost all professional actors. But for a language like French, with a huge media presence and a tradition of high-quality dubbing, it's a potentially useful resource. Note that some programming is generally required; OPUS tends to format the corpus for use by machines and not by humans.
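
To give an idea of the kind of programming involved, here is a minimal Python sketch. It assumes you have downloaded a language pair as two plain-text, line-aligned files (one sentence per line, with line N of one file matching line N of the other). The file names and both helper functions are made up for illustration; they are not part of OPUS or any official tool, and other download formats would need different parsing.

```python
from itertools import islice


def read_aligned_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: line N of src_path is the translation of line N of tgt_path.
    Pairs where either side is empty are skipped.
    """
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        for src_line, tgt_line in zip(src, tgt):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            if src_line and tgt_line:
                yield src_line, tgt_line


def find_examples(pairs, phrase, limit=5):
    """Return up to `limit` pairs whose target side contains `phrase`.

    The match is a plain case-insensitive substring search, so it knows
    nothing about word boundaries or inflected forms.
    """
    phrase = phrase.lower()
    matches = (pair for pair in pairs if phrase in pair[1].lower())
    return list(islice(matches, limit))
```

With hypothetical files named `subtitles.en` and `subtitles.fr`, calling `find_examples(read_aligned_pairs("subtitles.en", "subtitles.fr"), "en train de")` would return a handful of English/French subtitle lines that use that phrase. Reading the files lazily keeps memory use low, which matters when the corpus runs to gigabytes.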

