As this is a resource that is rarely if ever mentioned, I thought it'd be worthy of a thread.
Linguists all over the world are compiling spoken language corpora and making them available on the internet; some can also be downloaded. Although these corpora were compiled for research purposes, they're obviously very useful for language learners, as they provide exposure to everyday speech together with accurate, annotated transcriptions.
I'll edit this post to include any language corpus that somebody posts in this thread. If a significant number of links get posted, I'll create a Wikia article on the subject. For the sake of convenience, classify your links as "Free", "Partially free" or "Paid" when posting, and also by language.
Free
American English
Santa Barbara Corpus of Spoken American English. About 23 hours of spontaneous conversation recordings.
Australian English
PYB007 is a class held by Australian universities for psychology students, in which students must record a conversation with somebody else to improve their interpersonal communication skills. The good news is that students are posting these conversations on YouTube. There are no transcriptions. Click here to google for these videos.
British English
Saarbrücken Corpus of Spoken English
Danish
SamtaleBank, inside TalkBank (see Multilingual).
French
Projet PFC: includes readings of set texts and word lists, free-flowing conversation with transcripts, and demographic data for participants, e.g. age, education, occupation, regions lived in throughout life, parents' occupations. The level of detail in the demographic data varies between countries. Covers a large swathe of the Francophone world.
Multilingual
OPUS, the open parallel corpus
Also worth checking out: OPUS, the open parallel corpus. This tends to lean heavily towards written sources, but it covers gigabytes of text in many major languages, all pre-aligned and often tagged. If you're specifically looking for speech data, the OPUS OpenSubtitles corpus contains subtitles from an enormous number of movies and TV shows. Again, this will work better for larger languages, and almost all of the speech comes from professional actors. But for a language like French, with a huge media presence and a tradition of high-quality dubbing, it's a potentially useful resource. Note that some programming is generally required; OPUS tends to format the corpus for use by machines and not by humans.
TalkBank. CABank has lots of languages, western and eastern.
Partially free
American English
Charlotte Narratives - New South Voices - J. Murrey Atkins Library Special Collections
Paid
Brazilian Portuguese
C-ORAL BRASIL
Corpora lists
English
Spoken Corpora of English
Corpora4Learning.net
Spoken Language Corpora
Last edited by Ruan on Thu Oct 01, 2015 10:07 am, edited 5 times in total.
- AlexTG
Re: Spoken Language Corpora
Free
French
Projet PFC: includes readings of set texts and word lists, free-flowing conversation with transcripts, and demographic data for participants, e.g. age, education, occupation, regions lived in throughout life, parents' occupations. The level of detail in the demographic data varies between countries. Covers a large swathe of the Francophone world.
- emk
Re: Spoken Language Corpora
Also worth checking out: OPUS, the open parallel corpus. This tends to lean heavily towards written sources, but it covers gigabytes of text in many major languages, all pre-aligned and often tagged.
If you're specifically looking for speech data, the OPUS OpenSubtitles corpus contains subtitles from an enormous number of movies and TV shows. Again, this will work better for larger languages, and almost all of the speech comes from professional actors. But for a language like French, with a huge media presence and a tradition of high-quality dubbing, it's a potentially useful resource. Note that some programming is generally required; OPUS tends to format the corpus for use by machines and not by humans.
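To give a rough idea of what "some programming" can look like, here is a minimal Python sketch. It assumes you have downloaded a language pair as two plain-text files that are aligned line by line (line N of one file is the translation of line N of the other). The file names `subs.fr` and `subs.en` and the helper names `read_aligned_pairs` and `find_pairs` are made up for illustration; they are not part of OPUS itself.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def read_aligned_pairs(src_path: Path, tgt_path: Path) -> Iterator[Pair]:
    """Yield (source, target) pairs from two line-aligned text files.

    Assumption: both files have one sentence per line, in the same order.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            # Skip pairs where either side is empty.
            if src_line and tgt_line:
                yield src_line, tgt_line


def find_pairs(pairs: Iterable[Pair], phrase: str, limit: int = 10) -> List[Pair]:
    """Return up to `limit` pairs whose source side contains `phrase`
    (case-insensitive substring match)."""
    phrase = phrase.lower()
    matches = (pair for pair in pairs if phrase in pair[0].lower())
    return list(islice(matches, limit))


# Example usage (hypothetical file names):
# pairs = read_aligned_pairs(Path("subs.fr"), Path("subs.en"))
# for fr, en in find_pairs(pairs, "quand même"):
#     print(fr, "|", en)
```

Because the files are read lazily, this should cope with very large downloads; searching stops as soon as `limit` matches have been found.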