Lannang Corpus (LanCorp): A multilingual, POS-tagged, and audio-textual databank

The Lannang Corpus (LanCorp) is a sociolinguistic, POS-tagged, 375,000-word speech-and-text corpus of Lannang languages, based on audio recordings collected in metropolitan Manila between 2016 and 2020. It aims to provide scholars interested in Sino-Philippine (socio)linguistics with a contemporary, multilingual corpus (i.e., Hokkien, Tagalog, English, Lánnang-uè, Mandarin) compiled from recorded oral data collected primarily from, and by, a Sino-Philippine community in metropolitan Manila: the Manila Lannangs. The publicly available corpus contains manual transcriptions (time-aligned to the audio), source-language and part-of-speech tags derived using a mix of manual and computational methods, and a wide range of social metadata. It is also organized and stored systematically for easy data retrieval and (socio)linguistic analysis. Existing sociolinguistic corpora are small in scale and were not released publicly because informant consent was lacking; LanCorp fills this gap.
