Twitter Corpus of Philippine Englishes (TCOPE)
The Twitter Corpus of Philippine Englishes (TCOPE) is a 135-million-word corpus created from roughly 27 million public tweets sampled from 29 major cities in the Philippines. It is now available for download via OSF: https://osf.io/3q5pw/wiki/home/
Contains the following metadata (as reflected in tags)
twitter user id
month of tweet
year of tweet
unique corpus line id
Has different formats
hierarchical text format (txt): primed for concordance software such as AntConc and CasualConc
spreadsheet format (csv): primed for analysis using popular data analysis tools such as R and Python
Is tagged for part-of-speech using spaCy
Contains dependency parsing information derived from spaCy
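As a rough illustration of how the csv format and its metadata could be used in Python, here is a minimal sketch that counts tweets per year and per user. The column names (line_id, user_id, month, year, text) and the three sample rows are assumptions made up for this example; they are not the documented TCOPE header, so check the actual file before reusing them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: these column names and rows are assumptions for
# illustration only, not the real TCOPE csv header or data.
SAMPLE_CSV = """line_id,user_id,month,year,text
1,1001,3,2015,I learnt a lot today
2,1002,7,2015,That place is more cheaper than the mall
3,1001,1,2018,This movie is based from a true story
"""


def count_by(rows, field):
    """Count rows per value of one metadata field."""
    return Counter(row[field] for row in rows)


# csv.DictReader yields one dict per row, keyed by the header names.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
tweets_per_year = count_by(rows, "year")
tweets_per_user = count_by(rows, "user_id")

for year, n in sorted(tweets_per_year.items()):
    print(year, n)
```

To run this on a downloaded file, replace the in-memory sample with an open file handle and swap in the real column names. For a file of this size, it may be preferable to iterate over the reader row by row instead of loading everything into a list.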
I have an overview paper that has been accepted for publication in English World-Wide and is scheduled to appear in 2023. In that paper, I first discuss the considerations that went into TCOPE’s design, the compilation procedure, the format, and access. I then demonstrate how the corpus can be used to examine the linguistic features of Philippine English (PhilE), as well as the relationship between these features and other language-internal and language-external factors (e.g., ethno-geographic region, time, age, sex).
The paper focuses on four documented PhilE features: (1) the irregular past tense morpheme -t, (2) double comparatives, (3) subjunctive were in subordinate counterfactual clauses, and (4) the phrasal verb base from. A distributional analysis of these features that does not take other factors into account generally shows patterns similar to those reported in previous work. A deeper analysis using Bayesian multivariate regression revealed structured heterogeneity within PhilE, pointing to the multifaceted and dynamic nature of the variety.
Because of its large size, sampling distribution, and availability in different formats, TCOPE can be used to investigate ‘general’ contemporary PhilE as well as different types of variation within PhilE. It can broaden horizons in the diachronic and sociolinguistic study of Philippine English(es).
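To give a feel for what searching for two of these features might look like, here is a rough surface-pattern sketch for double comparatives and the phrasal verb base from. It is not the query used in the paper: the short list of comparatives and both regular expressions are assumptions for illustration, and a serious search would more likely rely on the part-of-speech and dependency tags already in the corpus.

```python
import re

# Assumption: a small hand-picked list of -er comparatives for illustration.
COMPARATIVES = ["better", "cheaper", "easier", "faster", "happier", "bigger"]

# "more" directly followed by one of the listed comparatives, e.g. "more cheaper".
DOUBLE_COMPARATIVE = re.compile(
    r"\bmore\s+(?:" + "|".join(COMPARATIVES) + r")\b", re.IGNORECASE
)

# Inflected forms of "base" directly followed by "from", e.g. "based from".
BASE_FROM = re.compile(r"\bbas(?:e|es|ed|ing)\s+from\b", re.IGNORECASE)


def find_features(text):
    """Return the matched strings for each feature in one tweet."""
    return {
        "double_comparative": [m.group(0) for m in DOUBLE_COMPARATIVE.finditer(text)],
        "base_from": [m.group(0) for m in BASE_FROM.finditer(text)],
    }


print(find_features("That place is more cheaper than the mall"))
print(find_features("This movie is based from a true story"))
```

A closed word list keeps false positives such as "more water" out, at the cost of missing any comparative that is not on the list.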
You can find the pre-print here.
Please cite the overview paper if you use my corpus or mention it in your work.
Gonzales, Wilkinson Daniel Wong. In press. Broadening horizons in the diachronic and sociolinguistic study of Philippine English with the Twitter Corpus of Philippine Englishes (TCOPE). English World-Wide, John Benjamins.