Mapping Urban Multilingualism through Twitter

Enrique Manjavacas & Ben Verhoeven

http://emanjavacas.github.io/slides/

1 Research question

  • How to gain direct insight into actual language use in urban landscapes
  • Linguistic landscaping? (e.g. Vandenbroucke 2014)
  • Language use is neither easily observed nor quantified

2 Dataset

2.1 Data collection

 "place":{
    "country":"United States",
    "place_type":"city",
    "country_code":"US",
    "bounding_box":{
        "type":"Polygon",
        "coordinates":[[[-84.647561,37.031352],[-84.564839,37.031352],
                         [-84.564839,37.117608],[-84.647561,37.117608]]]
    },
    "full_name":"Somerset, KY",
    "name":"Somerset",
    "id":"0c610ec760ff6a57"
 }
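A minimal sketch of how the bounding box in such a "place" object could be used to check whether a coordinate falls inside a city box. The helper names `bbox_extent` and `in_bbox` are hypothetical, and the JSON is a trimmed copy of the example above; the slides do not specify how the collection filter was implemented.

```python
import json

# Trimmed copy of the "place" object shown above (illustration only).
PLACE_JSON = """
{"full_name": "Somerset, KY",
 "bounding_box": {"type": "Polygon",
   "coordinates": [[[-84.647561, 37.031352], [-84.564839, 37.031352],
                    [-84.564839, 37.117608], [-84.647561, 37.117608]]]}}
"""

def bbox_extent(place):
    """Return (min_lon, min_lat, max_lon, max_lat) of a place's bounding box."""
    ring = place["bounding_box"]["coordinates"][0]  # outer ring, [lon, lat] pairs
    lons = [lon for lon, _lat in ring]
    lats = [lat for _lon, lat in ring]
    return min(lons), min(lats), max(lons), max(lats)

def in_bbox(lon, lat, extent):
    """True if the point (lon, lat) lies inside the extent (borders included)."""
    min_lon, min_lat, max_lon, max_lat = extent
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

place = json.loads(PLACE_JSON)
extent = bbox_extent(place)
```

Note that the coordinates are stored in longitude, latitude order.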

-> Amsterdam: highly international

-> Antwerp: geographically small but highly multilingual

-> Berlin: large spread and decentralized

-> Brussels: bilingualism

2.2 Dataset size

Mining since December 2014

City        Number of tweets   Box size
Amsterdam   679205             23,44
Antwerp     415813             83,03
Berlin      691998             45,01
Brussels    497667             46,94
Total       2284683

  • Twitter API offers access to ~1% of all tweets
  • Georeferenced tweets constitute 1.6% of the stream (Leetaru et al. 2013)

2.3 Data processing

2.3.1 Language identification

  • High accuracy needed
  • Twitter’s own identification system is not reliable enough (it favours speed over reliability)
  • Implementation of the setup in Lui & Baldwin (2014)
  • Removal of Twitter-specific patterns: #hashtags, @usernames, and links (http://links.co)
  • Majority vote on three systems:

System       Languages
langid.py    97
CLD2         > 80
LangDetect   53
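The cleaning and voting steps above can be sketched as follows. This is an illustration under assumptions, not the actual pipeline: the regular expression, the function names, and the choice to return None when all three systems disagree are hypothetical, and the three identifiers themselves are not called here (their outputs are passed in as plain labels).

```python
import re
from collections import Counter

# Patterns for Twitter-specific tokens; the exact expressions are assumptions.
TWITTER_NOISE = re.compile(r"https?://\S+|[@#]\w+")

def clean_tweet(text):
    """Strip links, @usernames and #hashtags, then collapse whitespace."""
    return " ".join(TWITTER_NOISE.sub(" ", text).split())

def majority_vote(predictions):
    """Return the label chosen by more than half of the systems, else None."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None
```

With three systems, a label is accepted when at least two of them agree on it.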

2.3.2 Bot detection

angrybot.png

The only known available system is Bot or Not? (no API)

  • Bots generally tweet more than humans (Chu et al. 2010)
  • Bots' tweets are more evenly distributed across time (ibid.)
  • Look for the 100 accounts with the largest number of tweets in the database
  • 16 bots accounting for 209,648 (1/6) tweets in the Berlin subset.
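A rough sketch of the two cues listed above: ranking accounts by volume and scoring how evenly an account posts across the day. The input format (pairs of user and posting hour), the function names, and the use of normalised entropy as an evenness score are assumptions made for illustration; the slides only state that the 100 most productive accounts were inspected.

```python
import math
from collections import Counter

def top_accounts(tweets, n=100):
    """Return the n (user, count) pairs with the most tweets.

    `tweets` is an iterable of (user, hour_of_day) pairs (hypothetical format).
    """
    return Counter(user for user, _hour in tweets).most_common(n)

def hourly_evenness(hours):
    """Normalised entropy of posting hours: 1.0 = perfectly even, 0.0 = one hour."""
    counts = Counter(hours)
    total = sum(counts.values())
    entropy = -sum(c / total * math.log(c / total) for c in counts.values())
    return entropy / math.log(24)
```

Accounts that rank high on both volume and evenness would then be candidates for manual inspection.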

3 Validation

  • Gap between Twitter data and an external statistic based on real-world values
  • Difficult to find bias-free dataset of language use in cities

3.1 Problems

Double inference

  • From nationality to native language
  • From native language to actual Twitter language

Overrepresentation of prolific users/languages

  • English

Underrepresentation of certain languages

  • National Twitter-usage patterns (e.g. Romanian)
  • Illiteracy (minor communities, e.g. Somali)

illiteracy.png

(from urbanmovements.co.uk)

3.1.1 Dataset

http://statistik-berlin-brandenburg.de/

"Nationality","District","Migratory_background","Foreigners"
"fr","01 Mitte",782,3262
"fr","0101 Zentrum",385,1478
"fr","010111 Tiergarten Süd",63,218
"fr","01011101 Stülerstraße",9,30
"fr","01011102 Großer Tiergarten",0,3
"fr","01011103 Lützowstraße",35,94
"fr","01011104 Körnerstraße",16,79
"fr","01011105 Nördlicher Landwehrkanal",3,12
"fr","010112 Regierungsviertel",22,103

3.2 Multigroup segregation measures

  • Correlate neighbourhood segregation scores across datasets

3.2.1 Segregation measures

  • Used in segregation studies to compare cities by income, race…
  • Size-weighted average difference between the total disproportionality and the disproportionality in each organizational unit, divided by the total disproportionality
  • General mathematical definition (as in Reardon & Firebaugh 2002)

\(H = \sum\limits_{j=1}^J \frac{t_j}{TE}(E - E_j)\)

  • E: disproportionality function (e.g. entropy)
  • \(t_j\): number of individuals in organizational unit j
  • \(E_j\): entropy in organizational unit j
  • \(E\): total entropy (across organizational units)
  • \(T\): total number of individuals
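A small sketch of the formula above with entropy as the disproportionality function. The data layout (one list of group counts per organizational unit) and the function names are assumptions for illustration.

```python
import math

def entropy(counts):
    """Shannon entropy of group proportions (natural log, 0*log(0) taken as 0)."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def entropy_index_h(units):
    """H = sum_j t_j / (T * E) * (E - E_j).

    `units` holds one list of group counts per organizational unit.
    Requires at least two non-empty groups overall, otherwise E is 0.
    """
    totals = [sum(groups) for groups in zip(*units)]  # citywide count per group
    T = sum(totals)                                   # total number of individuals
    E = entropy(totals)                               # total entropy
    return sum(sum(u) / (T * E) * (E - entropy(u)) for u in units)
```

Two units that each contain only one group give H = 1 (full segregation); two units with the same mix as the whole city give H = 0.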

3.2.2 Results

segregation_corr.png

\(R^2 = 0.87\)
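For reference, a sketch of how an \(R^2\) of this kind can be obtained from two lists of per-neighbourhood segregation scores, assuming it is the squared Pearson correlation. The function name and input format are hypothetical, and the slides do not state how the reported value was computed.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)
```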

4 Language-based mobility patterns

Can we establish mobility patterns based on the language of tweets?

4.1 Sampling

  • Discard users with fewer than 35 tweets in a single language
  • Discard users whose l1/l2 ratio is below 4:1

Language   en    ar   es   ru   pt   tr    de
# Users    592   60   64   42   37   111   747

  • Samples of size ~35 users per language, ~35 points per user
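The two discard rules can be sketched as a single filter over the language labels of one user's tweets. The function name, the default thresholds as parameters, and the decision to keep users who never tweet in a second language are assumptions.

```python
from collections import Counter

def dominant_language(langs, min_tweets=35, min_ratio=4.0):
    """Return the user's main language if they pass both filters, else None."""
    ranked = Counter(langs).most_common(2)
    if not ranked:
        return None
    l1, n1 = ranked[0]
    n2 = ranked[1][1] if len(ranked) > 1 else 0
    if n1 < min_tweets:
        return None  # fewer than 35 tweets in any single language
    if n2 and n1 / n2 < min_ratio:
        return None  # second language too frequent (ratio below 4:1)
    return l1
```

A user with 40 Turkish and 10 German tweets is kept as "tr" (ratio exactly 4:1), while 40 against 11 is discarded.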

4.2 Computations

  • Estimate mobility as the root-mean-square haversine distance of a user's tweet locations \(a_i\) from the user's centroid \(\bar{a}\)

\(\sqrt{\frac{1}{n}\sum\limits_{i=1}^n \operatorname{haversin}(a_i, \bar{a})^2}\)
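A minimal sketch of this mobility estimate. The Earth radius constant (kilometres), the (lat, lon) point format, and the use of a plain arithmetic mean as centroid are assumptions; the mean is a reasonable approximation at city scale but not in general.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes results in kilometres

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def mobility(points):
    """Root-mean-square haversine distance of points from their centroid."""
    n = len(points)
    centroid = (sum(lat for lat, _lon in points) / n,
                sum(lon for _lat, lon in points) / n)  # arithmetic mean
    return math.sqrt(sum(haversine_km(p, centroid) ** 2 for p in points) / n)
```

A user who always tweets from the same spot gets a mobility of 0.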

4.3 Results

One-way ANOVA

:              Df Sum Sq Mean Sq F value   Pr(>F)    
: lang          6  164.6  27.437   4.138 0.000516 ***
: Residuals   308 2042.4   6.631                     
: ---
: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
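In the table above, the F value is the ratio of the two mean squares (27.437 / 6.631 ≈ 4.138), with 6 degrees of freedom for the 7 languages and 308 for the residuals. A from-scratch sketch of that computation follows; the function name and the input format (one list of per-user mobility values per language) are assumptions, and the output above comes from a statistics package, not from this code.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = len(groups) - 1, n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within
```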

mean_error_bars.png

Pairwise t-tests

Language pair   p-value
tr de           0.0002
ar de           0.0003
tr ar           0.9903
pt es           0.9518

5 Visualization

antwerp.png

6 Future work

6.1 Analyse locality patterns

  • How tightly clustered are speakers of the same language?
  • Consider multiple clusters (language centroids)

6.2 Tourism detection

  • Assign users to the country from which the majority of their tweets came (Hawelka et al. 2014)
  • Focused sampling of user timelines
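The assignment rule could be sketched as below. The function names and the input format (one country code per tweet in a user's timeline) are assumptions, and the most frequent country is used as a stand-in for "majority"; this is not the procedure of Hawelka et al., only an illustration of the idea.

```python
from collections import Counter

def home_country(country_codes):
    """Country code that accounts for most of a user's tweets (None if empty)."""
    if not country_codes:
        return None
    return Counter(country_codes).most_common(1)[0][0]

def is_tourist(country_codes, city_country):
    """Flag a user whose most frequent country differs from the city's country."""
    home = home_country(country_codes)
    return home is not None and home != city_country
```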

num_ids_date.png

7 The End

Thank you for your attention

  • Reardon, S. F. & Firebaugh, G. (2002). Measures of multigroup segregation. Sociological Methodology.
  • White, M. J. (1983). The measurement of spatial segregation. American Journal of Sociology.
  • Scheffler, T. (2014). Mapping German Tweets to Geographic Regions. KONVENS.
  • Leetaru, K. (2013). Mapping the global Twitter heartbeat: The geography of Twitter. First Monday.
  • McCandless, M. (2010). Accuracy and performance of Google’s compact language detector.
  • Lui, M. & Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. Proceedings of the ACL 2012.
  • Vandenbroucke, M. (2014). Language visibility, functionality and meaning across various TimeSpace scales in Brussels’ multilingual landscapes.