Mapping Urban Multilingualism through Twitter

Enrique Manjavacas & Ben Verhoeven

http://emanjavacas.github.io/slides/

1. Research question
2. Dataset
3. Validation
4. Language-based mobility patterns
5. Visualization
6. Future work
7. The End

1 Research question

How can we gain direct insight into actual language use in urban landscapes?

Linguistic landscaping? (e.g. Vandenbroucke 2014)

Language use is neither easily observed nor quantified

Why not use Twitter? Geolocation available since November 2009

2 Dataset

2.1 Data collection

 "place":{
    "country":"United States",
    "place_type":"city",
    "country_code":"US",
    "bounding_box":{
        "type":"Polygon",
        "coordinates":[[[-84.647561,37.031352],[-84.564839,37.031352],
                         [-84.564839,37.117608],[-84.647561,37.117608]]]
    },
    "full_name":"Somerset, KY",
    "name":"Somerset",
    "id":"0c610ec760ff6a57"
 }

Use the Twitter Streaming API to collect data for four cities:

-> Amsterdam: highly international

-> Antwerp: geographically small but highly multilingual

-> Berlin: large spread and decentralized

-> Brussels: bilingualism
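As a rough illustration of how the "place" object shown above can be used to restrict tweets to a city, the sketch below extracts the bounding box and tests whether a point falls inside it. The helper names `bounding_box` and `contains` are hypothetical, and the JSON is a trimmed copy of the Somerset example wrapped in braces so that it parses; coordinates are assumed to be in [longitude, latitude] order.

```python
import json

# Trimmed copy of the "place" example, wrapped in braces so it is valid JSON.
PLACE_JSON = """
{"place": {
    "place_type": "city",
    "full_name": "Somerset, KY",
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[[-84.647561, 37.031352], [-84.564839, 37.031352],
                         [-84.564839, 37.117608], [-84.647561, 37.117608]]]
    }
}}
"""


def bounding_box(place):
    """Return (min_lon, min_lat, max_lon, max_lat) for a place object."""
    ring = place["bounding_box"]["coordinates"][0]
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return min(lons), min(lats), max(lons), max(lats)


def contains(box, lon, lat):
    """True if the point (lon, lat) lies inside the box, edges included."""
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


place = json.loads(PLACE_JSON)["place"]
box = bounding_box(place)
```

The same two helpers could be applied to a box drawn around each of the four cities, which is one plausible way to check that a collected tweet really belongs to the city it was requested for.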

2.2 Dataset size

Mining since December 2014

City	Number of tweets	Box size
Amsterdam	679205	23.44
Antwerp	415813	83.03
Berlin	691998	45.01
Brussels	497667	46.94
Total	2284683

The Twitter API offers access to ~1% of all tweets

Georeferenced tweets constitute 1.6% of the stream (Leetaru et al. 2013)

2.3 Data processing

2.3.1 Language identification

High accuracy needed

Twitter’s own language identification is not reliable enough (it favours speed over reliability)

Implementation of the setup in (Lui & Baldwin 2014)

Removal of Twitter-specific patterns: #hashtags, @usernames, and http://links.co

Majority vote on three systems:

langid.py	97 languages
CLD2	> 80 languages
LangDetect	53 languages
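The cleaning and voting steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the function names, and the "unknown" fallback for a three-way disagreement are all assumptions, and the three identifiers are represented only by the labels they return so that the example runs without external libraries.

```python
import re
from collections import Counter

# Assumed patterns for hashtags, user mentions and links.
TWITTER_PATTERNS = re.compile(r"(#\w+|@\w+|https?://\S+)")


def clean_tweet(text):
    """Strip Twitter-specific tokens and collapse the remaining whitespace."""
    return " ".join(TWITTER_PATTERNS.sub(" ", text).split())


def majority_vote(labels, fallback="unknown"):
    """Return the label chosen by more than half of the systems.

    `labels` holds one predicted language code per identifier,
    e.g. the outputs of langid.py, CLD2 and LangDetect.
    The fallback for a full disagreement is an assumption.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else fallback


cleaned = clean_tweet("Hallo @ben kijk http://links.co #antwerp nu")
voted = majority_vote(["nl", "nl", "de"])
```

With three systems, a majority means at least two agree; how to treat tweets on which all three disagree is a design choice the slides do not specify.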

2.3.2 Bot detection

Only known available system is Bot or Not? (no API)

Bots generally tweet more than humans (Chu et al. 2010)
Bots' tweets are more evenly distributed across time (ibid.)

Look for the 100 accounts with the largest number of tweets in the database
16 bots accounting for 209,648 tweets (1/6) in the Berlin subset.
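The two cues above (volume and evenness over time) can be turned into a simple screening aid. The sketch below lists the most prolific accounts and scores how evenly an account's tweets are spread over the 24 hours of the day. Using normalized entropy as the evenness score is an assumption made for illustration; the slides do not say how evenness was judged, and the function names are hypothetical.

```python
import math
from collections import Counter


def top_accounts(user_ids, n=100):
    """Return the n most prolific accounts as (user_id, tweet_count) pairs."""
    return Counter(user_ids).most_common(n)


def hourly_evenness(hours):
    """Normalized entropy of posting hours (integers 0-23).

    Close to 1.0 when tweets are spread evenly over the day,
    close to 0.0 when they are concentrated in a single hour.
    """
    counts = Counter(hours)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(24)


ranking = top_accounts(["a", "b", "a", "c", "a", "b"], n=2)
```

Accounts that rank high on both measures would then be candidates for manual inspection rather than being removed automatically.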

3 Validation

Gap between Twitter data and an external statistic based on real values

Difficult to find a bias-free dataset of language use in cities

3.1 Problems

Double inference

From nationality to native language

From native language to actual Twitter language

Overrepresentation of prolific users/languages

English

Underrepresentation of certain languages

National Twitter-usage patterns (e.g. Romanian)

Illiteracy (minor communities, e.g. Somali)

(from urbanmovements.co.uk)

3.1.1 Dataset

http://statistik-berlin-brandenburg.de/

"Nationality","District","Migratory_background","Foreigners"
"fr","01 Mitte",782,3262
"fr","0101 Zentrum",385,1478
"fr","010111 Tiergarten Süd",63,218
"fr","01011101 Stülerstraße",9,30
"fr","01011102 Großer Tiergarten",0,3
"fr","01011103 Lützowstraße",35,94
"fr","01011104 Körnerstraße",16,79
"fr","01011105 Nördlicher Landwehrkanal",3,12
"fr","010112 Regierungsviertel",22,103

3.2 Multigroup segregation measures

Correlate neighbourhood segregation scores across datasets

3.2.1 Segregation measures

Used in segregation studies to compare cities according to income, race…

Weighted average, over organizational units, of the difference between total disproportionality and the unit's disproportionality, divided by total disproportionality

General mathematical definition (as in Reardon & Firebaugh 2002)

\(H = \sum\limits_{j=1}^J \frac{t_j}{TE}(E - E_j)\)

E: disproportionality function (e.g. entropy)
\(t_j\): number of individuals in organizational unit j
\(E_j\): entropy in organizational unit j
\(E\): total entropy (across organizational units)
\(T\): total number of individuals
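To make the definition concrete, the sketch below computes H with entropy as the disproportionality function, following the symbols defined above. The input layout (one list of group counts per organizational unit) and the function names are assumptions chosen for illustration.

```python
import math


def entropy(counts):
    """Entropy of a group composition given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def segregation_h(units):
    """H = sum_j t_j / (T * E) * (E - E_j).

    `units` is a list of organizational units, each a list of counts
    per group (same group order in every unit).
    """
    group_totals = [sum(column) for column in zip(*units)]
    T = sum(group_totals)
    E = entropy(group_totals)
    if E == 0:
        # Only one group present overall: no segregation to measure.
        return 0.0
    return sum(
        sum(u) / (T * E) * (E - entropy(u)) for u in units if sum(u) > 0
    )


fully_segregated = segregation_h([[10, 0], [0, 10]])
fully_mixed = segregation_h([[5, 5], [10, 10]])
```

The two toy inputs mark the extremes: H is 1 when each unit contains a single group and 0 when every unit mirrors the overall composition.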

3.2.2 Results

\(R^2 = 0.87\)

4 Language-based mobility patterns

Can we establish mobility patterns based on the language of tweets?

4.1 Sampling

Discard users with fewer than 35 tweets in a single language

Discard users whose ratio of most-used (l1) to second most-used (l2) language is below 4:1
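The two discard rules above can be sketched as a single filter. The reading of l1 and l2 as the user's most-used and second most-used languages is an assumption, as is the function name and the choice to keep users who tweet in only one language.

```python
def dominant_language(lang_counts, min_tweets=35, min_ratio=4.0):
    """Return the user's dominant language, or None if the user is discarded.

    `lang_counts` maps a language code to the user's number of tweets
    in that language, e.g. {"de": 40, "en": 5}.
    """
    ranked = sorted(lang_counts.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    l1, n1 = ranked[0]
    n2 = ranked[1][1] if len(ranked) > 1 else 0
    if n1 < min_tweets:
        return None  # too few tweets in any single language
    if n2 > 0 and n1 / n2 < min_ratio:
        return None  # l1/l2 proportion below 4:1
    return l1


kept = dominant_language({"de": 40, "en": 5})
```

For example, a user with 40 German and 20 English tweets would be dropped (ratio 2:1), while one with 40 German and 10 English tweets would be kept (ratio exactly 4:1).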

—

languages	en	ar	es	ru	pt	tr	de
# Users	592	60	64	42	37	111	747

Samples of size ~35 users per language, ~35 points per user

4.2 Computations

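Estimate mobility as the root-mean-square haversine distance of a user's tweets from that user's centroid

\(\sqrt{\frac{1}{n}\sum\limits_{i=1}^n \operatorname{haversine}(a_i, \bar{a})^2}\)

\(a_i\): location of the user's i-th tweet
\(\bar{a}\): centroid of the user's tweet locations
\(n\): number of tweets by the user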
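A minimal sketch of this computation follows. It assumes points are (latitude, longitude) pairs in degrees, an Earth radius of 6371 km, and a centroid taken as the plain mean of latitudes and longitudes, which is a reasonable shortcut at city scale but is not stated in the slides. All function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of latitudes and longitudes (city-scale shortcut)."""
    n = len(points)
    return (sum(lat for lat, lon in points) / n,
            sum(lon for lat, lon in points) / n)


def mobility_km(points):
    """Root-mean-square haversine distance of the points from their centroid."""
    c = centroid(points)
    return math.sqrt(sum(haversine_km(p, c) ** 2 for p in points) / len(points))


stationary = mobility_km([(52.5, 13.4)] * 3)
```

A user who always tweets from the same spot gets a mobility of 0; the score grows as the user's tweets spread out around their centroid.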

4.3 Results

One-way ANOVA

:              Df Sum Sq Mean Sq F value   Pr(>F)    
: lang          6  164.6  27.437   4.138 0.000516 ***
: Residuals   308 2042.4   6.631                     
: ---
: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
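For readers who want to see where the columns of the table come from, the sketch below computes a one-way ANOVA F value from scratch. Each Mean Sq is Sum Sq divided by Df, and F is the ratio of the two Mean Sq values; the 6 degrees of freedom for lang correspond to seven language groups. The function name and the toy input are hypothetical and unrelated to the Twitter data.

```python
def one_way_anova(groups):
    """Return (F value, df between groups, df within groups).

    `groups` is a list of samples, one list of numbers per group
    (here: one list of per-user mobility scores per language).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


f_value, df_between, df_within = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

Turning the F value into a p-value requires the F distribution, which is left to a statistics package.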

Pairwise t-tests

language pair	p-value
tr de	0.0002
ar de	0.0003
tr ar	0.9903
pt es	0.9518

5 Visualization

6 Future work

6.1 Analyse locality patterns

How tightly clustered are speakers of the same language?

Consider multiple clusters (language centroids)

6.2 Tourism detection

Assign users to the country from which the majority of their tweets came. (Hawelka et al. 2014)

Focused sampling of user timelines.
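The country-assignment rule above can be sketched in a few lines. This version picks the country with the most tweets (a plurality, which may be less than half), and the `is_visitor` helper is a hypothetical extension showing how the result could flag likely tourists; neither name comes from the cited work.

```python
from collections import Counter


def home_country(tweet_countries):
    """Country code that appears most often among a user's tweets, or None."""
    if not tweet_countries:
        return None
    return Counter(tweet_countries).most_common(1)[0][0]


def is_visitor(tweet_countries, city_country):
    """True if the user's assigned country differs from the city's country."""
    home = home_country(tweet_countries)
    return home is not None and home != city_country


home = home_country(["DE", "NL", "DE"])
```

This only works if a user's tweets outside the city are available, which is where focused sampling of user timelines would come in.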

7 The End

Thank you for your attention

Reardon, S. F. & Firebaugh, G. (2002). Measures of multigroup segregation. Sociological Methodology.
White, M. J. (1983). The measurement of spatial segregation. American Journal of Sociology.
Scheffler, T. (2014). Mapping German Tweets to Geographic Regions. KONVENS.
Leetaru, K. (2013). Mapping the global Twitter heartbeat: The geography of Twitter. First Monday.

Hawelka et al. (2014). Geo-located Twitter as proxy for global mobility patterns.
Lui, M. & Baldwin, T. (2014). Accurate language identification of Twitter messages. EACL.
Shuyo Nakatani (2012). Short text language detection with infinity-gram.
Shuyo Nakatani (2010). Language detection library.

McCandless, M. (2010). Accuracy and performance of Google’s compact language detector.
Lui, M. & Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. Proceedings of the ACL 2012.
Vandenbroucke, M. (2014). Language visibility, functionality and meaning across various TimeSpace scales in Brussels’ multilingual landscapes.