The application of machine learning, big data techniques, and criminology to the analysis of racist tweets.

Day, Ed (2018) The application of machine learning, big data techniques, and criminology to the analysis of racist tweets. Ph.D. thesis, Canterbury Christ Church University.


Racist tweets are ubiquitous on Twitter. This thesis explores the creation of an automated system to identify racist tweets and tweeters, and at the same time seeks a theoretical understanding of the tweets. To do this, a mixed-methods approach was employed: machine learning was used to identify racist tweets and tweeters, and grounded theory and other qualitative techniques were used to gain an understanding of the tweets’ content.

A total of 84 million tweets, each containing racist words, were collected from Twitter. Of these, 84,000 were hand-annotated as racist or not.

The machine learning was performed on a Hadoop cluster, utilising Spark and Hive. To identify racist tweets, a systematic comparison was performed of seven different algorithms and a large number of textual, user-derived and geographical features. Two new features, time of day and day of week, were also evaluated. The 84,000 hand-annotated tweets were used as input to the supervised classification processes. The combination of support vector machines with hour of day as an additional feature was found to be optimal for both accuracy (0.93) and area under the precision-recall curve (AUPRC, 0.86).
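As an illustration of the temporal features and the two evaluation metrics named above, the following is a minimal sketch in plain Python. It is not the thesis's code, which ran on Spark. The function names are hypothetical, the timestamp format is assumed to be the classic Twitter `created_at` layout, and AUPRC is estimated here as average precision, which is one common estimator. The abstract does not say which estimator was used.

```python
from datetime import datetime


def temporal_features(created_at):
    """Return (hour_of_day, day_of_week) for one tweet timestamp.

    Assumption: the timestamp uses the classic Twitter `created_at`
    layout, e.g. "Wed Oct 10 20:19:24 +0000 2018".
    day_of_week follows Python's convention: Monday = 0, Sunday = 6.
    """
    ts = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return ts.hour, ts.weekday()


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the hand-annotated labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def average_precision(y_true, scores):
    """Estimate AUPRC as average precision.

    Tweets are ranked by classifier score (highest first); precision is
    averaged over the ranks at which a positive (label 1) tweet appears.
    Tied scores are not treated specially in this sketch, and at least
    one positive label is assumed to be present.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    predictions = [1 if s >= 0.5 else 0 for s in scores]
    print(temporal_features("Wed Oct 10 20:19:24 +0000 2018"))
    print(accuracy(labels, predictions))
    print(average_precision(labels, scores))
```

Reporting AUPRC alongside accuracy is a common choice when classes may be imbalanced, because accuracy alone can look high for a classifier that rarely predicts the minority class.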

A qualitative exploration of tweets was also performed, including a grounded theory analysis.

A novel machine learning system to identify racist accounts was created using three sets of feature inputs: metrics from the racist tweets, concepts from the grounded theory, and a combination of the two. All three sets of features gave an accuracy of at least 0.82.

The ambiguity of the tweets made it difficult for both humans and machines to classify whether the tweeter’s intentions were racist, with the word ‘nigga’ being particularly problematic.

Grounded theory analysis of the tweets showed extremely narrow rhetoric that could be summarised in a single theoretical concept: the defence of the in-group.

Item Type: Thesis (Doctoral)
Subjects: H Social Sciences > HT Communities. Classes. Races
H Social Sciences > HV Social pathology. Social and public welfare. Criminology > HV6001 Criminology
Divisions: Faculty of Social and Applied Sciences > School of Law, Criminal Justice and Computing
Depositing User: Miss Rosemary Cox
Date Deposited: 06 Jun 2019 13:05
Last Modified: 06 Jun 2019 13:05
