Please use this identifier to cite or link to this item:
http://bura.brunel.ac.uk/handle/2438/31048
Title: | An investigation into the generalisability of fake news detection models |
Authors: | Hoy, Nathaniel |
Advisors: | Koulouri, T; Swift, S |
Keywords: | Natural Language Processing;Classification;Machine Learning |
Issue Date: | 2024 |
Publisher: | Brunel University London |
Abstract: | Fake news has emerged as a significant societal challenge, influencing public discourse, spreading disinformation, and eroding trust in democratic institutions. While supervised machine learning has become the predominant approach to addressing this issue, existing methods often struggle with generalisability. These limitations stem from an overreliance on coarsely labelled datasets, which fail to capture nuanced distinctions between fake and real news, and from the widespread use of token-based features, such as Bag-of-Words, TF-IDF, Word2Vec, and BERT. These features, while effective within specific datasets, are highly sensitive to dataset biases and source-specific patterns. Traditional evaluation techniques, such as holdout testing and K-fold cross-validation, exacerbate this issue by assuming the data is representative, an assumption that often fails when models are tested against real-world data. This thesis addresses these limitations by exploring strategies to enhance the generalisability of fake news detection models. It proposes the use of stylistic features, which focus on linguistic characteristics such as sentence structure, punctuation, readability, and persuasive language. These features are less reliant on specific word patterns and more robust to source biases. Additionally, the thesis introduces a novel set of ‘social-monetisation’ features to capture the economic motivations behind fake news, including the presence of advertisements, social media share buttons, and affiliate links. Together, these features offer a new perspective on detecting disinformation by focusing on the financial incentives driving its production. To assess generalisability, the research combines K-fold cross-validation with external validation: models are tested internally within each fold and externally on a manually labelled dataset after every fold.
This dual framework ensures performance is rigorously evaluated under both experimental conditions and real-world scenarios. By combining these strategies, the research addresses the shortcomings of traditional methods, providing a robust understanding of generalisability. Results demonstrate that token-based models, while effective within specific datasets, perform poorly in cross-dataset scenarios. In contrast, stylistic and social-monetisation features show greater resilience to dataset-specific biases and provide a more nuanced understanding of fake news characteristics. External validation further highlights the importance of evaluating models on diverse data to assess real-world performance. This research advances fake news detection by identifying the limitations of current approaches, proposing robust feature sets, and advocating for rigorous evaluation methods. Specifically, it has made four key contributions: demonstrating the advantages of stylistic features in improving fake news detection, introducing a novel category of features focused on social dissemination behaviours and economic incentives, developing a reduced and simplified feature set to enhance generalisability and efficiency, and establishing a novel evaluation framework for assessing model performance in this domain. |
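The dual evaluation protocol described in the abstract, internal testing within each fold plus an external check on a separate labelled dataset after every fold, can be sketched roughly as follows. The single-feature threshold classifier and the synthetic data are illustrative assumptions for the sketch, not the thesis's actual features or pipeline:

```python
import random

def accuracy(model, X, y):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def train_threshold_classifier(X, y):
    # Toy stand-in for a stylistic-feature model: choose the cutoff on
    # one numeric feature that best separates the classes in training.
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(X)):
        acc = accuracy(lambda x, t=t: int(x >= t), X, y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: int(x >= best_t)

def kfold_with_external(X, y, X_ext, y_ext, k=5, seed=0):
    """K-fold cross-validation with external validation after every fold.

    Returns one (internal_accuracy, external_accuracy) pair per fold,
    where the external score comes from a separate, manually labelled
    dataset that is never used for training.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    results = []
    for i in range(k):
        held_out = set(folds[i])
        train = [j for j in idx if j not in held_out]
        model = train_threshold_classifier([X[j] for j in train],
                                           [y[j] for j in train])
        internal = accuracy(model, [X[j] for j in folds[i]],
                            [y[j] for j in folds[i]])
        external = accuracy(model, X_ext, y_ext)  # external check per fold
        results.append((internal, external))
    return results
```

A large gap between the internal and external scores across folds is the signal the framework is designed to expose: the model fits its source dataset but does not generalise.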
Description: | This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London |
URI: | http://bura.brunel.ac.uk/handle/2438/31048 |
Appears in Collections: | Dept of Computer Science Theses
Files in This Item:
File | Description | Size | Format |
---|---|---|---|---
FulltextThesis.pdf | | 2.17 MB | Adobe PDF | View/Open
Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.