The Department of Statistical Sciences (STA) and the Clinton Health Access Initiative (CHAI) have come together to offer Master's students an exciting opportunity through a fellowship programme. The programme allows students in their dissertation year to analyse government data to answer a policy-relevant question, giving them work experience that is aligned with their dissertation. MASHA facilitated the establishment of this programme and is proud to present the STA-CHAI fellowship students below, grouped by cohort.

2023 Cohort

Jared Tavares

Home Country: South Africa
E-mail: jared.tavares@uct.ac.za

Education:

BusSci Analytics (University of Cape Town)
Honours in Statistics (University of Cape Town)
MSc Statistics and Data Science (University of Cape Town, current)

Thesis Title: Analysing fuel transactions of government vehicles in the Eastern Cape, South Africa

Abstract: Fuel management and fraud detection in government fleets are critical issues that have far-reaching financial and operational implications. To address these challenges, an investigation of fuel usage patterns and anomalies in the Eastern Cape Province government fleet in South Africa from April 2021 to January 2022 was conducted. Through the application of exploratory data analysis, clustering techniques, and predictive modelling, the research uncovers valuable insights that can be used to optimise fuel consumption and detect fraudulent activities within the fleet. Univariate and bivariate analyses reveal distinct patterns in fleet composition, transaction volumes, and fuel efficiency across various vehicle makes, model derivatives, and departments. The use of clustering techniques enables the identification of distinct vehicle segments and transaction patterns, emphasising the importance of considering contextual factors when analysing fuel usage. To detect potential fraud, three key indicators are developed: abnormally large transactions, frequent transactions, and fuel price differences. Predictive models, including XGBoost, Multi-layer Perceptron, and Random Forest, are employed to automate the classification of transactions based on these fraud indicators. The Multi-layer Perceptron demonstrates the best performance, achieving an accuracy of 87% on the test set. The dissertation acknowledges limitations due to the scope of the data and missing information for certain geographic variables such as district and site. Future research could expand the geographical and temporal range, incorporate qualitative data, explore real-time monitoring systems, and investigate vehicle maintenance and fuel efficiency. The present research makes a noteworthy contribution to the knowledge of fuel management and fraud detection in government fleets by offering a data-driven approach to expose inefficiencies and anomalies.
The insights and methodologies presented serve as a foundation for future research and practical applications, ultimately leading to more efficient, cost-effective, and transparent fleet operations.
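
The three fraud indicators named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how each indicator was computed, so the rules below (a tank-capacity limit, a minimum gap between fills, and a tolerance around a reference price) are assumptions made purely for illustration, as are the `Transaction` fields, the `flag_transactions` helper, and every threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Transaction:
    vehicle_id: str
    timestamp: datetime
    litres: float
    amount: float           # total paid for the fill
    reference_price: float  # assumed benchmark price per litre on that date


def flag_transactions(transactions, tank_capacity,
                      min_gap=timedelta(hours=12), price_tolerance=0.05):
    """Return one dict of boolean indicator flags per transaction, in time order."""
    results = []
    last_fill = {}  # vehicle_id -> timestamp of that vehicle's previous fill
    for tx in sorted(transactions, key=lambda t: t.timestamp):
        # Indicator 1 (assumed rule): more fuel than the vehicle's tank can hold.
        large = tx.litres > tank_capacity[tx.vehicle_id]
        # Indicator 2 (assumed rule): refuelled again sooner than the minimum gap.
        previous = last_fill.get(tx.vehicle_id)
        frequent = previous is not None and tx.timestamp - previous < min_gap
        # Indicator 3 (assumed rule): price paid per litre deviates from the
        # reference price by more than the tolerance.
        paid = tx.amount / tx.litres if tx.litres > 0 else float("inf")
        price_gap = abs(paid - tx.reference_price) / tx.reference_price > price_tolerance
        results.append({"vehicle_id": tx.vehicle_id, "large": large,
                        "frequent": frequent, "price_gap": price_gap})
        last_fill[tx.vehicle_id] = tx.timestamp
    return results
```

Flags of this kind could then serve as labels for classifiers such as those listed in the abstract, although the dissertation's actual labelling rules may differ.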

Juandre Liebenberg

Home Country: South Africa
E-mail: LBNJUA001@myuct.ac.za

Education:

BEng Mechanical Engineering (University of Johannesburg)
MSc Data Science (University of Cape Town, current)

Thesis Title: Predicting District Level HIV Prevalence in South Africa Using Medicine Ordering Data

Abstract: The Human Immunodeficiency Virus (HIV) has been at the forefront of South Africa’s public health challenges, placing the healthcare system under immense pressure. As a result of HIV planning by policymakers, more than 5.5 million People Living with HIV currently have access to antiretroviral treatment. Dynamic, mechanistic models such as the Thembisa and Naomi Bayesian models have been used to generate provincial and district-level estimates such as HIV prevalence, People Living with HIV, and the number of residents on antiretroviral treatment. An alternative methodology for estimating drug utilisation and predicting HIV estimates was explored, using medicine ordering data from 2020 to 2022 as the primary input for analysis. Two objectives were set out. The first was a drug utilisation analysis aimed at approximating the number of individuals per 1000 inhabitants per day taking antiretroviral drugs, to determine whether adequate stock was ordered at district and provincial levels. The second was to fit panel data and spatial linear models to predict district prevalence and People Living with HIV; the estimates for People Living with HIV were converted to prevalence so that the directly predicted prevalence could be compared with the calculated prevalence. Results from the drug utilisation analysis suggested that district municipalities hold insufficient stock to meet the demands of those living with the disease. In contrast, larger metropolitan municipalities hold excess medication, implying that people travel across district boundaries to receive treatment. The fitted spatial models generated better prevalence estimates than the fixed-effect panel data models for both the predicted and calculated prevalence, with root mean square errors of 0.009 (0.87%) and 0.012 (1.24%) compared with 0.012 (1.21%) and 0.015 (1.53%) from the fixed-effect panel data models.
The high quantities of antiretroviral drugs ordered by metropolitan municipalities resulted in an underestimation of prevalence in those regions, owing to the negative relationship between the dependent variable Prevalence and the independent variable Quantity. With the best-performing spatial model estimating 51 out of 52 district prevalence figures within the acceptable range, the study shows that using ordering data to predict disease prevalence has the potential to serve as an alternative methodology in the absence of established models.
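
Two calculations mentioned in the abstract, converting a People Living with HIV estimate into prevalence and scoring estimates with root mean square error, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the district figures are made up, and the sketch does not reproduce the panel data or spatial models themselves.

```python
import math


def prevalence_from_plhiv(plhiv, population):
    """Convert a count of People Living with HIV into a prevalence proportion."""
    if population <= 0:
        raise ValueError("population must be positive")
    return plhiv / population


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up district figures, for illustration only.
predicted_plhiv = [120_000, 45_000]
population = [1_000_000, 300_000]
reference_prevalence = [0.11, 0.16]

calculated = [prevalence_from_plhiv(n, pop)
              for n, pop in zip(predicted_plhiv, population)]
print(calculated)
print(rmse(calculated, reference_prevalence))
```

Because prevalence is expressed as a proportion here, an error of 0.01 corresponds to one percentage point, which matches the way the abstract pairs each error with a percentage.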

Magnolia Chikanya

Home Country: Zimbabwe
E-mail: CHKMAG002@myuct.ac.za

Thesis Title: Analysis of gender wage gap using mixed effects models

Abstract: Despite government interventions, the gender wage gap persists in workplaces. While reports on whether the gap is widening or narrowing vary, addressing this issue remains crucial. Traditionally, researchers have employed methods like the Blinder-Oaxaca decomposition and quantile regression to estimate the gender wage gap. However, these approaches often leave a high unexplained variance attributed to discrimination. In existing studies, gender wage gap estimates have typically been aggregated, and attempts to disaggregate the analysis have focused on broader levels such as occupations and salary bands. To delve deeper, human resource data from the national health department in South Africa's Eastern Cape province was leveraged. The goal was to analyse the gender wage gap for each job title using a novel approach: linear mixed effects regression. The linear mixed effects model captures variability not directly related to the dependent variables by accounting for variability within and across employees and job titles simultaneously, providing a more comprehensive understanding of the gender wage gap. The key findings were:
1. The unexplained variance in the gender wage gap was remarkably low, accounting for only 3% of the total stochastic variance. This suggests that factors beyond job titles play a minor role in explaining wage disparities.
2. Job titles emerged as highly significant, explaining 83% of the total random variance. This highlights the importance of considering specific roles when analysing the gender wage gap.
3. Over time, interesting patterns were observed. From 2010, the gender wage gap narrowed, but starting around 2015, it gradually widened again.
4. Encouragingly, 42% of the job title groups showed a gender wage gap in favour of women. Additionally, a substantial proportion of females occupied managerial and highly skilled positions.
Therefore, incorporating random effects techniques through linear mixed effects regression enriched the analysis of the gender wage gap. By examining job titles individually, detailed insights into this complex issue were gained. These findings underscore the importance of considering both fixed and random effects when studying wage disparities.
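
The variance percentages quoted in the abstract are shares of the total random variance. As a rough illustration of how such shares are obtained once a mixed effects model has produced its variance components, the Python sketch below divides each component by their sum. The component names and values are made up, and `variance_shares` is a hypothetical helper, not the thesis's code.

```python
def variance_shares(components):
    """Return each variance component as a fraction of the total."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {name: value / total for name, value in components.items()}


# Made-up variance components, for illustration only.
components = {"job_title": 0.040, "employee": 0.008, "residual": 0.002}
shares = variance_shares(components)
print({name: round(share, 2) for name, share in shares.items()})
```

A large share for the job-title component would indicate, as the abstract reports, that most of the random variation in wages sits between job titles rather than within them.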
