Data cleaning/processing

Data Import and Cleaning

HIV Deaths By Country Dataset

deaths = read_csv("data/final_hiv_deaths.csv") %>%
 janitor::clean_names() %>%
 select("location_id", "location_name", starts_with("sex"), starts_with("age"), "year", "val", "upper", "lower") %>%
 rename(country_name = location_name) %>% 
  mutate(year = as.character(year))

deaths %>%
 summarize(n_countries = n_distinct(country_name),
           n_years = n_distinct(year),
           col_name = colnames(deaths),
           n_missing = map(deaths, ~sum(is.na(.)))) %>%
 unnest(n_missing)

## # A tibble: 10 × 4
##    n_countries n_years col_name     n_missing
##          <int>   <int> <chr>            <int>
##  1         204      30 location_id          0
##  2         204      30 country_name         0
##  3         204      30 sex_id               0
##  4         204      30 sex_name             0
##  5         204      30 age_id               0
##  6         204      30 age_name             0
##  7         204      30 year                 0
##  8         204      30 val                  0
##  9         204      30 upper                0
## 10         204      30 lower                0

The deaths dataset contains HIV death count data from 204 distinct countries, in the 30 years between 1990 and 2019.

The deaths dataset contains 61200 rows and 10 columns, with each row representing a single record of number of HIV deaths in a country stratified by year, sex and age group. Variables in the deaths dataset are as follows: location_id, country_name, sex_id, sex_name, age_id, age_name, year, val, upper, lower.

GDP Per Capita By Country Dataset

gdp = read_csv("data/final_gdp_per_capita.csv", skip = 4, col_names = TRUE) %>% 
  janitor::clean_names() %>% 
  drop_na(location_id) 

names(gdp) <- sub('^x', '', names(gdp))

gdp_clean = gdp %>%  
  select("location_id", "country_name", "1990":"2019") %>% 
  pivot_longer(cols = "1990":"2019",
               names_to = "year",
               values_to = "GDP_per_capita") 

gdp_clean %>%
 summarize(n_countries = n_distinct(country_name),
           n_years = n_distinct(year),
           col_name = colnames(gdp_clean),
           n_missing = map(gdp_clean, ~sum(is.na(.)))) %>%
 unnest(n_missing)

## # A tibble: 4 × 4
##   n_countries n_years col_name       n_missing
##         <int>   <int> <chr>              <int>
## 1         200      30 location_id            0
## 2         200      30 country_name           0
## 3         200      30 year                   0
## 4         200      30 GDP_per_capita       288

The raw GDP dataset, gdp, contains GDP per capita (US dollars) from 200 distinct countries from 1960 to 2021. It contains 200 rows and 68 columns, with each row representing a single record of a country’s GDP per capital (US dollars) in a given year.

The gdp_clean dataset contains 6000 rows and 4 columns, with each row representing a single record of a country’s GDP per capital (US dollars) in a given year.

Population By Country Dataset

population = read_csv("data/final_population_total.csv", skip = 4, col_names = TRUE)  %>% 
  janitor::clean_names() %>% 
  drop_na(location_id) 

names(population) <- sub('^x', '', names(population))

population_clean = population %>%  
  select("location_id", "1990":"2019") %>% 
  pivot_longer(cols = "1990":"2019",
               names_to = "year",
               values_to = "Population") 

population_clean %>%
 summarize(n_countries = n_distinct(location_id),
           n_years = n_distinct(year),
           col_name = colnames(population_clean),
           n_missing = map(population_clean, ~sum(is.na(.)))) %>%
 unnest(n_missing)

## # A tibble: 3 × 4
##   n_countries n_years col_name    n_missing
##         <int>   <int> <chr>           <int>
## 1         200      30 location_id         0
## 2         200      30 year                0
## 3         200      30 Population         11

The raw population dataset, population, contains total population from 200 distinct countries, from 1960 to 2021. It contains 200 rows and 68 columns, with each row representing a single record of a country’s population in a given year.

The population_clean dataset contains 6000 rows and 3 columns, with each row representing a single record of a country’s population in a given year.

Merging 3 datasets to produce the “maindata” dataset for visualization and analysis

First, we merge the GDP Per Capita By Country dataset (gdp_clean) and Population By Country Dataset (population_clean) into a dataset called gdp_population. This combined dataset contains data from 200 distinct countries between 1990 and 2019.

gdp_population = inner_join(gdp_clean, population_clean, by = c("location_id", "year")) %>%
 janitor::clean_names() 

gdp_population %>%
 summarize(n_countries = n_distinct(country_name),
           n_years = n_distinct(year),
           col_name = colnames(gdp_population),
           n_missing = map(gdp_population, ~sum(is.na(.)))) %>%
 unnest(n_missing)

## # A tibble: 5 × 4
##   n_countries n_years col_name       n_missing
##         <int>   <int> <chr>              <int>
## 1         200      30 location_id            0
## 2         200      30 country_name           0
## 3         200      30 year                   0
## 4         200      30 gdp_per_capita       288
## 5         200      30 population            11

Next, we merge the gdp_population dataset with the HIV Deaths dataset (deaths) into a maindata dataset by matching on the countries and years and using left_join. We save this as a CSV for further analysis.

maindata = left_join(gdp_population, deaths, by = c("year","location_id")) %>% 
  select(-country_name.x, -age_id) %>% 
  rename(country_name = country_name.y) %>% 
  select(location_id, country_name, everything())

maindata %>% 
  summarize(n_countries = n_distinct(location_id),
            n_years = n_distinct(year),
            col_name = colnames(maindata),
            n_missing = map(maindata, ~sum(is.na(.)))) %>% 
  unnest(n_missing)

## # A tibble: 11 × 4
##    n_countries n_years col_name       n_missing
##          <int>   <int> <chr>              <int>
##  1         200      30 location_id            0
##  2         200      30 country_name           0
##  3         200      30 year                   0
##  4         200      30 gdp_per_capita      2880
##  5         200      30 population           110
##  6         200      30 sex_id                 0
##  7         200      30 sex_name               0
##  8         200      30 age_name               0
##  9         200      30 val                    0
## 10         200      30 upper                  0
## 11         200      30 lower                  0

write_csv(maindata, 'data/maindata.csv')

maindata has 60000 observations and 11 features. Basic features we are interested in include:

location_id: id corresponding to a country_name
sex_id: id corresponding to sex
sex_name: Male or Female
age_name: Aae groups including “0-9 years”, “10-24 years”, “25-49 years”, “50-74 years”, and “75+ years”.
year: Year of the records
val: Point estimate of number of HIV/AIDS deaths
upper: Upper estimate of number of HIV/AIDS deaths
lower: Lower estimate of number of HIV/AIDS deaths
gdp_per_capita: GDP per capita in 2022 US dollars
population: population total
country_name: name of country