deaths = read_csv("data/final_hiv_deaths.csv") %>%
janitor::clean_names() %>%
select("location_id", "location_name", starts_with("sex"), starts_with("age"), "year", "val", "upper", "lower") %>%
rename(country_name = location_name) %>%
mutate(year = as.character(year))
deaths %>%
summarize(n_countries = n_distinct(country_name),
n_years = n_distinct(year),
col_name = colnames(deaths),
n_missing = map(deaths, ~sum(is.na(.)))) %>%
unnest(n_missing)
## # A tibble: 10 × 4
## n_countries n_years col_name n_missing
## <int> <int> <chr> <int>
## 1 204 30 location_id 0
## 2 204 30 country_name 0
## 3 204 30 sex_id 0
## 4 204 30 sex_name 0
## 5 204 30 age_id 0
## 6 204 30 age_name 0
## 7 204 30 year 0
## 8 204 30 val 0
## 9 204 30 upper 0
## 10 204 30 lower 0
The deaths dataset contains HIV death count data from 204 distinct countries, in the 30 years between 1990 and 2019.
The deaths dataset contains 61200 rows and 10 columns, with each row representing a single record of number of HIV deaths in a country stratified by year, sex and age group. Variables in the deaths dataset are as follows: location_id, country_name, sex_id, sex_name, age_id, age_name, year, val, upper, lower.
gdp = read_csv("data/final_gdp_per_capita.csv", skip = 4, col_names = TRUE) %>%
janitor::clean_names() %>%
drop_na(location_id)
names(gdp) <- sub('^x', '', names(gdp))
gdp_clean = gdp %>%
select("location_id", "country_name", "1990":"2019") %>%
pivot_longer(cols = "1990":"2019",
names_to = "year",
values_to = "GDP_per_capita")
gdp_clean %>%
summarize(n_countries = n_distinct(country_name),
n_years = n_distinct(year),
col_name = colnames(gdp_clean),
n_missing = map(gdp_clean, ~sum(is.na(.)))) %>%
unnest(n_missing)
## # A tibble: 4 × 4
## n_countries n_years col_name n_missing
## <int> <int> <chr> <int>
## 1 200 30 location_id 0
## 2 200 30 country_name 0
## 3 200 30 year 0
## 4 200 30 GDP_per_capita 288
The raw GDP dataset, gdp, contains GDP per capita (US dollars) from 200 distinct countries from 1960 to 2021. It contains 200 rows and 68 columns, with each row representing a single record of a country’s GDP per capital (US dollars) in a given year.
The gdp_clean dataset contains 6000 rows and 4 columns, with each row representing a single record of a country’s GDP per capital (US dollars) in a given year.
population = read_csv("data/final_population_total.csv", skip = 4, col_names = TRUE) %>%
janitor::clean_names() %>%
drop_na(location_id)
names(population) <- sub('^x', '', names(population))
population_clean = population %>%
select("location_id", "1990":"2019") %>%
pivot_longer(cols = "1990":"2019",
names_to = "year",
values_to = "Population")
population_clean %>%
summarize(n_countries = n_distinct(location_id),
n_years = n_distinct(year),
col_name = colnames(population_clean),
n_missing = map(population_clean, ~sum(is.na(.)))) %>%
unnest(n_missing)
## # A tibble: 3 × 4
## n_countries n_years col_name n_missing
## <int> <int> <chr> <int>
## 1 200 30 location_id 0
## 2 200 30 year 0
## 3 200 30 Population 11
The raw population dataset, population, contains total population from 200 distinct countries, from 1960 to 2021. It contains 200 rows and 68 columns, with each row representing a single record of a country’s population in a given year.
The population_clean dataset contains 6000 rows and 3 columns, with each row representing a single record of a country’s population in a given year.
First, we merge the GDP Per Capita By Country dataset (gdp_clean) and Population By Country Dataset (population_clean) into a dataset called gdp_population. This combined dataset contains data from 200 distinct countries between 1990 and 2019.
gdp_population = inner_join(gdp_clean, population_clean, by = c("location_id", "year")) %>%
janitor::clean_names()
gdp_population %>%
summarize(n_countries = n_distinct(country_name),
n_years = n_distinct(year),
col_name = colnames(gdp_population),
n_missing = map(gdp_population, ~sum(is.na(.)))) %>%
unnest(n_missing)
## # A tibble: 5 × 4
## n_countries n_years col_name n_missing
## <int> <int> <chr> <int>
## 1 200 30 location_id 0
## 2 200 30 country_name 0
## 3 200 30 year 0
## 4 200 30 gdp_per_capita 288
## 5 200 30 population 11
Next, we merge the gdp_population dataset with the
HIV Deaths dataset (deaths) into a
maindata dataset by matching on the countries and years
and using left_join
. We save this as a CSV for further
analysis.
maindata = left_join(gdp_population, deaths, by = c("year","location_id")) %>%
select(-country_name.x, -age_id) %>%
rename(country_name = country_name.y) %>%
select(location_id, country_name, everything())
maindata %>%
summarize(n_countries = n_distinct(location_id),
n_years = n_distinct(year),
col_name = colnames(maindata),
n_missing = map(maindata, ~sum(is.na(.)))) %>%
unnest(n_missing)
## # A tibble: 11 × 4
## n_countries n_years col_name n_missing
## <int> <int> <chr> <int>
## 1 200 30 location_id 0
## 2 200 30 country_name 0
## 3 200 30 year 0
## 4 200 30 gdp_per_capita 2880
## 5 200 30 population 110
## 6 200 30 sex_id 0
## 7 200 30 sex_name 0
## 8 200 30 age_name 0
## 9 200 30 val 0
## 10 200 30 upper 0
## 11 200 30 lower 0
write_csv(maindata, 'data/maindata.csv')
maindata has 60000 observations and 11 features. Basic features we are interested in include:
location_id
: id corresponding to a country_namesex_id
: id corresponding to sexsex_name
: Male or Femaleage_name
: Aae groups including “0-9 years”, “10-24
years”, “25-49 years”, “50-74 years”, and “75+ years”.year
: Year of the recordsval
: Point estimate of number of HIV/AIDS deathsupper
: Upper estimate of number of HIV/AIDS deathslower
: Lower estimate of number of HIV/AIDS deathsgdp_per_capita
: GDP per capita in 2022 US dollarspopulation
: population totalcountry_name
: name of country