3 Survey Methods

This chapter introduces statistical methods for the analysis of complex survey data. We focus on two widely used data sources in global and population health:

  1. National Health and Nutrition Examination Survey (NHANES)
  2. Demographic and Health Surveys (DHS)

We start by briefly describing survey design features—stratification, clustering, and weighting—and why they matter for valid estimation and inference. We then demonstrate:

  • How to import and prepare NHANES and DHS data in R
  • How to specify the survey design using the survey package
  • How to obtain design-corrected descriptive statistics and regression models

The emphasis is on reproducible workflows that you can adapt to your own survey analyses.
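
As a first look at why the design features matter, the sketch below uses the api example data that ships with the survey package (California school data, not NHANES or DHS). It assumes the survey package is installed. The variable names stype, dnum, pw, and fpc belong to that example dataset and are not DHS or NHANES variables.

```r
library(survey)

# Example data bundled with the survey package (not NHANES or DHS)
data(api)

# Stratified sample: strata are school types, pw are the sampling weights
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Design-based mean and standard error of the api00 score
m_strat <- svymean(~api00, dstrat)
m_strat

# Naive mean that ignores the unequal weights, for comparison
mean(apistrat$api00)

# One-stage cluster sample: dnum identifies the sampled clusters (districts)
dclus <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# The standard error now accounts for clustering of schools within districts
m_clus <- svymean(~api00, dclus)
m_clus
```

Comparing the design-based estimate with the naive mean shows the effect of the weights, and the clustered design illustrates how the standard error depends on the sampling structure rather than on the number of rows alone.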

3.1 Demographic and Health Surveys (DHS)

The Demographic and Health Surveys (DHS) are surveys implemented in many developing countries. DHS uses a cluster sampling design to obtain nationally representative data on population, health, and nutrition. Once a survey is completed, the results are published and the data can be requested by researchers. When access is granted, a researcher can download and save the data files. Working with these files requires some statistical and programming skills, which are described here.

3.1.1 Data importing

The analysis uses data from five waves of the Rwanda Demographic and Health Survey (DHS) spanning two decades (2000–2020). We imported the official SAS-formatted raw data file (.SAS7BDAT) for each survey year into R using the haven package and checked the dimensions of each file after import.

The first step is to explore the downloaded folder. The list.files() function lists the files contained in the specified directory path (shown here for the folder stored under the 1992 survey year).

list.files("/Users/corneille/Desktop/Mike lab/DHS/1992/RWBR21SD")
## [1] "RWBR21FL.frq"      "RWBR21FL.frw"      "RWBR21FL.MAP"      "RWBR21FL.SAS"      "RWBR21FL.SAS7BDAT"
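
When a folder holds several file types, it can help to pick out the data file by its extension. The sketch below is a minimal illustration that runs on a temporary stand-in folder with empty placeholder files named like the listing above. The demo_dir folder is hypothetical and is not the real download folder.

```r
# Stand-in folder with empty placeholder files named like the listing above
demo_dir <- file.path(tempdir(), "RWBR21SD")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("RWBR21FL.frq", "RWBR21FL.MAP",
                                  "RWBR21FL.SAS", "RWBR21FL.SAS7BDAT")))

# Keep only the SAS data file, matching the extension regardless of case.
# full.names = TRUE returns the complete path, ready to pass to read_sas().
sas_file <- list.files(demo_dir, pattern = "\\.sas7bdat$",
                       ignore.case = TRUE, full.names = TRUE)
basename(sas_file)
```

The pattern is anchored at the end of the file name, so the RWBR21FL.SAS program file is not matched.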

The call library(haven) loads the haven package into the current R session. The haven package provides the functions (such as read_sas) needed to import data files saved in the formats of other statistical software, such as SAS (Statistical Analysis System), SPSS, and Stata.

The following lines read five SAS data files (identified by the .SAS7BDAT file extension) into the R environment, creating one data frame per survey year:

library(haven)
data2000 <- read_sas("/Users/corneille/Desktop/Mike lab/DHS/2000/RWBR41SD/RWBR41FL.SAS7BDAT")

data2005 <- read_sas("/Users/corneille/Desktop/Mike lab/DHS/2005/RWBR53SD/RWBR53FL.SAS7BDAT")

data2010 <- read_sas("/Users/corneille/Desktop/Mike lab/DHS/2010/RWBR61SD/RWBR61FL.SAS7BDAT")

data2015 <- read_sas("/Users/corneille/Desktop/Mike lab/DHS/2015/RWBR70SD/RWBR70FL.SAS7BDAT")

data2020 <- read_sas("/Users/corneille/Desktop/Mike lab/DHS/2020/RWBR81SD/RWBR81FL.SAS7BDAT")
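
The five read_sas() calls above follow the same pattern, so the paths can also be built from the survey year and the file stem. The sketch below is one possible way to do that, assuming the haven package is installed. The object names dhs_dir, stems, paths, and dhs are introduced here for illustration only. The directory and file stems are copied from the calls above.

```r
# Base directory copied from the read_sas() calls above; adjust to your machine
dhs_dir <- "/Users/corneille/Desktop/Mike lab/DHS"

# File stems by survey year, copied from the calls above
stems <- c("2000" = "RWBR41", "2005" = "RWBR53", "2010" = "RWBR61",
           "2015" = "RWBR70", "2020" = "RWBR81")

# Build <dir>/<year>/<stem>SD/<stem>FL.SAS7BDAT for every wave
paths <- file.path(dhs_dir, names(stems), paste0(stems, "SD"),
                   paste0(stems, "FL.SAS7BDAT"))
names(paths) <- names(stems)

# Read only the files that exist, returning a list named by survey year
found <- paths[file.exists(paths)]
dhs <- lapply(found, function(p) haven::read_sas(p))
names(dhs)
```

Keeping the waves in one named list makes it easy to apply the same check or recode to every survey year, for example dhs[["2010"]] for a single wave.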

Finally, the five commands below inspect the dimensions (number of rows and columns) of each newly loaded data frame. This is an important first check that the data were imported correctly, and it shows the size of each dataset (rows = number of observations; columns = number of variables).

dim(data2000)
## [1] 27602   934
dim(data2005)
## [1] 30072  1077
dim(data2010)
## [1] 32639  1021
dim(data2015)
## [1] 30058  1078
dim(data2020)
## [1] 30820  1216
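
The number of columns differs from wave to wave, so it is worth checking which variables are available in every survey year before combining them. The sketch below shows one way to tabulate dimensions and find shared variables. It uses small toy data frames as stand-ins for the imported waves, and the column names and values in them are hypothetical, not DHS data.

```r
# Toy stand-ins for the imported waves (hypothetical columns and values)
waves <- list(
  "2000" = data.frame(V001 = 1:3, V005 = c(1.2, 0.8, 1.0)),
  "2005" = data.frame(V001 = 1:4, V005 = c(1.1, 0.9, 1.0, 1.0), V190 = 1:4)
)

# One row per wave: number of rows and number of columns
dims <- t(sapply(waves, dim))
colnames(dims) <- c("rows", "columns")
dims

# Variables that are present in every wave
shared <- Reduce(intersect, lapply(waves, names))
shared
```

With the real data, the same two steps can be applied to a named list such as list("2000" = data2000, "2005" = data2005) and so on for the remaining waves.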