Cluster Sampling in R

Last Updated : 25 Jul, 2025

Cluster sampling is a sampling technique in which the population is divided into groups, or clusters, and a random sample of those clusters is selected for analysis. Instead of sampling individual elements directly from the whole population, we select entire clusters and then either include every element in them or draw a further sample within each selected cluster.

How to Perform Cluster Sampling

  1. Defining the Population: We identify the target population we want to study, such as households, schools, hospitals or other relevant units.
  2. Defining the Clusters: We divide the population into clusters, which are naturally occurring groups like cities, states, schools or hospitals.
  3. Randomly Selecting Clusters: We use a random sampling method to choose a subset of clusters, ensuring each cluster has an equal chance of being selected to avoid bias.
  4. Collecting Data from Selected Clusters: We collect data from all units within the selected clusters or from a sample within each cluster, using surveys, interviews, observations or existing records.
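
The four steps above can be sketched in a few lines of base R. This is a minimal illustration rather than a prescribed workflow: the `students` data frame, its column names and the choice of 3 schools are all hypothetical.

```r
set.seed(42)

# Steps 1-2: a hypothetical population of 60 students grouped into 6 schools (clusters)
students <- data.frame(
  School = rep(paste("School", 1:6, sep = "_"), each = 10),
  Score = round(rnorm(60, mean = 70, sd = 8))
)

# Step 3: randomly select 3 schools, each with an equal chance of selection
selected_schools <- sample(unique(students$School), size = 3)

# Step 4: collect data from every student in the selected schools
cluster_sample <- students[students$School %in% selected_schools, ]

# Each selected school contributes all 10 of its students
table(cluster_sample$School)
```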

Implementation of Cluster Sampling in R

We implement cluster sampling in the R programming language by selecting groups (clusters) from a population and, optionally, sampling individual elements within them using one-stage, two-stage or multi-stage approaches.

1. Performing Single-Stage Cluster Sampling

We randomly select clusters and include all elements within those selected clusters.

  • set.seed: Ensures reproducibility of random results.
  • data.frame: Used to create a structured dataset.
  • paste: Concatenates strings with a separator.
  • rep: Repeats each cluster label once per element in the cluster.
  • rnorm: Generates normally distributed random values.
  • unique: Returns the distinct cluster labels.
  • sample: Draws random values from a vector.
  • %in%: Logical operator used to filter based on selected values.
  • head: Displays the first few rows of a data frame.
R
set.seed(123)

# Population: 100 supermarkets (clusters), each with 10 surveyed customers
population <- data.frame(
  Supermarket = rep(paste("Supermarket", 1:100, sep = "_"), each = 10),
  CustomerSatisfaction = rnorm(1000, mean = 75, sd = 10)
)

# Randomly select 10 supermarkets
selected_supermarkets <- sample(unique(population$Supermarket), size = 10, replace = FALSE)

# Keep every customer from the selected supermarkets
sampled_data <- population[population$Supermarket %in% selected_supermarkets, ]

head(sampled_data)


2. Performing Two-Stage Cluster Sampling

We first randomly select clusters and then sample individual elements within those clusters.

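  • rep: Repeats elements of a vector.
  • sample: Selects random values without replacement.
  • data.frame: Creates tabular datasets for simulation.
  • %in%: Filters rows based on values from selected clusters.
  • lapply: Applies a function to each selected cluster.
  • do.call and rbind: Combine the per-cluster samples into one data frame.
  • head: Shows the top rows of the sampled output.
R
set.seed(123)

# Sampling frame of 500 neighborhoods (the clusters)
region <- data.frame(
  Neighborhood = paste("Neighborhood", 1:500, sep = "_"),
  AverageIncome = rnorm(500, mean = 50000, sd = 10000)
)

# 20 households in every neighborhood (10,000 rows in total)
households <- data.frame(
  Neighborhood = rep(region$Neighborhood, each = 20),
  HouseholdID = rep(1:20, times = 500),
  EmploymentStatus = sample(c("Employed", "Unemployed"), size = 10000, replace = TRUE)
)

# Stage 1: randomly select 5 neighborhoods and keep their households
selected_neighborhoods <- sample(region$Neighborhood, size = 5, replace = FALSE)
stage1_households <- households[households$Neighborhood %in% selected_neighborhoods, ]

# Stage 2: randomly select 5 of the 20 households within each selected neighborhood
sampled_households <- do.call(rbind, lapply(selected_neighborhoods, function(nb) {
  in_cluster <- stage1_households[stage1_households$Neighborhood == nb, ]
  in_cluster[sample(nrow(in_cluster), size = 5), ]
}))

head(sampled_households)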


3. Performing Multi-Stage Cluster Sampling

We sample from nested levels: first states, then counties within the selected states. Further stages, such as sampling specific units within each selected county, follow the same pattern.

  • sample: Used repeatedly to perform multi-level sampling.
  • rep: Repeats cluster identifiers for structure.
  • rnorm: Simulates numeric data like vaccination rates.
  • %in%: Helps select nested clusters.
  • head: Outputs initial rows of final data.
R
set.seed(123)

# Sampling frame of 50 states
states <- data.frame(
  State = paste("State", 1:50, sep = "_"),
  Population = sample(1000000:5000000, 50, replace = TRUE)
)

# 20 counties in every state (1,000 rows in total)
counties <- data.frame(
  State = rep(states$State, each = 20),
  County = rep(paste("County", 1:20, sep = "_"), times = 50),
  VaccinationRate = rnorm(1000, mean = 70, sd = 5)
)

# Stage 1: randomly select 3 states
selected_states <- sample(states$State, size = 3, replace = FALSE)

# Stage 2: randomly select 5 counties from the selected states.
# County names repeat across states, so sample row indices rather than names.
candidate_rows <- which(counties$State %in% selected_states)
selected_rows <- sample(candidate_rows, size = 5, replace = FALSE)
sampled_vaccination_centers <- counties[selected_rows, ]

head(sampled_vaccination_centers)

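
A third stage, sampling specific units within counties, could be sketched as follows. The `centers` data frame, the `CountyID` and `DosesGiven` columns and the sample sizes at each stage are hypothetical and chosen only to keep the example small.

```r
set.seed(123)

# Hypothetical frame: 10 states x 5 counties x 4 centers = 200 vaccination centers
centers <- expand.grid(
  Center = paste("Center", 1:4, sep = "_"),
  County = paste("County", 1:5, sep = "_"),
  State = paste("State", 1:10, sep = "_"),
  stringsAsFactors = FALSE
)
# County names repeat across states, so build an ID that is unique per county
centers$CountyID <- paste(centers$State, centers$County, sep = "/")
centers$DosesGiven <- rpois(nrow(centers), lambda = 500)

# Stage 1: select 3 states
stage1_states <- sample(unique(centers$State), size = 3)

# Stage 2: select 2 counties within each selected state
stage2_counties <- unlist(lapply(stage1_states, function(s) {
  sample(unique(centers$CountyID[centers$State == s]), size = 2)
}))

# Stage 3: select 2 centers within each selected county
multi_stage_sample <- do.call(rbind, lapply(stage2_counties, function(cid) {
  in_county <- centers[centers$CountyID == cid, ]
  in_county[sample(nrow(in_county), size = 2), ]
}))

nrow(multi_stage_sample)  # 3 states x 2 counties x 2 centers = 12 rows
```

Sampling within each parent unit, rather than from a pooled list of names, keeps every selected county inside a selected state and every selected center inside a selected county.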

4. Performing Cluster Sampling on Iris Dataset

We apply two-stage cluster sampling on the iris dataset by selecting species as clusters and then selecting observations within them.

  • set.seed: Ensures reproducibility.
  • data: Loads built-in datasets.
  • unique: Identifies distinct values.
  • sample: Selects clusters and rows.
  • %in%: Filters by selected species.
  • rownames: Retrieves row names.
  • lapply: Applies a function over a list.
  • cat: Prints formatted output.
  • rbind: Combines rows.
R
set.seed(123)

data(iris)

# Stage 1: randomly select 2 of the 3 species as clusters.
# as.character() makes cat() print species names instead of factor codes.
selected_clusters <- sample(as.character(unique(iris$Species)), size = 2, replace = FALSE)

stage1_data <- iris[iris$Species %in% selected_clusters, ]

cat("\nSelected species for the first stage (Clusters):", selected_clusters)

# Stage 2: randomly select row names within each selected species
observations_per_species <- 1
sampled_observations <- lapply(selected_clusters, function(species) {
  species_observations <- rownames(stage1_data[stage1_data$Species == species, ])
  sample(species_observations, size = observations_per_species, replace = FALSE)
})

# Combine the sampled rows into one data frame
cluster_sample <- do.call(rbind, lapply(sampled_observations, function(rows) iris[rows, ]))

cat("\nSelected observations for the second stage (Individual elements):", rownames(cluster_sample))

The first line of the output reports the species chosen in the first stage, and the second line lists the row names of the observations drawn in the second stage.
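
The same two-stage logic can be wrapped in a small reusable function. `two_stage_sample()` is a hypothetical helper written for this illustration, not a function from base R or any package, and it assumes every selected cluster has at least `n_per_cluster` rows.

```r
# Hypothetical helper: select n_clusters groups, then n_per_cluster rows in each
two_stage_sample <- function(data, cluster_col, n_clusters, n_per_cluster) {
  clusters <- unique(as.character(data[[cluster_col]]))
  stopifnot(n_clusters <= length(clusters))

  # Stage 1: choose clusters
  chosen <- sample(clusters, size = n_clusters)

  # Stage 2: choose row positions within each chosen cluster
  rows <- unlist(lapply(chosen, function(cl) {
    idx <- which(data[[cluster_col]] == cl)
    stopifnot(n_per_cluster <= length(idx))
    idx[sample.int(length(idx), size = n_per_cluster)]
  }))

  data[rows, ]
}

set.seed(123)
iris_sample <- two_stage_sample(iris, "Species", n_clusters = 2, n_per_cluster = 5)
table(droplevels(iris_sample$Species))
```

Sampling row positions with `sample.int()` avoids a known pitfall of `sample()`, which treats a single number `x` as the range `1:x`.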

Applications

  1. Educational Studies: Sampling schools or classrooms as clusters to study student performance.
  2. Health Surveys: Sampling medical facilities as clusters to assess patient demographics and health outcomes.
  3. Market Research: Sampling cities or neighborhoods as clusters to analyze consumer behavior and preferences.

Advantages

  1. Reduces resources required for data collection and analysis, especially in geographically dispersed populations.
  2. Easier to implement compared to other methods, suitable for large-scale surveys or studies.
  3. Convenient when the population naturally divides into clusters, facilitating access to sampling units.
  4. Focuses resources on a smaller number of clusters, improving sampling and data collection efficiency.

Disadvantages

  1. Introduces additional variability due to similarities within clusters, leading to higher sampling errors.
  2. May reduce precision compared to a simple random sample of the same size, especially when clusters differ widely from one another.
  3. Risk of bias if clusters are not representative or vary in size, affecting sample representativeness.
  4. Requires specialized statistical techniques to account for clustering and obtain unbiased estimates.
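
To illustrate why clustering needs to be accounted for, the sketch below compares a naive standard error, which treats the sampled rows as a simple random sample, with one based on cluster means. It assumes one-stage sampling of equal-sized clusters, where the mean of the cluster means serves as the estimate. The simulated data, the variable names and the size of the store-level effect are made up for illustration; unequal cluster sizes and more complex designs call for dedicated survey tools.

```r
set.seed(1)

# Hypothetical population: 100 stores (clusters) with 20 customers each.
# A shared store-level effect makes customers in the same store similar.
n_stores <- 100
n_customers <- 20
store_effect <- rnorm(n_stores, mean = 0, sd = 5)
population <- data.frame(
  Store = rep(seq_len(n_stores), each = n_customers),
  Satisfaction = 75 + rep(store_effect, each = n_customers) +
    rnorm(n_stores * n_customers, mean = 0, sd = 5)
)

# One-stage cluster sample: 10 stores, all customers in each
n_selected <- 10
selected <- sample(seq_len(n_stores), size = n_selected)
cluster_sample <- population[population$Store %in% selected, ]

# Naive standard error: ignores the clustering
naive_se <- sd(cluster_sample$Satisfaction) / sqrt(nrow(cluster_sample))

# Cluster-based standard error: treats the store means as the sampled units,
# with a finite population correction for sampling 10 of 100 stores
store_means <- tapply(cluster_sample$Satisfaction, cluster_sample$Store, mean)
cluster_se <- sd(store_means) / sqrt(n_selected) * sqrt(1 - n_selected / n_stores)

c(estimate = mean(store_means), naive_se = naive_se, cluster_se = cluster_se)
```

With a store-level effect of this size, the cluster-based standard error will typically come out larger than the naive one, which reflects the loss of precision listed above.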