Sampling function in R

Challenge

We often needed to extract a sample from a database for analysis. The main challenge was to automate the entire process: the function takes a CSV file as input and returns the sampled data.

Deliverable

We created an R function that calculates the necessary sample size and extracts a simple random sample (SRS) from the original dataset. If you want to see the code for this function, feel free to check my GitHub. Below is a more detailed explanation of the sample size calculation used.

Simple Random Sample (SRS)

To calculate the size of a simple random sample (SRS), in which every individual in the population has an equal probability of being selected, we used the formula proposed by COCHRAN (1977):

n = n0 / (1 + n0/N), where n0 = p × (1 − p) × zα^2 / ε^2

where n is the sample size, N is the population size, p is the proportion (or prevalence), zα is the value of the Z statistic (the quantile of the standard normal distribution) for the chosen confidence level, and ε is the allowable sampling error (or margin of error). According to BOLFARINE and BUSSAB (2012), a 95% confidence level is commonly used, so the quantile of the standard normal distribution (zα) is approximately 1.96, and a priori a population proportion (p) of 50% is assumed. This value of p maximizes p × (1 − p) and therefore yields the most conservative (largest) sample size. The smaller the allowable sampling error and the higher the confidence level, the larger the sample size required. The sample size calculation proposed by COCHRAN (1977) is widely used in sample surveys where no population parameters are known in advance.
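As an illustration of the calculation above, here is a minimal sketch in R. It is not the function from the GitHub repository. The names cochran_sample_size, draw_srs and sample_csv, the argument names, and the default 5% margin of error are all assumptions made for this example. The sketch also assumes a two-sided interval, so the 95% level maps to qnorm(0.975), about 1.96.

```r
# Hypothetical helper: Cochran's sample size with finite population correction.
# Defaults (p = 0.5, 95% confidence, 5% margin) are illustrative assumptions.
cochran_sample_size <- function(N, p = 0.5, conf_level = 0.95, margin = 0.05) {
  stopifnot(N >= 1, p > 0, p < 1, conf_level > 0, conf_level < 1, margin > 0)
  z  <- qnorm(1 - (1 - conf_level) / 2)  # two-sided quantile, ~1.96 for 95%
  n0 <- p * (1 - p) * z^2 / margin^2     # sample size for an infinite population
  n  <- n0 / (1 + n0 / N)                # finite population correction
  min(N, ceiling(n))                     # round up, never exceed the population
}

# Hypothetical helper: draw a simple random sample (without replacement)
# from a data frame, sized by the function above.
draw_srs <- function(data, ...) {
  n <- cochran_sample_size(nrow(data), ...)
  data[sample.int(nrow(data), size = n), , drop = FALSE]
}

# Hypothetical wrapper: read a CSV and return the sampled rows.
sample_csv <- function(path, ...) {
  draw_srs(read.csv(path), ...)
}
```

For example, with N = 10,000 and the defaults, n0 is about 384.15 and n is about 369.9, so cochran_sample_size(10000) returns 370. Calling set.seed() before draw_srs() or sample_csv() makes the draw reproducible.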