Sampling
Sampling is the selection of a subset of data from within a statistical population to estimate characteristics of the whole population.
Sampling Factors
Application - the use to which the data will be put
Availability - the availability of data sources and the data itself
Bias - aka selection bias, in which proper randomization is not achieved
Cost - the cost and resources needed to collect, store, and maintain data
Representation - the need to include or exclude specific groups of data
Methods of Sampling
Which method of sampling is chosen depends on sampling factors such as those shown above. Methods of sampling include:
Cluster Sampling - samples are selected from data organized into clusters such as by geography
Convenience Sampling - samples are selected from data close at hand
Quota Sampling - data is organized into multiple groups and samples are selected, not necessarily at random, until a set quota for each group is filled
Simple Random Sampling - samples are chosen by chance
Snowball Sampling - samples are selected from an initial group, and further samples are then identified through referrals from that initial group
Stratified Sampling - data is organized into multiple groups (strata) and samples are selected at random from each group
Systematic Sampling - aka interval sampling, samples are selected at regular intervals from an ordered list
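As a rough illustration, the sketch below implements four of the methods above (simple random, systematic, stratified, and cluster sampling) using only Python's standard library. The function names, the 100-record population, and its "region" field are hypothetical and chosen purely for demonstration.

```python
import random

def simple_random_sample(data, k, rng):
    """Choose k items purely by chance, without replacement."""
    return rng.sample(data, k)

def systematic_sample(data, step, rng):
    """Take every step-th item from an ordered list, starting at a random offset."""
    start = rng.randrange(step)
    return data[start::step]

def stratified_sample(data, key, fraction, rng):
    """Randomly sample the same fraction from every group (stratum)."""
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

def cluster_sample(data, key, n_clusters, rng):
    """Randomly pick whole clusters and keep every item in the chosen clusters."""
    clusters = {}
    for item in data:
        clusters.setdefault(key(item), []).append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: 100 records, each tagged with one of four regions.
population = [{"id": i, "region": "NSEW"[i % 4]} for i in range(100)]
by_region = lambda record: record["region"]

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, by_region, 0.2, rng)
clustered = cluster_sample(population, by_region, 2, rng)
```

Note the difference between the last two: stratified sampling draws a few records from every region, while cluster sampling keeps all records from a few randomly chosen regions.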
Sample Size Determination
Sample size determination is an important factor in the sampling process. Approaches to sample size determination include one or a combination of methods such as:
Experimentation
During the Modeling Process, various factors such as Loss, Bias, Variance, and Accuracy can be monitored to determine the best sample size.
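One way to picture this kind of experimentation is to fit the same model on progressively larger samples and watch a validation metric. The sketch below does this for a simple least-squares line on synthetic data; the data generator, the sample sizes, and the choice of mean squared error as the loss are all assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)

def make_data(n):
    # Synthetic data: y = 3x + 2 plus Gaussian noise with standard deviation 1.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]
    return xs, ys

train_x, train_y = make_data(2000)
val_x, val_y = make_data(500)

# Fit on growing prefixes of the training data and record validation loss.
history = []
for size in (10, 50, 250, 1000, 2000):
    model = fit_line(train_x[:size], train_y[:size])
    history.append((size, mse(model, val_x, val_y)))

for size, loss in history:
    print(f"n={size:5d}  validation MSE={loss:.3f}")
```

When the validation loss stops improving as the sample grows, additional data is unlikely to help this particular model much, which suggests a reasonable sample size.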
Larger is Better
Some Machine Learning Models, such as Artificial Neural Networks, perform better with large sample sizes. This is in part due to the number of weights connecting the network graph nodes that need to be fitted during the model training process. However, sample size is just one of the settings that can be tuned alongside the modeling hyperparameters, so increasing sample size can be combined with modeling experimentation to achieve optimal results.
Statistical Power
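The Statistical Power of a binary hypothesis test is the probability that the test rejects the null hypothesis when a specific alternative hypothesis is true. For a given effect size and significance level, power increases with sample size, so a target power can be used to work out the minimum sample size needed.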
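As a minimal sketch of a power-based calculation, the code below uses the standard normal-approximation formula for a two-sided, two-sample z-test with equal group sizes and a known, common standard deviation. The function names and the example values (effect of 0.5, standard deviation of 1, significance level 0.05, target power 0.80) are assumptions for illustration; other test designs need other formulas.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test with known, equal sigma.

    n = 2 * ((z_(1 - alpha/2) + z_power) * sigma / effect) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)

def achieved_power(n, effect, sigma, alpha=0.05):
    """Approximate power of the same test with n observations per group.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect / (sigma * math.sqrt(2 / n)) - z_alpha)

n = sample_size_per_group(effect=0.5, sigma=1.0)
print(n)  # 63
print(round(achieved_power(n, effect=0.5, sigma=1.0), 3))
```

Because the effect enters the formula squared, halving the effect to be detected roughly quadruples the required sample size.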
Resampling
Resampling is the process of:
changing/exchanging data samples
identifying the impact of these changes on model and prediction characteristics
continuing until optimal results are achieved
Resampling can include methods such as:
Bootstrap - new samples are drawn from the original data by random sampling with replacement
Jackknife - an estimate of a parameter is calculated repeatedly, each time leaving out one observation from the dataset, and the resulting estimates are then averaged
Label Exchange - the class labels associated with data samples are exchanged among the samples
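The three methods can be sketched with the standard library as follows. This is one possible reading, not a definitive implementation: the statistic is assumed to be the mean, Label Exchange is interpreted here as a permutation-style test that shuffles group labels, and the two groups of measurements are made up for the example.

```python
import random
from statistics import mean

def bootstrap_means(data, n_resamples, rng):
    """Means of resamples drawn with replacement, each the size of the data."""
    return [mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)]

def jackknife_mean(data):
    """Average of the leave-one-out estimates of the mean."""
    leave_one_out = [mean(data[:i] + data[i + 1:]) for i in range(len(data))]
    return mean(leave_one_out)

def permutation_p_value(group_a, group_b, n_permutations, rng):
    """Share of label shuffles whose mean difference is at least the observed one."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # exchange labels by reassigning group membership
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return hits / n_permutations

rng = random.Random(1)  # fixed seed so the run is repeatable
# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.2, 5.7]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4, 4.6, 4.3]

boot = sorted(bootstrap_means(group_a, 1000, rng))
ci_95 = (boot[25], boot[974])  # rough 95% percentile interval for the mean
jack = jackknife_mean(group_a)
p_value = permutation_p_value(group_a, group_b, 2000, rng)
```

The spread of the bootstrap means indicates how much the estimate would vary across samples, while a small permutation p-value suggests the difference between the groups is unlikely to come from the labeling alone.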