Fundamentals of Research: Selecting and Controlling Sample

Posted by  Derek Jones

POSTED ON  April 21, 2020

CATEGORIES  Learn

If our role is to inform better decisions for marketers, brand custodians and policy makers through evidence-based research, then it is critical that the results of that research can be generalised to the target population. This is the realm of sampling and sample size.

When it comes to research, sample size does matter. But although it might be tempting to get as many responses as possible, it’s important to think about not only how many responses you get but how many of whom you get. This is where it’s important to consider how you will select and control the sample and the sample size.

Sampling theory is complex and grounded in statistics and probability, and explaining it in full detail is well beyond the scope of this post. To simplify things, we have created a quick, easy-to-understand 5-point sampling process, which we discuss below:

  • Define the population. This is the group of interest being studied. It is usually made up of elements or sampling units (adults, voters, users of a certain product etc.).
  • Identify the sampling frame. This is a “list” of all the sampling units available for selection at a given stage of the sampling process. The actual sample is drawn from this list.
  • Determine the sample size. How many respondents (n) will be included in the sample? e.g. n=1,000; n=500; n=200
  • Select the sample. Choose and execute a specific procedure by which the sample will be drawn, e.g. the random versus the interval method.
  • Control the sample. Set sample quotas to further ensure your sample is representative.

Let’s talk about each of these in more detail.


Define the population

Although defining the population appears straightforward, it is important to spend time to define exactly who you want to speak to. This is usually based on either:

  • a target consumer profile e.g. “in market” (consumers who purchased or are considering purchasing a product or service of interest),
  • a demographic such as age, gender or location, or
  • a psychographic, i.e. an attitude or opinion on something.

Identify the sampling frame

For your sample to be representative of the target population, you need to come as close as possible to drawing a random sample from that population, effectively giving every member of the population the same chance or probability of being selected for the research. All our sampling error rates are calculated on this assumption.

In practice, though, you first need access to a sampling frame. This is a database or list of every member of the target population. Such lists are often hard to find, or access to them is restricted.

If you wanted to sample the entire Australian adult population properly, for example, you would need access to a list that has full enumeration of all adults such as the electoral roll. In practice, researchers generally don’t have access to these types of sampling frames for the general population.

Having said that, not all research is targeted at general consumers and, in practice, clients can often provide a sampling frame such as customer lists.

Regardless of the difficulties we currently face in terms of sampling frames, we think it’s important for researchers and research buyers to understand how sampling should be done from first principles, so here is the process.

Determine the sample size

Why does sample size matter? Sample size matters because, as a rule of thumb, the larger the sample, the more accurately it will reflect the target population. Well, that’s the theory.

But it’s not as simple as just bigger is better.

Firstly, there is the notion of sampling error. Sampling error is a calculation of the likely margin of error we will obtain from our sample when inferring those results back to the total population in question. Put more plainly, it’s the likely difference between asking a certain number of randomly chosen people from a population and asking the entire population (a census).

We can calculate these error rates (the margin of error, i.e. the half-width of the confidence interval) for large populations (over 10,000) assuming two things: we have a sampling frame, and we have used probability sampling (random selection) to choose potential respondents.

If this is the case, then it is true that the larger the sample, the lower the error rate and the more accurate our research findings. There is, though, a point of diminishing returns on investment here. Although bigger is better, we reach a practical balance of accuracy and cost at about n=1,000 (roughly ±3 percentage points at 95% confidence). To reach ±1 percentage point, we need to increase the sample size roughly tenfold, i.e. up to n=10,000, which is usually cost prohibitive. Thus, we often see polling surveys based on a sample of n=1,000.
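To make the diminishing returns concrete, here is a minimal sketch of the sample size needed for a given margin of error. It assumes the standard formula for a proportion under simple random sampling from a large population, with the most conservative case of a 50/50 split (p = 0.5) and z = 1.96 for 95% confidence. The function name and defaults are our own illustration, not a fixed standard.

```python
import math

def required_sample_size(margin, z=1.96, p=0.5):
    """Smallest n whose margin of error is at most `margin`.

    `margin` is a proportion (0.03 means ±3 percentage points).
    Assumes simple random sampling from a large population.
    """
    n = (z ** 2) * p * (1 - p) / margin ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(n, 6))

for margin in (0.05, 0.03, 0.02, 0.01):
    n = required_sample_size(margin)
    print(f"±{margin * 100:.0f} percentage points -> n = {n:,}")
```

Under these assumptions, the margin of error shrinks with the square root of the sample size, so cutting the error from ±3 to ±1 points takes roughly nine times as many respondents. That is the arithmetic behind the tenfold rule of thumb.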

Some common sample sizes and error rates at 95% confidence for large populations over 10,000 are shown below (smaller populations need to be calculated differently based on both the population size and the sample size).

Sample size    Sampling error (95% confidence)
n=100          ±9.8 percentage points
n=300          ±5.7 percentage points
n=500          ±4.4 percentage points
n=1,000        ±3.1 percentage points
n=10,000       ±1.0 percentage points
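For readers who want to check figures like these, the sketch below computes the margin of error for a proportion at 95% confidence. It assumes simple random sampling and the worst case of p = 0.5. The optional `population` argument applies the finite population correction, which is one common way (our assumption, not the only one) of handling the smaller populations mentioned above.

```python
import math

def margin_of_error(n, z=1.96, p=0.5, population=None):
    """Margin of error for a proportion, returned as a fraction (0.031 = ±3.1 points)."""
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: shrinks the error when the
        # sample is a sizeable share of a small population.
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

for n in (100, 300, 500, 1_000, 10_000):
    print(f"n={n:,}: ±{margin_of_error(n) * 100:.1f} percentage points")

# Hypothetical small population: 500 respondents out of 2,000 members.
print(f"±{margin_of_error(500, population=2_000) * 100:.1f} percentage points")
```

Remember that these numbers only mean something if respondents were selected at random from a sampling frame.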

If we don’t have a sampling frame and we just let anyone respond (this is known as a convenience or non-probability sample), it doesn’t matter how large the sample is: there is no known chance of selection, and therefore no way to calculate how the sample may differ from the population value of interest (sampling error).

Put another way, we cannot make inferences about the total population; we can only hope that the respondents approximate the group of interest.


Select the sample

Once you have your sampling frame and have determined the size of your required sample, you need to create a process to choose potential respondents. We prefer to use probability methods, which means ensuring that everyone in the sampling frame has the same probability of being selected in the sample. This is also known as random selection. There are several ways of doing this, from simply using a random number generator to using an interval method. This is best explained using a simple example.

Say you have a list of contact details of 100,000 people in an Excel spreadsheet and you want to choose 10,000 for a sample to achieve a final sample size of n=1,000. Keep in mind that response rates might be only 10%, so you need to over-sample to ensure you get your required n. You could do one of two things:

Random number generator

In one of the columns, use Excel’s RAND function to create a random number. Copy this down the column so each case has a random number assigned to it. You can then convert the formulas to fixed values (copy the column, then use Paste Special > Values) so the numbers don’t change when you sort, and then sort the entire database based on that random number column. You could then select the first 10,000 cases in sort order, and you will have a randomly selected sample.

Excel formula        Description
=RAND()              A random number greater than or equal to 0 and less than 1
=RAND()*100          A random number greater than or equal to 0 and less than 100
=INT(RAND()*100)     A random whole number greater than or equal to 0 and less than 100
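If your list lives outside Excel, the same idea can be sketched in a few lines of Python. This is an illustration under our own assumptions: `frame` stands in for your real contact list, and the function names and the fixed seed are hypothetical choices that simply make the draw repeatable.

```python
import math
import random

def invitations_needed(target_n, response_rate_pct):
    """How many people to invite to end up with target_n completed responses."""
    return math.ceil(target_n * 100 / response_rate_pct)

def draw_random_sample(frame, sample_size, seed=None):
    """Select sample_size cases at random, each with the same chance, without repeats."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(frame, sample_size)

# Stand-in for a real sampling frame: case IDs 1 to 100,000.
frame = list(range(1, 100_001))

n_invites = invitations_needed(target_n=1_000, response_rate_pct=10)
invited = draw_random_sample(frame, n_invites, seed=2020)
print(n_invites, len(invited))
```

Recording the seed alongside the project files means the same sample can be re-drawn later if the selection ever needs to be audited.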

Interval method

The interval method (also called systematic sampling) is similar but has the advantage of selecting the sample in proportion to whatever variation exists in the list (such as area, age or gender), provided the list is sorted by those variables. It does this by selecting every nth case, effectively taking a cross-section of the data. Using the same example as above, you would divide the number in your sampling frame (100,000) by the number you need to select (10,000 in this case, allowing for a response rate of 10%). This gives us an interval of 10.

As the name suggests, we would then select every nth case, which in this example is every 10th case. To ensure that we still have a random selection (giving everyone the same chance) we start with a random start point generated within the interval of 1 to 10. So, we generate a random number between 1 and 10 (say 7) and then select every 10th case from there, so the 7th, 17th, 27th etc.

Again, we end up with a random sample, but this time we know it maintains most of the existing proportions in the list. Although random number generators should also achieve this, you can sometimes get skews purely by chance. We thus prefer the interval method, which largely avoids this.
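The interval method is easy to sketch in code as well. The function below is a minimal illustration with names of our own choosing. It assumes the frame is already sorted by the variables you care about and, for simplicity, uses a whole-number interval, which is only exact when the frame size divides evenly by the sample size.

```python
import random

def systematic_sample(frame, sample_size, start=None, seed=None):
    """Select every nth case from frame, beginning at a random start point."""
    if not 0 < sample_size <= len(frame):
        raise ValueError("sample_size must be between 1 and the size of the frame")
    interval = len(frame) // sample_size
    if start is None:
        # Random start within the first interval (1-based position).
        start = random.Random(seed).randint(1, interval)
    positions = range(start, len(frame) + 1, interval)
    # Convert 1-based positions to list indexes and cap at sample_size.
    return [frame[pos - 1] for pos in positions][:sample_size]

# Stand-in for a sorted sampling frame: case IDs 1 to 100,000.
frame = list(range(1, 100_001))

selected = systematic_sample(frame, 10_000, start=7)
print(selected[:3])  # [7, 17, 27]
```

Leaving `start` unset draws the random start point for you, which is what you would do in a real project.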

Control the sample

Unfortunately, achieving a representative sample doesn’t end with choosing a random sample from a sampling frame. We must also think about who responds to the survey and when they respond, as we know that different demographics don’t respond to surveys in a uniform manner. For example, older people respond more quickly and at higher rates than younger people, as do females more than males. Busier people are also less likely to respond quickly than less busy people.

There is no point having 70% females in your survey if you are doing a population survey where the actual proportion is closer to 50%. This will introduce a bias to your results. So, we need to spend more time getting the right sample than just getting the biggest sample.
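A quick worked example shows how that bias creeps in. The response figures below are entirely hypothetical: suppose 60% of females and 40% of males would answer "yes" to some question.

```python
def blended_estimate(share_female, rate_female, rate_male):
    """Overall 'yes' rate for a sample with the given share of females."""
    return share_female * rate_female + (1 - share_female) * rate_male

# Hypothetical 'yes' rates, for illustration only.
RATE_FEMALE = 0.60
RATE_MALE = 0.40

true_value = blended_estimate(0.50, RATE_FEMALE, RATE_MALE)     # 50/50 population
skewed_value = blended_estimate(0.70, RATE_FEMALE, RATE_MALE)   # 70/30 sample

print(f"Population value: {true_value:.0%}")
print(f"Skewed sample:    {skewed_value:.0%}")
```

With these made-up numbers, the 70% female sample reports 54% when the population figure is 50%. Collecting more responses with the same skew would not close that gap.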

The most common way of doing this is to introduce quotas. These are controls on the number of each respondent type you will accept in your sample. The most common quotas are gender, age and area. Quotas should be set to reflect the known proportions in the population. So, for example, if we know that 50% of the population are female, then we should set a quota of 50% to ensure they are not over-represented. Males too should be set at 50% to ensure they are not under-represented, and it is likely we will be chasing them late in the survey fieldwork period. The same process applies for age and area.
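In practice, survey platforms usually handle quota control for you, but the logic is simple enough to sketch. The class below is a hypothetical illustration, not any platform's actual API: it turns population proportions into target counts and stops accepting a group once its quota is full.

```python
class QuotaTracker:
    """Track completed responses against quota targets for one variable."""

    def __init__(self, target_n, proportions):
        # e.g. target_n=1000 and {"female": 0.5, "male": 0.5} -> 500 each
        self.targets = {group: round(target_n * share)
                        for group, share in proportions.items()}
        self.counts = {group: 0 for group in proportions}

    def accept(self, group):
        """Return True and count the respondent if their quota is still open."""
        if self.counts[group] >= self.targets[group]:
            return False  # quota full: screen this respondent out
        self.counts[group] += 1
        return True

    def open_groups(self):
        """Groups we still need to chase during fieldwork."""
        return [g for g in self.targets if self.counts[g] < self.targets[g]]

tracker = QuotaTracker(1_000, {"female": 0.5, "male": 0.5})
for _ in range(500):
    tracker.accept("female")

print(tracker.accept("female"))  # False
print(tracker.open_groups())     # ['male']
```

Checking the open groups during fieldwork tells you where to aim your reminders.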

Quotas can be interlaced or non-interlaced. Interlaced just means that you have combined more than one variable into a quota, such as setting a quota of 15% males aged 18–24 living in Sydney, as opposed to just an overall (non-interlaced) quota for each of age, gender and location for the entire sample.
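The difference is easier to see with numbers. The sketch below uses hypothetical gender and age proportions and a target of n=1,000. Note one loud assumption: the interlaced cells here are built by multiplying the two sets of proportions together, which treats gender and age as independent. With real data you would take the cell proportions directly from a source such as census tables.

```python
from itertools import product

TARGET_N = 1_000

# Hypothetical population proportions, for illustration only.
GENDER = {"female": 0.50, "male": 0.50}
AGE = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Non-interlaced: one set of targets per variable, filled independently.
non_interlaced = {
    "gender": {g: round(TARGET_N * share) for g, share in GENDER.items()},
    "age": {a: round(TARGET_N * share) for a, share in AGE.items()},
}

# Interlaced: one target per gender-by-age cell.
# Assumes gender and age are independent (see the note above).
interlaced = {
    (g, a): round(TARGET_N * GENDER[g] * AGE[a])
    for g, a in product(GENDER, AGE)
}

print(non_interlaced)
for cell, target in interlaced.items():
    print(cell, target)
```

Here the non-interlaced design has five quotas to fill, while the interlaced design has six smaller cells, each of which must be filled on its own. Adding a third variable such as location multiplies the number of cells again, which is why interlaced quotas are harder to fill.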

Interlaced quotas, although preferable, are harder to achieve and usually mean higher cost, so you need to make a call depending on your budget. Either way, you will end up with some quotas full before others, and this will mean you stop accepting some respondents (usually older females) and start chasing others (usually younger males). You will need to use reminders to reach harder-to-achieve sub-samples like younger males and/or include more of them in the original sample to ensure you get enough.

Sampling is complex, so take the time to make this a considered process and manage it whilst in field.
