Fundamentals of Research: Data Cleaning

Posted by Derek Jones

Posted on April 28, 2020

Categories: Learn

Now that you’ve collected data from your respondents, it’s important that you clean your data before embarking on your analysis. Don’t just start creating cross tabs and other analyses. Take some time to look at your data. Does it make sense? Are there inconsistencies, missing data or “rubbish” answers? Here is a guide to the most common things you should be looking for.

Missing data

This is where you find holes in your data: a question has no answer for some or all of the respondents you intended to ask. This can happen for several reasons:

  • You may have made a mistake with the skip logic, inadvertently skipping respondents past a question they should have been asked (or vice versa), or
  • You didn’t make some or all of the questions mandatory and respondents chose not to answer them.

As a side note, you should try to mitigate both of these upfront: make all questions mandatory (adding a “refused” or “prefer not to answer” code to any question that might be sensitive to some respondents) and thoroughly test your skip logic before fielding. It’s also a good idea to run a hole count after the first 30 or so responses are completed to ensure that all questions are being answered as intended.

Regardless, once you have completed your fieldwork and have all your responses, you should run a hole count and check every question. A hole count is simply a frequency count of the responses to each question, with natural filters applied. A natural filter restricts a question’s responses to the respondents you intended to ask. For example, if your question was an “ASK ALL”, such as a gender question, you would apply no filter. If your question was about income and asked only of those who work, you would filter the responses to those who work. Once you have your hole count, check the responses to each question, make sure they make sense and confirm that every intended question has been asked. If not, you will need to discard, replace or go back to the respondent (re-fielding), all of which costs more time and money.
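
To make this concrete, here is a minimal sketch of a hole count in Python with pandas. The gender, employment_status and income columns and the “working” filter are hypothetical stand-ins for your own questions and natural filters.

```python
import pandas as pd

# Hypothetical survey responses; None/NaN marks an unanswered question.
df = pd.DataFrame({
    "gender": ["F", "M", "F", None, "M"],
    "employment_status": ["working", "working", "not working", "working", "working"],
    "income": [52000, None, None, 61000, 48000],
})

# "ASK ALL" question: no filter, every respondent should have an answer.
print("gender answered:", df["gender"].notna().sum(), "of", len(df))

# Filtered question: income was only asked of those who work,
# so apply the natural filter before counting holes.
workers = df[df["employment_status"] == "working"]
print("income answered:", workers["income"].notna().sum(), "of", len(workers))
```

Here the income count comes back as 3 of 4, telling you that one working respondent has a hole where an answer should be.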

Rubbish answers

Another thing to look out for is “rubbish” answers. The easiest place to spot these is in open-ended questions. Look for answers that don’t make sense. These can come in several forms:

  • Answers that are just so brief as to be suspicious, such as one-word answers to a “why” question
  • Answers that contain non-words as the respondent has just “hammered” the keyboard to get past a mandatory question
  • Answers that contain expletives, which happens more than you might think.

If you find any of these in your data set, check the other responses for that case; they may all be rubbish, in which case you might want to throw the case out and/or replace it.
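
If you want a first pass before reading through every open-end, a rough screen like the one below can help. This is a minimal sketch in Python with pandas; the why_purchased column, the expletive list and the consonant-run heuristic are all assumptions you would tune to your own questionnaire, and flagged cases still need a human eye before anything is discarded.

```python
import re
import pandas as pd

# Hypothetical open-ended responses to a "why" question.
df = pd.DataFrame({"why_purchased": [
    "Because the price was lower than the competing brands",
    "good",               # suspiciously brief for a "why" question
    "asdfghjkl qwerty",   # keyboard-hammering
    "it was damn cheap",  # contains an expletive
]})

EXPLETIVES = {"damn"}  # extend with your own word list

def looks_rubbish(text: str) -> bool:
    words = text.lower().split()
    too_brief = len(words) < 2  # one-word answer to a "why" question
    # Long runs of consonants suggest the keyboard was hammered.
    hammered = bool(re.search(r"[bcdfghjklmnpqrstvwxyz]{5,}", text.lower()))
    sweary = bool(set(words) & EXPLETIVES)
    return too_brief or hammered or sweary

df["flag_rubbish"] = df["why_purchased"].apply(looks_rubbish)
print(df)  # review flagged cases by hand before discarding or replacing them
```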

Outliers

Also check your data for outliers. These are responses at the extremes of the possible answers, particularly numerical responses such as wages, prices paid or frequencies of behaviour. Very high or low outliers can distort your mean (average) scores and skew your results. Make checking, and where justified removing, outliers a standard part of cleaning your data.
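
One common rule of thumb for flagging outliers is the interquartile range (IQR) rule. Below is a minimal sketch in Python with pandas; the hours-per-week figures are invented, and the 1.5 × IQR multiplier is a convention rather than a law, so inspect flagged values before removing them.

```python
import pandas as pd

# Hypothetical numeric responses, e.g. hours of TV watched per week.
hours = pd.Series([4, 6, 5, 7, 8, 5, 6, 90])  # 90 looks suspicious

q1, q3 = hours.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = hours[(hours < low) | (hours > high)]
print(outliers)                           # inspect before deciding to remove
print(hours.mean())                       # the mean with the outlier included
print(hours.drop(outliers.index).mean())  # and with it removed
```

In this invented example the single value of 90 drags the mean from about 5.9 up to 16.4, which is exactly the kind of distortion you are checking for.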

Speeding tickets

Another thing to look out for is “speeding”. This is when a respondent has completed a survey so quickly that you become suspicious about the level of consideration given to their answers. They may have just clicked anything to get through.

This is especially important to look for when incentives are offered for completing a survey, such as on permission-based panels that provide incentivised respondents for market research. Some respondents may be tempted to click through with minimum effort just to receive their incentive. It’s not common, but it does happen.

Most permission-based respondent panels will welcome you reporting these respondents, so they are warned and potentially excluded from participating in future paid surveys. The panels will usually offer you a replacement as well or, at a minimum, not charge you for that response.


Some good ways to identify speeding include the following (a sketch combining both checks appears after the list):

  • Checking for survey “straightlining”. That’s where a respondent gives the same answer to a number of statements within a grid question such as, “how much do you agree or disagree with each of the following statements?”
  • Adding timestamps at the beginning and end of the survey so you can filter and check any surveys completed in a time you would consider too fast.
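
Here is a minimal sketch of both checks in Python with pandas. The agree_q1 to agree_q5 grid columns, the timestamp fields and the five-minute threshold are all hypothetical; what counts as “too fast” depends on the length of your questionnaire.

```python
import pandas as pd

# Hypothetical data: a five-statement agreement grid plus start/end timestamps.
df = pd.DataFrame({
    "agree_q1": [4, 3, 3], "agree_q2": [4, 5, 3], "agree_q3": [4, 2, 3],
    "agree_q4": [4, 4, 3], "agree_q5": [4, 1, 3],
    "started":  pd.to_datetime(["2020-04-28 09:00", "2020-04-28 09:05",
                                "2020-04-28 09:10"]),
    "finished": pd.to_datetime(["2020-04-28 09:02", "2020-04-28 09:25",
                                "2020-04-28 09:12"]),
})

grid = df[["agree_q1", "agree_q2", "agree_q3", "agree_q4", "agree_q5"]]

# Straightlining: the same answer given to every statement in the grid.
df["straightlined"] = grid.nunique(axis=1) == 1

# Speeding: completed faster than a threshold you consider realistic.
df["too_fast"] = (df["finished"] - df["started"]) < pd.Timedelta(minutes=5)

print(df[["straightlined", "too_fast"]])  # review flagged cases by hand
```

A case that trips both flags is a much stronger candidate for removal than one that trips either alone.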

Culling and oversampling

All this cleaning can lead to you excluding a number of cases, so it’s a good idea from the get-go to oversample by at least 10% to allow for their removal. For example, if you need 400 completed surveys, field at least 440.

A lot of these issues can be mitigated with good testing (including hole counts) and a considered setup (e.g. making questions mandatory), but no matter how careful you are, you will always need to check and clean your final data set. It’s an important step, so make sure you include it in your research process.

