Tidy data for librarians

Formatting data tables in Spreadsheets

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • How should data tables be formatted in spreadsheets?

Objectives
  • Describe best practices for data entry and formatting in spreadsheets.

  • Apply best practices to arrange variables and observations in a spreadsheet.

The most common mistake made is treating the program like it is a notebook

Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal).

This is why it’s extremely important to set up well-formatted tables from the outset

Note: the best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is important to take this into account, and ideally automate the conversion from one to another.

Keeping track of your analyses

When you’re working with spreadsheets, during data clean up or analyses, it’s very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when your leadership team ask for a different analysis, you must

This might be an example of a spreadsheet setup:

spreadsheet setup

Put these principles in to practice today during your Exercises.

Structuring data in spreadsheets

The cardinal rules of using spreadsheet programs for data:

  1. Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’.
  2. Put each observation in its own row.
  3. Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
  4. Leave the raw data raw - don’t mess with it!
  5. Export the cleaned data to a text based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.

For instance, we have data from a survey of small mammals in a desert ecosystem. Different people have gone to the field and entered data in to a spreadsheet. They keep track of things like species, plot, weight, sex and date collected.

If they were to keep track of the data like this:

multiple-info example

the problem is that species and sex are in the same field. So, if they wanted to look at all of one species or look at different weight distributions by sex, it would be hard to set up the data to do this. If instead we put sex and species in different columns, you can see that it would be much easier.

Columns for variables and rows for observations

The rule of thumb, when setting up a datasheet, is columns = variables, rows = observations, cells = data (values).

So, instead we should have:

single-info example

Exercise

We’re going to take a messy version of some library training data and clean it up

Important

Do not forget of our first piece of advice: create a new file (or tab) for the cleaned data, never modify the original (raw) data.

After you go through this exercise, we’ll discuss as a group what you think was wrong with this data and how you fixed it.

An excellent reference, in particular with regard to R scripting is

Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.

Key Points