Sunday, September 27, 2015

A Data Cleaning Example

For this particular example,

  • the variables of interest are stored as key:value pairs and
  • a single data cell could contain multiple (unknown) number of key:value pairs.
Basically, we want to convert input dataset on LHS to the output dataset on the RHS as illustrated in the graphic below -



The objective is to separate these key-value pairs and store the values in corresponding key columns.

The hadleyverse packages make this task a fairly simple one, especially tidyr, stringr and magrittr.

3 comments:

  1. A professional advice is always appreciated when needed and I have bookmarked your article
    carpet cleaning geelong
    furniture cleaning geelong

    ReplyDelete
  2. Cool you write, the information is very good and interesting, I'll give you a link to my site.
    best vacuums for hardwood floors
    best wet-dry vacuum

    ReplyDelete
  3. Thanks steadyfish for this, I've been looking for something like this. How would you handle data that has e.g. [ 7 breads, 5 pens, 10 eggs] as entry for products column and quantity in one cell under a particular shop. Then you'll need to use 'mutate' to create a revenue column e.g. Mutate(mydata, revenue=total product* price).

    Assuming you have a thousands shop each reporting different products.

    I am confused as to how to clean such messy data

    ReplyDelete