Sunday, September 27, 2015

A Data Cleaning Example

For this particular example,

  • the variables of interest are stored as key:value pairs and
  • a single data cell can contain an unknown number of key:value pairs.
Basically, we want to convert the input dataset on the LHS to the output dataset on the RHS, as illustrated in the graphic below -

The objective is to separate these key-value pairs and store the values in corresponding key columns.

The hadleyverse packages make this task a fairly simple one, especially tidyr, stringr and magrittr.
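A minimal sketch of the idea using tidyr is shown below. The input data, the ";" separator between pairs and the ":" separator between key and value are all assumptions made for illustration, and the functions used are from current tidyr rather than necessarily the ones in the original code.

```r
library(tidyr)

# Hypothetical input: each cell of `info` packs an unknown number of
# "key:value" pairs, separated by ";" (both separators are assumptions).
raw <- data.frame(
  id = 1:3,
  info = c("height:170;weight:65", "weight:80", "height:160;age:30;weight:55"),
  stringsAsFactors = FALSE
)

# Step 1: one key:value pair per row.
long <- separate_rows(raw, info, sep = ";")

# Step 2: split each pair into a key column and a value column.
long <- separate(long, info, into = c("key", "value"), sep = ":")

# Step 3: one column per key; keys missing for a record become NA.
wide <- as.data.frame(pivot_wider(long, names_from = key, values_from = value))
print(wide)
```

The values come out as character strings, so numeric columns would still need converting (e.g. with as.numeric()) before analysis.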

Thursday, August 13, 2015

Survival Analysis - 2

In my previous post, I went over the basics of survival analysis, including computing the Kaplan-Meier (KM) estimate for given time-to-event data. In this post, I'm exploring Cox's proportional hazards model for survival data. The KM estimator helps in figuring out whether the survival function estimates for different groups are the same or different, while survival models like Cox's proportional hazards model help in finding the relationship between different covariates and the survival function.

Some basic notes about the Cox model -

  • It's a semi-parametric model, as the baseline hazard function (risk per unit time) does not need to be specified.
  • The proportional hazards condition states that the covariates are multiplicatively related to the hazard. (This assumption should always be checked after fitting a Cox model.)
  • In the case of categorical covariates, the hazard ratio (e.g. treatment vs no treatment) is constant and does not change over time. One can do away with this assumption by using the extended Cox model, which allows covariates to depend on time.
  • The covariates are constant for each subject and do not vary over time.
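As a rough illustration of the notes above, the sketch below fits a Cox model and checks the proportional hazards assumption with the survival package. It uses the lung dataset that ships with survival as a stand-in (an assumption; it is not one of the datasets used in this post), and the choice of covariates is arbitrary.

```r
library(survival)

# Assumption: the built-in `lung` dataset stands in for the post's data.
# time = survival time in days, status = censoring indicator (1 = censored, 2 = dead).
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# exp(coef) is the multiplicative effect of each covariate on the hazard,
# i.e. the hazard ratio for a one-unit increase in that covariate.
hazard_ratios <- exp(coef(fit))
print(hazard_ratios)

# Test the proportional hazards assumption using scaled Schoenfeld residuals;
# small p-values suggest the assumption is violated for that covariate.
ph_check <- cox.zph(fit)
print(ph_check)
```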

(There's a one-to-one mapping between the hazard function and the survival function, i.e. a specific hazard function uniquely determines the survival function and vice versa. Simple mathematical details on this relationship can be found on this Wikipedia page.)
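For reference, this mapping can be written out as follows. The notation is the standard one and is not taken from the post: h(t) is the hazard function, H(t) the cumulative hazard, S(t) the survival function, h_0 and S_0 the baseline versions, x the covariate vector and \beta the coefficients.

```latex
H(t) = \int_0^t h(u)\,du, \qquad
S(t) = \exp\bigl(-H(t)\bigr), \qquad
h(t) = -\frac{d}{dt}\log S(t)

% Cox model: covariates act multiplicatively on the baseline hazard,
% which translates into a power of the baseline survival function.
h(t \mid x) = h_0(t)\,\exp(\beta^\top x)
\quad\Longrightarrow\quad
S(t \mid x) = S_0(t)^{\exp(\beta^\top x)}
```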

I'm using the same datasets (the tongue dataset from package KMsurv and a simulated dataset created using survsim) and the same set of packages as in the previous post - OIsurv, dplyr, ggplot2 and broom.

Sunday, August 2, 2015

Survival Analysis - 1

I was recently looking for methods to apply to time-to-event data and started exploring survival analysis models. In this post, I'm exploring the basic Kaplan-Meier (KM) estimator. It is a nonparametric estimator of the survival function. There are a couple of instances when the KM estimator comes in handy -
  • When the survival time is censored
  • Comparing survival function for different preassigned groups.

Below I'm computing the KM estimator for a real dataset (on time to death for 80 males who were diagnosed with different types of tongue cancer, from package KMsurv) and a simulated dataset (created using package survsim). In addition, I am using survival, OIsurv, dplyr, ggplot2 and broom for this analysis. The first example is taken from an OpenIntro tutorial.
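A minimal sketch of both uses of the KM estimator with the survival package is shown below. To keep it self-contained it uses the lung dataset bundled with survival (an assumption; the post itself uses the tongue and simulated datasets), with sex as the preassigned grouping.

```r
library(survival)

# Assumption: the built-in `lung` dataset stands in for the post's data.
# Surv() pairs each survival time with its censoring status.
km <- survfit(Surv(time, status) ~ sex, data = lung)

# Per-group summary: number of subjects, events and median survival time.
print(km)

# Log-rank test: are the survival functions of the two groups different?
logrank <- survdiff(Surv(time, status) ~ sex, data = lung)
print(logrank)
```

Calling plot(km) draws the estimated survival curves, one step function per group.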

The R Markdown document illustrating the analysis below can also be found here. In future posts, I'm planning to explore the following survival models -
  • Proportional hazards model
  • Accelerated failure time Model
  • Multiple events model (More than 2 possible events)
  • Recurring events (Each subject can experience an event multiple times).

Monday, June 8, 2015

ogdindiar: R package to easily access Open Government Data from India Portal

Following up on my earlier posts on accessing Open Government Data from R, I've wrapped this code into an R package - ogdindiar. It's available on GitHub at

It provides one simple function, fetch_data(), to download the required data resource from the portal. You can find details about its usage in this vignette.

Below is an example that downloads India's annual and seasonal mean temperature data using this package. You can also see it here.

Monday, March 16, 2015

dplyr Use Cases: Non-Interactive Mode

The current release of dplyr (v 0.4.1) offers a lot more flexibility in using the important verbs in non-interactive mode. In this post, I'm exploring the different possible use cases.

  • group_by_, select_, rename_:
For group_by_, select_ and rename_, we can pass a character vector of variable names as an argument to the .dots parameter.

  • filter_:
To use the filter_ function, we need to pass the filter criteria to the .dots parameter. The criteria can be created using the lazyeval::interp function.

  • mutate_, transmute_, summarise_:
We need to provide 2 things to these functions - a list of functions to be applied to the input variables (with the corresponding input variables) and a character vector of output variable names. These 2 things can be passed to the .dots argument using a combination of the lazyeval::interp and setNames functions.

  • joins:
For the 2-table verbs, there's no *_join_ function and we don't need one for general purposes. We can just pass a named vector to the by argument. The setNames function comes in handy while doing this.
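The join case can be sketched as follows. The data frames and column names are made up for illustration; the key column names are held in character variables, as they would be in non-interactive code.

```r
library(dplyr)

# Hypothetical tables whose key columns have different names.
left  <- data.frame(id = 1:3, x = c("a", "b", "c"), stringsAsFactors = FALSE)
right <- data.frame(key = c(1L, 3L), y = c(10, 30))

# Column names arrive as strings, e.g. as function arguments.
left_col  <- "id"
right_col <- "key"

# setNames(right_col, left_col) builds the named vector c(id = "key"),
# which tells the join to match left$id against right$key.
joined <- inner_join(left, right, by = setNames(right_col, left_col))
print(joined)
```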

The R Code for the above mentioned use cases is shown below and can also be found on this GitHub Gist.

Tuesday, April 15, 2014

Accessing Open Data Portal (India) using APIs

EDIT: I've wrapped up this code into an R package. You can find more info about it on this blog post and here on GitHub.

As I mentioned in my previous blog post, the Government of India has started an Open Data Portal for making various data public. Most of the datasets on the portal are available for manual download. Some of the datasets, though, can also be accessed using APIs. In this post, I'll go over how to access that data using APIs (specifically the JSON API) in R.

Again, the variety of R packages available makes this a not so difficult task. I've mainly made use of these packages - XML, RCurl, RJSONIO, plyr.

The complete process can be described in the following steps -
  1. Get the resource id and API key needed to make the API call
  2. Recursively call the API until all the data is obtained
  3. Append all the data to create a single dataset.
Now, I'll describe each of the above steps in detail. The resource id is the identifier for the dataset and can be found on the website (e.g. resource-id 6176ee09-3d56-4a3b-8115-21841576b2f6 refers to the dataset on pin-code details). Another mandatory detail when making an API call is the API key. This key can be obtained by signing up on this data portal. Thus, the API URL would look something like this -<your API key>

The content of this URL can be downloaded into R using the getURL() function. Currently, there's a limit of 100 elements that can be downloaded in a single API call. This necessitates the 2nd step - making recursive API calls until all elements have been downloaded. To accomplish this, we can add an offset parameter to the URL. The URL would now look like -<your API key>&offset=1

Here offset signifies the number of calls already made. For example, if each call downloads 100 data elements, then after downloading the 1st set of 100 elements, we'd specify offset=1 to download elements 101-200.

The data obtained from each API call can be converted to a data.frame using ldply(), and the resulting data.frames can be combined into a master data.frame using rbind().
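The paging logic described above can be sketched in base R as follows. To keep the example runnable without an API key, fetch_page() is a hypothetical stand-in for the real API call (it serves pages from a made-up data.frame), and the sketch uses a simple loop rather than recursion.

```r
# Made-up records standing in for the remote dataset.
fake_records <- data.frame(id = 1:250, value = (1:250) * 2)

# Hypothetical stand-in for one API call: returns at most `limit` records,
# where offset 0 gives records 1-100, offset 1 gives 101-200, and so on.
fetch_page <- function(offset, limit = 100) {
  start <- offset * limit + 1
  if (start > nrow(fake_records)) return(fake_records[0, ])
  end <- min(start + limit - 1, nrow(fake_records))
  fake_records[start:end, ]
}

# Keep requesting pages until an empty one comes back.
pages <- list()
offset <- 0
repeat {
  page <- fetch_page(offset)
  if (nrow(page) == 0) break
  pages[[length(pages) + 1]] <- page
  offset <- offset + 1
}

# Combine all pages into a single master data.frame.
all_data <- do.call(rbind, pages)
print(dim(all_data))
```

Collecting the pages in a list and calling rbind() once at the end avoids repeatedly copying a growing data.frame inside the loop.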

The following GitHub Gist contains the actual R code. You can also look at my GitHub project to properly understand the directory structure used in the code.

Sunday, February 2, 2014

Know India through Visualisations - 1

I'm going to produce just a couple of charts, a teaser of sorts in this post. In the forthcoming posts I'll dig deeper.

I was amazed by the existing list of R packages that let you work with spatial data without needing to get into many of the technical details. The R packages I've used are described along with the code.

I've obtained the state-level power supply position data for November 2004 (just a random choice) from the Government of India's data portal. The spatial data for India with state boundaries was obtained from the Global Administrative Areas website.

The above plot is generated using the spplot() function from the sp package; below is a similar plot generated using the ggplot() function from the ggplot2 package. In the plot, darker shades of blue signify a more severe electricity shortage and lighter shades a less severe one, as can be seen from the legend. The numbers in the legend are in MU, i.e. Million Units (1 MU is equivalent to 1 gigawatt-hour).

The advantage of using ggplot() is that I can easily add additional layers onto this map. For example, I can add labels for the states, as can be seen below.

The R code for this post is shown below and can also be found on this GitHub Gist.