Just Another Data Blog: 2014

Tuesday, April 15, 2014

Accessing Open Data Portal (India) using APIs

EDIT: I've wrapped this code up into an R package. You can find more info about it in this blog post and here on GitHub.

As I mentioned in my previous blog post, the Government of India has started an Open Data Portal to make various data public. Most of the data-sets on the portal are available for manual download. Some of them, though, can also be accessed using APIs. In this post, I'll go over how to access that data using the APIs (specifically the JSON API) in R.

Again, the variety of R packages available makes this a fairly easy task. I've mainly made use of these packages: XML, RCurl, RJSONIO and plyr.

The complete process can be described in the following steps:

1. Get the resource id and API key needed to make the API call.
2. Repeatedly call the API until all the data is obtained.
3. Append all the data to create a single data-set.

Now I'll describe each of the above steps in detail. The resource id is the identifier for the dataset and can be found on the website (e.g. resource id 6176ee09-3d56-4a3b-8115-21841576b2f6 refers to the dataset on pin-code details). Another mandatory detail when making an API call is the API key. This key can be obtained by signing up on the data portal. Thus, the API URL would look something like this -

http://data.gov.in/api/datastore/resource.json?resource_id=6176ee09-3d56-4a3b-8115-21841576b2f6&api-key=<your API key>

The content of this URL can be downloaded into R using the getURL() function from RCurl. Currently, there's a limit of 100 elements that can be downloaded in a single API call. This necessitates the 2nd step: making repeated API calls until all elements have been downloaded. To accomplish this we can add an offset parameter to the URL. The URL would now look like -

http://data.gov.in/api/datastore/resource.json?resource_id=6176ee09-3d56-4a3b-8115-21841576b2f6&api-key=<your API key>&offset=1

Here offset signifies the number of calls already made, i.e. how many blocks of elements to skip. E.g. if each call downloads 100 data elements, then after downloading the 1st set of 100 elements we'd specify offset=1 to download elements 101-200.
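As a minimal sketch of how the offset parameter changes the URL, the helper below pastes the pieces together. build_api_url is a hypothetical name (it is not part of the original gist), YOUR_API_KEY is a placeholder, and the limit parameter is taken from the code further down in this post.

```r
# Hypothetical helper (not from the original gist): builds the paged API URL.
build_api_url <- function(res_id, api_key, offset = 0, limit = 100,
                          base = "http://data.gov.in/api/datastore/resource.json?") {
  paste0(base,
         "resource_id=", res_id,
         "&api-key=", api_key,
         "&offset=", offset,
         "&limit=", limit)
}

# Offsets 0, 1 and 2 correspond to elements 1-100, 101-200 and 201-300.
urls <- vapply(0:2, function(i) {
  build_api_url("6176ee09-3d56-4a3b-8115-21841576b2f6", "YOUR_API_KEY", offset = i)
}, character(1))
print(urls)
```

Each of these URLs could then be handed to getURL() in turn.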

The data obtained from each API call can be converted to a data.frame using ldply() from plyr, and the resulting data.frames can be combined into a master data.frame using rbind().
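The paging-and-combining idea can be tried out without an API key. The sketch below is only an illustration under stated assumptions: fake_fetch is a made-up stand-in for the real API call (it serves 250 invented records), and base R's do.call(rbind, lapply(...)) is used in place of plyr's ldply() so that the example has no dependencies.

```r
# Stand-in for the real API call (assumption): serves 250 fake records in
# pages of `limit`, returning a count and a list of records per page.
fake_fetch <- function(offset, limit = 100, total = 250) {
  start <- offset * limit + 1
  if (start > total) return(list(count = 0, records = list()))
  ids <- start:min(start + limit - 1, total)
  list(count = length(ids),
       records = lapply(ids, function(i) list(id = i, pincode = 110000 + i)))
}

# Convert one record (a named list) into a one-row data.frame.
record_to_df <- function(rec) {
  as.data.frame(rec, stringsAsFactors = FALSE)
}

# Keep requesting pages until an empty one comes back, then bind all pages.
fetch_all <- function(fetch) {
  pages <- list()
  offset <- 0
  repeat {
    page <- fetch(offset)
    if (page$count == 0) break
    pages[[length(pages) + 1]] <- do.call(rbind, lapply(page$records, record_to_df))
    offset <- offset + 1
  }
  do.call(rbind, pages)
}

all_data <- fetch_all(fake_fetch)
print(dim(all_data))
```

Collecting the pages in a list and binding them once at the end avoids growing the master data.frame with rbind() on every iteration, which can get slow for large data-sets.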

The following GitHub Gist contains the actual R code. You can also look at my GitHub project to properly understand the directory structure used in the code.

	####Download the data from Government of India open data portal#####
	w_dir = getwd()
	source(file=file.path(w_dir,"Code/Core.R"))
	checkAndDownload(c("XML","RCurl","RJSONIO","plyr"))

	### Alternative - 1: Using APIs ###
	#JSON#
	getJSONDoc <- function(link, res_id, api_key, offset, no_elements){
		jsonURL = paste(link,
			"resource_id=",res_id,
			"&api-key=",api_key,
			"&offset=",offset,
			"&limit=",no_elements,
			sep="")
		print(jsonURL)
		doc = getURL(jsonURL)
		fromJSON(doc)
	}
	getFieldNames <- function(t){
		#t: list
		names(t[[4]])
	}
	getCount <- function(t){
		#t: list
		t[[3]]
	}
	getFieldType<-function(t){
		t[[4]]
	}
	getData <- function(t){
		t[[5]]
	}
	toDataFrame <- function(lst_elmnt){
		as.data.frame(t(unlist(lst_elmnt)), stringsAsFactors = FALSE)
	}
	acquire_x_data <- function(x,res_id,api_key){
		currentItr = 0
		returnCount = 1
		while(returnCount>0){
			JSONList = getJSONDoc(link="http://data.gov.in/api/datastore/resource.json?",
				res_id=res_id,
				api_key=api_key,
				offset=currentItr,
				no_elements=100)
			DataStage1 = ldply(getData(JSONList),toDataFrame)
			print(currentItr)
			print(is(DataStage1$id))
			returnCount = getCount(JSONList)
			if(currentItr == 0) {
				returnData = DataStage1
				returnFieldType = ldply(getFieldType(JSONList),toDataFrame)
			} else if(returnCount > 0) returnData = rbind(returnData, DataStage1)
			print(currentItr)
			print(is(returnData$id))
			currentItr = currentItr + 1
		}
		list(returnData,returnFieldType)
	}



	#get the resource list file
	#(it has resource names and resource ids used for the API call)
	resourceList = read.table(
	file=file.path(w_dir,"Data/goi_api_resource_details.csv"),
	header=TRUE,
	sep=",",
	as.is=TRUE)

	api_key = read.table(
	file=file.path(w_dir,"Data/goi_api_key_do_not_share.csv"),
	header=TRUE,
	sep=",",
	as.is=TRUE)

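	#make the API call
	#(single values are extracted so that paste() receives plain strings rather than data.frames;
	# this assumes the id sits in the 2nd column and the key in the first cell of its file)
	res = subset(resourceList, resource_name == "pincode")
	pincodeDetails = acquire_x_data(x = res[1, 1], res_id = res[1, 2], api_key = api_key[1, 1])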

	save(pincodeDetails, file=file.path(w_dir,"Data/pincodeDetails.RData"))


Sunday, February 2, 2014

Know India through Visualisations - 1

In this post I'm going to produce just a couple of charts, a teaser of sorts. In the forthcoming posts I'll dig deeper.

I was amazed by the existing R packages for working with spatial data, which let me get started without getting into many of the technical details. The R packages I've used are described along with the code.

I've obtained the state-level power supply position data for November 2004 (just a random choice) from the data portal of the Government of India. The spatial data for India with state boundaries was obtained from the Global Administrative Areas website.

The plot above was generated using the spplot() function from the sp package; below is a similar plot generated using the ggplot() function from the ggplot2 package. In the plot, darker shades of blue signify a more severe electricity shortage and lighter shades a less severe one, as can be seen from the legend. The numbers in the legend are in MU, i.e. Million Units (1 MU is equivalent to 1 gigawatt hour).
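The severity classes behind the legend are quantile bins computed on a log scale. Here is a small sketch of that idea in base R; the net_surplus figures are made up purely for illustration, and the round() call is my own addition to guard against floating-point drift at the end points.

```r
# Made-up net surplus figures in MU (negative = shortage); not real data.
net_surplus <- c(-820, -450, -300, -120, -60, -25, -10, -3, 0, 0)

# Quantiles are taken on a log scale so that a few very large deficits
# do not push almost every state into the same class.
log_deficit <- log(-net_surplus + 1)

# Map the quantiles back to the original scale. Rounding (an addition for
# this sketch) keeps the extreme values inside the outermost bins.
breaks <- round(1 - exp(quantile(log_deficit)), 6)

# cut() sorts the break points, so their descending order does not matter.
severity <- cut(net_surplus, breaks, include.lowest = TRUE)
print(table(severity))
```

The first factor level holds the largest deficits, which is why the colour palette is reversed in the code so that those states get the darkest shade.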

The advantage of using ggplot() is that I can easily add additional layers onto this map. E.g. I can add labels for the states, as can be seen below.
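A label layer needs one position per state. As a rough sketch of one way to get those positions, the snippet below takes the midpoint of each region's bounding box. The outline data.frame is toy data with two invented regions; only the column names long, lat and Code mirror the ones used in the code for this post.

```r
# Toy outline points for two made-up regions.
outline <- data.frame(
  long = c(0, 4, 4, 0, 10, 16, 16, 10),
  lat  = c(0, 0, 2, 2, 5, 5, 9, 9),
  Code = rep(c("AA", "BB"), each = 4),
  stringsAsFactors = FALSE
)

# One label position per region: the midpoint of its bounding box.
label_pos <- aggregate(cbind(long, lat) ~ Code, data = outline,
                       FUN = function(x) mean(range(x)))
print(label_pos)
```

A data.frame like label_pos can then be passed to geom_text(data = label_pos, aes(long, lat, label = Code)) as an extra layer on top of the polygons.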

The R Code for this post is shown below and can also be found on this GitHub Gist.

	######## Packages ########
	checkAndDownload<-function(packageNames) {
		for(packageName in packageNames) {
			if(!isInstalled(packageName)) {
				install.packages(packageName,repos="http://lib.stat.cmu.edu/R/CRAN")
			}
			library(packageName,character.only=TRUE,quietly=TRUE,verbose=FALSE)
		}
	}

	isInstalled <- function(mypkg){
		is.element(mypkg, installed.packages()[,1])
	}

	packages <- c("sp","ggplot2","plyr","rgeos","maptools","sqldf","RColorBrewer")
	checkAndDownload(packages)
	# 'sp' a package for spatial data
	# 'plyr' required for fortify which converts 'sp' data to
	#polygons data to be used with ggplot2
	# 'rgeos' required for maptools
	# 'maptools' required for fortify - region


	################ Data Input ##############

	#get the nation-wide map data with state boundaries
	con <- url("http://biogeo.ucdavis.edu/data/gadm2/R/IND_adm1.RData")
	load(con)
	close(con)

	#get the data from the Indian government data website about the power supply position for a particular month
	data_url = "http://data.gov.in/access-point-download-count?url=http://data.gov.in/sites/default/files/powerSupplyNov04.csv&nid=34321"
	nov04 <- read.table(
	file=data_url,
	header=TRUE,
	sep=",",
	as.is=TRUE)
	power = nov04

	#get the state codes file
	#(data dictionary with Indian states mapped to a 2 letter state code)
	stateCodes <- read.table(
	file="D:/JustAnotherDataBlog/Data/IndiaStateCodes.csv",
	header=TRUE,
	sep=",",
	as.is=TRUE)

	########## Adhoc Data cleaning #########

	#renaming the important columns
	colnames(power)
	colnames(power)[1] = "State"
	colnames(power)[2] = "Demand"
	colnames(power)[3] = "Supplies"
	colnames(power)[4] = "Net_Surplus" #underscore so the column can be referenced as Net_Surplus below
	colnames(power)
	#adjusting power supply position in these states based on geographic area
	#area numbers obtained from wikipedia
	WB_Area <- 88752
	Sikkim_Area <- 7096
	Jharkhand_Area <-79714
	WB_Sikkim_Prop <- WB_Area / (WB_Area + Sikkim_Area)
	WB_Jharkhand_Prop <- WB_Area / (WB_Area + Jharkhand_Area)

	#1.West Bengal and Sikkim to be split in some ratio:ratio of areas
	WB1 <- power[power[, "State" ]== "W B + Sikkim" , -1] * WB_Sikkim_Prop
	SK <- power[power[, "State" ]== "W B + Sikkim" , -1] - WB1

	#2.DVC numbers to be divided between Jharkhand and West Bengal: by ratio of areas
	WB2 <- power[power[, "State" ]== "DVC" , -1] * WB_Jharkhand_Prop
	power[power[, "State" ]== "Jharkhand" , -1] <- power[power[, "State" ]== "Jharkhand" , -1] +
	power[power[, "State" ]== "DVC" , -1] - WB2
	WB = WB1 + WB2

	#3. Tripura: not present in some of the datasets

	#4. Andaman and Nicobar, Lakshadweep not part of the analysis

	power = rbind(power,cbind(State="West Bengal",WB),cbind(State="Sikkim",SK))


	######### Preparing data for visualisation ###########

	#merge power data with state codes file
	power_stateCodes <- sqldf("select a.*, b.* from stateCodes as a inner join power as b on a.State = b.State")

	#cross check
	sum(power_stateCodes$Demand) == power[power[, "State" ]== "All India" , 2]
	power_stateCodes$Net_check = power_stateCodes$Supplies - power_stateCodes$Demand
	sum(power_stateCodes$Net_check - power_stateCodes$Net_Surplus)

	as.data.frame(gadm)
	gadm@data
	states <- as.data.frame(gadm@data$NAME_1)
	colnames(states) ="State_gadm"
	states_stateCodes <- sqldf("select a.*, b.* from states as a
		left join stateCodes as b on a.State_gadm = b.State")

	power_states <- sqldf("select a.State_gadm, a.Code, b.Demand,
	b.Supplies, b.Net_Surplus from states_stateCodes as a
	left join power_stateCodes as b on a.Code = b.Code")
	power_states$log_Deficit <- log(-power_states$Net_Surplus+1)
	breaks = quantile(power_states$log_Deficit,na.rm=TRUE)
	breaks = 1 - exp(breaks)
	power_states$Severity = cut(power_states$Net_Surplus,breaks,include.lowest =TRUE)
	#with(power_states,power_states[Code == 'MH',])
	#crosscheck as we can't merge using name
	sum(gadm@data$NAME_1==power_states$State_gadm)==nrow(gadm@data)

	gadm <- spCbind(gadm,power_states)
	plotclr <- rev(brewer.pal(length(levels(gadm@data$Severity)),"Blues"))

	#using spplot
	png(file="D:/JustAnotherDataBlog/Plots/FirstPost_PS_Pos_Nov_04_Ind_spplot.png",width=500,height=400)

	#print() is needed for the plot to be drawn when the script is run via source()
	print(spplot(gadm, "Severity",
		col.regions=plotclr,
		main="India: Net Power Supply Position for Nov 2004 (MU)"))#, col="transparent")

	dev.off()

	#using ggplot

	india <- fortify(gadm, region = "NAME_1")
	#india.df <- join(india, gadm@data, by="NAME_1") doesn't work because id cols are different
	names(india)
	temp <- gadm@data
	india.df <- sqldf("select a.* ,b.* from india as a
	left join temp as b on a.id = b.Name_1")#it's not case sensitive
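	#label positions for the state codes: midpoint of each state's bounding box...
	S_Code <- aggregate(cbind(long, lat) ~ Code, data=india.df, FUN=function(x)mean(range(x)))
	#...which is then overridden by the mean of each state's outline points (k-means with a single centre)
	S_Code <-ddply(india.df, .(Code), function(df) kmeans(df[,1:2], centers=1)$centers)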
	png(file="D:/JustAnotherDataBlog/Plots/FirstPost_PS_Pos_Nov_04_Ind_ggplot.png",width=500,height=400)

	p <- ggplot(india.df,aes(long,lat)) +
		geom_polygon(aes(group=group,fill=Severity),color='white') +
		geom_text(data=S_Code,aes(long,lat,label=Code), size=2.5) +
		#geom_path(color="white") +
		coord_equal() +
		scale_fill_brewer() +
		ggtitle("India: Net Power Supply Position for Nov 2004 (MU)")
	#print() is needed for the plot to be drawn when the script is run via source()
	print(p)
	dev.off()


Saturday, February 1, 2014

Introduction

Through this blog I intend to work on some data analysis projects, publish the results here and get feedback from other data experts. I have exposure to R, Python and MATLAB, and will use whichever of them suits the task at hand. My interests range from web scraping to data visualization to statistical modelling.