This post sponsored by Nappy Roots and Greg Steet…solid coding tunes.
I’ve been finding the package ‘RODBC’ insanely useful for integrating data management with data analysis.
A basic SQL query carried out through R looks something like this:
require(RODBC) channel = odbcConnect(dsn=FakeServerName;, uid=FakeUID, pwd=FakePW) sqlcmd = paste("SELECT *", "FROM [FakeDataBaseName].[dbo].[fake_table_name]") df = sqlQuery(channel,sqlcmd) close(channel)
Admittedly, part of the reason I like this particular workflow (pulling data directly from the server into R) is because I have a sizable code investment in data analysis and data visualization in R. However, I have noticed that some basic data-massaging (such as merging two tabled with a common field seem to work a little faster when carried out as an SQL join versus pulling two data frames in R and merging them).
An example (Unfortunately, the databases I work most often with are not open access so I’ll have to do this in pseudo-code fashion):
Meta Setup
- Microsoft SQL Server Management Studio…which I use primarily to explore database tables, fields, data types, etc.
- R version 3.0.2 “Frisbee Sailing”
- Basic hardware running Windows 7
Setup
- A database called FISHING with several tables containing information on commercial fishing trips taken along the U.S. West Coast living on a server called server1.
- A table called fishing_trips containing a vessel identifier, a trip identifier, and information on various fish species caught on that fishing trip.
- A second table called coastguard_info containing a vessel identifier and information about each vessel.
Suppose fishing_trips looks like:
vessel_id | trip_id | depart_port | return_port | fishery |
---|---|---|---|---|
1 | 1 | zebra | zebra | salmon |
1 | 2 | zebra | zebra | crab |
2 | 1 | waylon | jennings | crab |
3 | 1 | lagunitas | lagunitas | albacore |
3 | 2 | lagunitas | rogue | crab |
3 | 3 | rogue | rogue | albacore |
and coastguard_info look like:
vessel_id | length | weight |
---|---|---|
1 | 30 | 19 |
2 | 34 | 17 |
3 | 45 | 20 |
…and if you’re wondering where those port names came from that would be all the stuff starring at me from my coffee table at the moment.
The Data Import
The key to joining these tables server side and importing the merged table into R is to use R’s ‘paste’ function to construct a text string for the SQL query. In this, that string is constructed as,
sqlcmd=paste("SELECT [FISHING].[dbo].[fishing_trips].[vessel_id]", ",", "[FISHING].[dbo].[fishing_trips].[trip_id]",",", "[FISHING].[dbo].[fishing_trips].[depart_port]",",", "[FISHING].[dbo].[fishing_trips].[return_port]",",", "[FISHING].[dbo].[fishing_trips].[fishery]",",", "[FISHING].[dbo].[coastguard_info].[length]", "FROM [FISHING].[dbo].[fishing_trips]", "INNER JOIN [FISHING].[dbo].[coastguard_info]", "ON [FISHING].[dbo].[fishing_trips].[vessel_id] =[FISHING].[dbo].[coastguard_info].[vessel_id]")
The resulting variable ‘sqlcmd’ should look like the SQL query you want to execute…in this case:
SELECT [FISHING].[dbo].[fishing_trips].[vessel_id], [FISHING].[dbo].[fishing_trips].[trip_id], [FISHING].[dbo].[fishing_trips].[depart_port],[FISHING].[dbo].[fishing_trips].[return_port],[FISHING].[dbo].[fishing_trips].[fishery],[FISHING].[dbo].[coastguard_info].[length],FROM [FISHING].[dbo].[fishing_trips],INNER JOIN [FISHING].[dbo].[coastguard_info],ON [FISHING].[dbo].[fishing_trips].[vessel_id] =[FISHING].[dbo].[coastguard_info].[vessel_id]
To use this query to pull your data into R use:
require(RODBC) #set up the connection channel = odbcConnect(dsn='server1',uid='aaronmamula',pwd='mintyhippo') #set up the query sqlcmd=paste("SELECT [FISHING].[dbo].[fishing_trips].[vessel_id]", ",", "[FISHING].[dbo].[fishing_trips].[trip_id]",",", "[FISHING].[dbo].[fishing_trips].[depart_port]",",", "[FISHING].[dbo].[fishing_trips].[return_port]",",", "[FISHING].[dbo].[fishing_trips].[fishery]",",", "[FISHING].[dbo].[coastguard_info].[length]", "FROM [FISHING].[dbo].[fishing_trips]", "INNER JOIN [FISHING].[dbo].[coastguard_info]", "ON [FISHING].[dbo].[fishing_trips].[vessel_id] =[FISHING].[dbo].[coastguard_info].[vessel_id]") #execute the query saving the result as a local data frame called 'df' df = sqlQuery(channel, sqlcmd) #close connection close(channel)
With the resulting data frame:
vessel_id | trip_id | depart_port | return_port | fishery | length |
---|---|---|---|---|---|
1 | 1 | zebra | zebra | salmon | 30 |
1 | 2 | zebra | zebra | crab | 30 |
2 | 1 | waylon | jennings | crab | 34 |
3 | 1 | lagunitas | lagunitas | albacore | 45 |
3 | 2 | lagunitas | rogue | crab | 45 |
3 | 3 | rogue | rogue | albacore | 45 |
There’s noting magical about R here by the way. Matlab and STATA offer similar abilities to nest SQL code in script to automate data pulls from a database. I’v been particularly fond of R lately because of the fine level of control one has in displaying data.
When I’m not working with really massive datasets, I like to start each new piece of analysis by cranking out a plethora of plots to get the flavor of the data. The package ggplot2 is one of the best graphing packages I’ve come across. It has a logic that should be particularly appealing to those who work with mapping or spatial analysis software. One creates plots by defining a plot space and adding data in layers.
The cornerstone to a lot of my work is a script I created years ago (and add to regularly) which contains a number of plotting function. I’m particular about what plots I like and how I like them to look so I insert a pretty simple line in the preamble to all my R scripts:
source("myplots.R")
which allows me access to all my favorite plots. I wrote a lot of these plotting functions to use the aforementioned ggplot2 package. The plotting types in this package usually accept arguments in the form of data frames….which brings us back to the utility of ‘RODBC’: some of the databases I work with have a large number of tables. Different parts of the analysis might require information from 5 or so different tables. When I want to know what the data in a table look like, I import the table (or parts of it) and pass the table to the functions in my personalized ‘myplots.R’ script to generate some visualizations of the data.
I could carry these tasks out by:
1. running the SQL query inside the Microsoft SQL Server Management Studio.
2. passing the results of the query to a .csv or .txt file.
3. reading the .csv/.txt file into R.
4. plot data
Using ‘RODBC’ I’m able to turn this 4 step process into a 2 step process. Considering the number of times I find myself wanting to know what the data in a particular database table look like, shaving 2 steps off the process is big time.
As a setup for a future post of data visualization using ggplot2, here’s an example of a rather simple plot matrix that I use to visualize output from a financial simulation script I wrote in R.