Since I have now done a similar analysis for two states using the same data set from the Pipeline and Hazardous Materials Safety Administration (PHMSA), I figured that for the third state we're working in, I would try to do the same analysis using R so I could automate the process, enabling us to select any state and get the numbers we wanted. My goal was to create an R script that would take the year and state as input from the user and would create an Excel spreadsheet with summary tables for all the numbers we want summarized.

I started by playing around in RStudio to see if I could read in my data (which I had to open in Excel and re-save as a csv) and then work out the code to start doing the things I had done in Excel.

I took an R class on Coursera a couple of years ago, but it was very challenging and I didn't remember much from it, so I didn't get very far. Luckily my friend Alison @linkalis is good at data analysis in R and happy to share her vast knowledge. We met up for coffee and I walked her through my previous analysis, explaining what I was hoping to do. She hooked me up with some basics: she showed me the dplyr library and how to use it, and she made me an R script with examples of some of the things I said I'd want to do. I took it and ran. Thanks Alison!!

Step One: Think Through the Goal

I looked at the analysis I had done using Excel to pull together some summary tables. I drew out the tables I wanted, to get a clear idea of what I needed to create and what data I would have to pull together. I wanted nine tables:

  • number of state pipeline operators
  • total state main pipeline mileage
  • total state number of services
  • pipeline mileage by operator
  • services by operator
  • mileage of main pipelines made from leak-prone materials by operator
  • number of services made from leak-prone pipelines by operator
  • number of leaks in main pipeline by operator
  • number of leaks in services by operator

Step Two: Some Clean-Up and Formatting

I started using the code in the sample script to figure out how to create the tables. There were a few munging tasks I had to figure out to clean up my tables (pulled together in the sketch after this list):

  • The STOP field, which identifies the state where the operator is operating the pipelines covered in the data set, had to be read in as a character so that I could filter on it. The R default is to read it in as a factor variable.
  • There were some extra rows that read in as NAs, so I incorporated some NA removal using the unique ID for each operator: filter(!is.na(OPERATOR_ID))
  • Once I had my summary tables, I removed the rows where all the entries in the columns we were interested in were 0s, since I knew from looking at the tables in Excel that many operators did not have any miles of pipeline made of the materials we were looking at: filter(OP_TOTAL != 0)
  • To order the table with the largest number at the top (we’re interested in the largest numbers), I used arrange(desc(OP_TOTAL))
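
Here's a minimal sketch of those clean-up steps in one place (stringsAsFactors = FALSE is one way to get STOP in as a character; colClasses would work too):

library(dplyr)

# Read the csv with text fields as characters rather than factors,
# so STOP can be filtered directly
gasdata_full <- read.csv("annual_gas_distribution_2014.csv",
                         stringsAsFactors = FALSE)

# Drop the extra rows that read in as NAs, using the operator ID
gasdata_full <- gasdata_full %>% filter(!is.na(OPERATOR_ID))

# Later, once a summary table has an OP_TOTAL column:
# drop the all-zero operators and sort largest-first
# materials_summary %>% filter(OP_TOTAL != 0) %>% arrange(desc(OP_TOTAL))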

Step Three: Get the Code to Make the Tables

To start making the tables (the pieces below come together in the sketch after this list):

  • I assigned some global variables for year and state to use as placeholders that would later become user input.
  • To read in the data, I used read.csv: gasdata_full <- read.csv("annual_gas_distribution_2014.csv")
  • To make a table with a subset of the fields available, I call those fields by their headers. Here, I include the NA removal, filter by state (in this case, Indiana), and create a table that shows mileage of main pipeline made of unprotected bare steel, unprotected coated steel, cathodically protected bare steel, and cast iron for each operator. Doing this in dplyr allowed me to string all those operations together using the %>% pipe operator: in_main_materials_wide <- gasdata_full %>% filter(!is.na(OPERATOR_ID) & STOP=="IN") %>% select(OPERATOR_NAME, OPERATOR_ID, STOP, MMILES_STEEL_UNP_BARE, MMILES_STEEL_UNP_COATED, MMILES_STEEL_CP_BARE, MMILES_CI)
  • Next I wanted to create a new field that is a summary of other fields; this example is the sum of all leak-prone pipeline mileage across the four individual categories. I used group_by in dplyr to group entries by operator name and mutate to add a column to the table with the calculated sum of all the pipeline materials we're looking at: materials_summary <- materials %>% group_by(OPERATOR_NAME) %>% mutate(OP_TOTAL = sum(MMILES_STEEL_UNP_BARE, MMILES_STEEL_UNP_COATED, MMILES_STEEL_CP_BARE, MMILES_CI))
  • Challenge: using arrange to sort the data by a total that was calculated under group_by in dplyr didn't work. When I googled it, I found a Stack Overflow answer suggesting that I had to ungroup the data before sorting it. It worked! materials_summary <- materials_summary %>% ungroup() %>% filter(OP_TOTAL != 0) %>% arrange(desc(OP_TOTAL))
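
Chained together, the whole thing runs from the raw data to the sorted summary in one pipeline (a sketch that folds the snippets above into a single chain):

library(dplyr)

# One chain from the raw data to the sorted leak-prone-materials summary
materials_summary <- gasdata_full %>%
  filter(!is.na(OPERATOR_ID) & STOP == "IN") %>%
  select(OPERATOR_NAME, OPERATOR_ID, STOP,
         MMILES_STEEL_UNP_BARE, MMILES_STEEL_UNP_COATED,
         MMILES_STEEL_CP_BARE, MMILES_CI) %>%
  group_by(OPERATOR_NAME) %>%
  mutate(OP_TOTAL = sum(MMILES_STEEL_UNP_BARE, MMILES_STEEL_UNP_COATED,
                        MMILES_STEEL_CP_BARE, MMILES_CI)) %>%
  ungroup() %>%                # ungroup so arrange() can sort on the total
  filter(OP_TOTAL != 0) %>%    # drop operators with no leak-prone pipe
  arrange(desc(OP_TOTAL))      # largest totals at the top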

Step Four: Put the Code into Functions

When I began thinking about how to turn my code into functions, I realized I had to make a cognitive jump from simply manipulating data frames to creating a function and assigning its output to a variable. Conceptually I struggled with functions because I didn't understand how to "put something into" the function, especially since I started without using any arguments. Eventually I understood that calling the function and assigning that call to a variable (even if there are no arguments) creates my table, named after the variable.
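
For example, a no-argument version might look like this (table_operators is just a made-up illustration, not one of my real functions):

# A function with no arguments that builds and returns a data frame
table_operators <- function() {
  gasdata_full %>%
    filter(STOP == state) %>%
    select(OPERATOR_NAME, OPERATOR_ID)
}

# Calling the function and assigning the call creates the table
Operators <- table_operators()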

Much of my analysis gets repeated, because it gets done for both main pipeline and services. I had a tendency to want to generate more than one value in the output, but an R function can only return a single object. Then I realized I could use an argument here for either pipeline main or services, so I set up an if / else if in my functions that would let me run the same function for main and then for services.

Here’s an example of the summary for total number of miles of main pipeline and total number of services:


# table2 is a summary of total mileage of pipe (main and services) by operator
table2 <- function(pipe){
  if (pipe == "main"){
    summary <- gasdata_full %>% filter(STOP == state) %>% select(OPERATOR_NAME, OPERATOR_ID, STOP, MMILES_TOTAL)
    summary <- arrange(summary, desc(MMILES_TOTAL))
  } else if (pipe == "services") {
    summary <- gasdata_full %>% filter(STOP == state) %>% select(OPERATOR_NAME, OPERATOR_ID, STOP, NUM_SRVCS_TOTAL)
    summary <- arrange(summary, desc(NUM_SRVCS_TOTAL))
  }
  return(summary)
}

To call the function with its arguments and assign the output, here's an example showing the tables created using the function above: one for mileage of main pipeline, and another for number of services:

Total_Main_Mileage <- table2("main")
Total_Services_Mileage <- table2("services")

Step Five: Tie the Functions Together and Write Output

Once I had all my functions written to generate the tables I wanted, I had to write a script that would create the tables and write them to an Excel file. I had to call the functions (for those with arguments, once per argument) and assign the output to a set of variables. Then I wanted to write each of those tables to a different worksheet in an Excel workbook.

This seemed tricky, so naturally I googled that too. Luckily, someone had already done this exact thing! Over at the statMethods blog, there was a nice function built on the xlsx library that did just what I wanted. I copied the function and used it to write my tables to an Excel file. The function takes the names of the tables passed to it and uses them to name the worksheets in the workbook.
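
I won't reproduce the borrowed function exactly, but the idea looks roughly like this (a sketch of the same pattern: recover the names the tables were passed in under, then write each one to its own sheet with write.xlsx from the xlsx library):

library(xlsx)

# Write each data frame passed in to its own worksheet,
# using the variable names as the sheet names
save.xlsx <- function(file, ...) {
  objects <- list(...)
  # recover the names the tables were passed in under
  objnames <- as.character(as.list(match.call(expand.dots = TRUE)))[-c(1, 2)]
  for (i in seq_along(objects)) {
    # append after the first sheet so each table gets its own tab
    write.xlsx(objects[[i]], file, sheetName = objnames[i], append = (i > 1))
  }
}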

I used another little bit of code to generate the name of the workbook based on the year and state input by the user: wkbk_name <- paste(state, "_", year, ".xlsx", sep = "")

The last part was figuring out how to let the user input the year and state they choose. I wrote a couple of small functions using the readline() function; here is the one for year:

readlines_year <- function() { 
  year <- readline(prompt="Enter a four digit year (ex: 2013): ")
  return(year)
}
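
With the input functions in place, the end of the script ties everything together, roughly like this (readlines_state is a hypothetical companion to the year function above, and building the csv filename from the year is my assumption about how the pieces connect):

# Gather user input, read the data, build the tables, write the workbook
year  <- readlines_year()
state <- readlines_state()   # hypothetical twin of readlines_year()

gasdata_full <- read.csv(paste("annual_gas_distribution_", year, ".csv", sep = ""),
                         stringsAsFactors = FALSE)

Total_Main_Mileage     <- table2("main")
Total_Services_Mileage <- table2("services")

wkbk_name <- paste(state, "_", year, ".xlsx", sep = "")
save.xlsx(wkbk_name, Total_Main_Mileage, Total_Services_Mileage)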

Another challenge: I programmed this on my work laptop, and when I tried to run it on my personal laptop I got a Java error. Because the xlsx library depends on Java, you need an up-to-date version of Java to run the R script. Go figure.

Finally, I checked the summary numbers against the analysis I had done using Excel and Tableau. They were the same.

Like Magic

Now that everything is put together, I can simply run the R script, enter the year and state I want at the prompts, and an Excel workbook with the nine summary tables is created in my working directory. Magical! In the future, I'd like to keep working with this data set and look at some of these numbers over time (the data sets go back to the 1970s) or at comparisons between states. I have also started working on plotting this data using R. Stay tuned for more on this project!

Bonus: Transforming Data

As I talked about in Part 1, one of the tasks I did to make the stacked bar charts in Tableau was to take data in wide form and transform it to long form. Alison also showed me how to do this using the reshape2 library in R. More about that in a later part of the project, when I work on plotting.
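
As a tiny preview, that wide-to-long transform is basically one melt() call (a sketch, using the Indiana materials table from Step Three):

library(reshape2)

# Wide -> long: one row per operator/material combination
materials_long <- melt(in_main_materials_wide,
                       id.vars = c("OPERATOR_NAME", "OPERATOR_ID", "STOP"),
                       variable.name = "MATERIAL",
                       value.name = "MILES")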

Read Part 1 of this project