Summer Project: Creation of a Data Visualization Program using R

Submitted by SteveGabriel

Published: Mon, 07/13/2015 - 15:25 | Group(s):

Black Hills State University/SURF QuarkNet Center

Project Goal- To build an interactive website to visualize data taken from the weather detectors located at the 4850 level (4 Winze Wye, 17 Ledge, and Governor’s Corner) of SURF.

Day 1- 6/22/15

I received the main project overview today, and began my research.

Somebody else figured out how to plot the data interactively using the programming language “R”. However, we still do not quite understand how they did it, and we are attempting to replicate the results. I have decided to take an online course, similar to Codecademy, that will explain “R” and its usage better, as I am still a novice.

Scripts are run using the “source” function. For instance, if one had a script file named “CommenceWorldConquest.R”, one would simply use source("CommenceWorldConquest.R").
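As a small sketch of how this works, the example below writes a one-line script to a temporary file and then runs it. The temporary file and the variable name “greeting” are made up so the example can run by itself.

```r
# Write a tiny script to a temporary file so the example is self-contained.
# (The file contents and the variable name are hypothetical.)
script_path <- file.path(tempdir(), "CommenceWorldConquest.R")
writeLines('greeting <- "Hello from the script"', script_path)

# source() runs every line of the file in the current R session.
source(script_path)

# The variable defined inside the script is now available.
print(greeting)
```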

The next section explained vectors, which are lists of values. Vectors can hold strings, Boolean values, or numbers, as long as all the values are of the same type. For instance, a vector cannot contain 2 numbers and a string; if you try, R converts the numbers to strings so that every value has the same type. Vectors are created using the “c()” function.

Vectors can also be created using “start:end” notation, which creates a list of whole numbers from the first number listed to the last number listed. “6:8” would generate “6,7,8”. Another way to create these lists is to use the sequence function. The format for the sequence function is “seq(start,end,increment)”. For instance, “seq(6,8,0.5)” would create a sequence of 5 numbers from 6 to 8 that increases by 0.5 each time: 6, 6.5, 7, 7.5, 8.
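A short sketch of the three ways of building vectors described above (the variable names are chosen only for illustration):

```r
# c() combines values into a vector.
numbers <- c(6, 7, 8)
words <- c("six", "seven", "eight")

# Mixing types does not fail: R converts everything to one type.
mixed <- c(6, 7, "eight")
class(mixed)        # "character"

# start:end notation
counting <- 6:8     # 6 7 8

# seq(start, end, increment)
halves <- seq(6, 8, 0.5)
length(halves)      # 5
```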

If you store a vector in a variable, you can retrieve individual elements of the vector. For instance, if the vector “6,7,8” was stored in the variable “Counting”, one could type “Counting[2]”, which would return 7. This is similar to Java’s method of accessing individual characters of a string, but different in that the positions start at 1 instead of 0. It is also possible to access multiple values. “Counting[c(1,3)]” would return “6,8”. One can also use the start:end notation. “Counting[1:3]” would return 6,7,8. You can also change the value of individual elements in the vector through this method. For example, “Counting[3] <- 9” would make the third number 9 instead of 8. In addition to changing values, you can also add values to the vector. Entering “Counting[4] <- 9” would add 9 as a fourth value to the vector.
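The indexing steps above, written out as a runnable sketch:

```r
Counting <- c(6, 7, 8)

# Positions start at 1, not 0.
Counting[2]          # 7

# Several elements at once.
Counting[c(1, 3)]    # 6 8
Counting[1:3]        # 6 7 8

# Change the third value, then add a fourth one.
Counting[3] <- 9
Counting[4] <- 9
Counting             # 6 7 9 9
```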

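The next lesson explained how to name elements. One does this by using the names function. For instance, names(Counting) <- c("six", "seven", "eight") would name 6,7,8 as six, seven, eight. Names work much like variables: once an element is named, it can be retrieved by its name instead of its position, so Counting["seven"] would return 7.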
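A minimal sketch of naming elements and then using the names to read and change values:

```r
Counting <- c(6, 7, 8)
names(Counting) <- c("six", "seven", "eight")

# Look an element up by name instead of by position.
Counting["seven"]          # the element named "seven", value 7

# Names can also be used to assign a new value.
Counting["eight"] <- 80
```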

One can do vector math using R. After giving your vector values, you can add, subtract, multiply, and divide with ease. For instance, Counting+1 would return 7,8,9. Counting*2 would return 12,14,16. You can also add vectors to vectors. For instance, if I had a second vector “B” equal to (4,3,5), “Counting+B” would return (10,10,13). You can also compare vectors by using comparison operators, which check each value and return either “TRUE” or “FALSE” for it. Vectors can also be used with trigonometric functions.
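The same arithmetic as a runnable sketch. Every operation is applied to each element in turn:

```r
Counting <- c(6, 7, 8)
B <- c(4, 3, 5)

plus_one <- Counting + 1    # 7 8 9
doubled  <- Counting * 2    # 12 14 16
total    <- Counting + B    # 10 10 13

# Comparisons return one TRUE or FALSE per element.
bigger <- Counting > 6      # FALSE TRUE TRUE

# Trigonometric functions also work element by element.
sines <- sin(Counting)
```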

The plot function creates a scatterplot from x and y coordinates. The inputs do not have to be named x and y; any two vectors of the same length can be used.
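For example, a sine curve can be plotted from two vectors. This is only a sketch; the range and step size are arbitrary choices:

```r
# x values from 0 to 20 in steps of 0.1, and the sine of each one.
x <- seq(0, 20, 0.1)
y <- sin(x)

# Scatterplot of y against x.
plot(x, y)
```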

It is possible for a value to not be available, in which case you enter it as “NA”. Using a vector with an “NA” value in a function will return “NA” unless told otherwise, using the “na.rm” argument. One would enter that argument like so: “sum(a, na.rm = TRUE)”.
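A quick sketch of the difference, using made-up values:

```r
a <- c(1, 3, NA, 7, 9)

sum(a)                  # NA, because one value is missing
sum(a, na.rm = TRUE)    # 20, the missing value is skipped
```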

One can make a matrix in R by using the “matrix()” function. The format for the matrix function is matrix(values, rows, columns). For instance, “matrix(b,5,6)” would use the values of vector b to fill a matrix with 5 rows and 6 columns.

The “dim” function assigns dimensions to a vector. For instance, dim(b) <- c(3,6) would turn vector b into a matrix with 3 rows and 6 columns, provided that b holds exactly 18 values.
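Both approaches as a sketch. The vectors here are simple counting sequences chosen for illustration:

```r
# matrix(values, rows, columns): values fill the matrix column by column.
b <- 1:30
m <- matrix(b, 5, 6)
dim(m)               # 5 6

# dim() reshapes an existing vector in place.
v <- 1:18
dim(v) <- c(3, 6)
is.matrix(v)         # TRUE
```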

To get a value from a matrix, one would use the same method as getting a value from a vector, but with 2 numbers instead of 1: the row first, then the column.

One can also use this method to reassign values.

You can retrieve all the values in a row by omitting the column. You simply don’t put anything after the comma. To retrieve all values in a column, omit the row instead.

You can also use the above method combined with the “start:end” notation to retrieve all the values in multiple rows or columns. For instance, plank[,3:4] would return all values in columns 3 and 4.
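A sketch of matrix indexing. The matrix name “plank” comes from the text above, but its contents are made up:

```r
# 4 rows, 5 columns, filled column by column with 1 to 20.
plank <- matrix(1:20, 4, 5)

plank[2, 3]        # row 2, column 3: 10
plank[1, 4] <- 0   # reassign a single value

plank[2, ]         # every value in row 2
plank[, 4]         # every value in column 4
plank[, 3:4]       # every value in columns 3 and 4
```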

One can use the matrix to create powerful visualizations. For instance, using the contour function, one can create a contour map of data.

The median function finds the middle value of a vector ordered from least to greatest. One can also find the standard deviation using the sd function. One can then plot these results on a graph to see what a “normal” result would be.
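A short sketch with made-up values, showing how the mean and standard deviation give a range for a “normal” result:

```r
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

median(values)     # 4.5, halfway between the two middle values
mean(values)       # 5
spread <- sd(values)

# Values within one standard deviation of the mean.
normal_range <- c(mean(values) - spread, mean(values) + spread)
```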

The factor function categorizes types of data. It divides all the data into categories, which are called levels. The as.integer function shows which category number each value was assigned, and the levels function shows the categories themselves.
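A sketch with a hypothetical vector of categories. The table function is added here as one way to count how many values fall in each category:

```r
chests <- c("gold", "silver", "gems", "gold", "gems")
types <- factor(chests)

levels(types)        # "gems" "gold" "silver" (sorted alphabetically)
as.integer(types)    # 2 3 1 2 1, the level number of each value
counts <- table(types)   # how many values fall in each level
```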

When plotting data, the legend function takes a position on the graph, a vector of label names, and a vector of plot character IDs (pch). It is a good idea to build the labels with the levels function, which prevents you from having to go in and change everything by hand if you add 1 category.
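A sketch of a scatterplot with a legend built from a factor’s levels. All of the data values are invented for the example:

```r
weights <- c(300, 200, 100, 250, 150)
prices <- c(9000, 5000, 12000, 7500, 18000)
types <- factor(c("gold", "silver", "gems", "gold", "gems"))

# Each category gets its own plot character.
plot(weights, prices, pch = as.integer(types))

# Labels and symbols both come from the levels, so a new category
# shows up in the legend without editing this line.
legend("topright", levels(types), pch = seq_along(levels(types)))
```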

The data.frame function creates a data table not unlike an Excel spreadsheet.

Just like matrices, one can collect individual elements from the table. For instance, to select the second column of a data frame named treasure, one would type treasure[[2]] (BE SURE TO USE DOUBLE BRACKETS!!). You can also use the string name of the column, such as treasure[["weights"]]. However, a shorthand notation is also used. It is the data-table name, a dollar sign, and then the column name WITHOUT quotes. For instance, treasure$prices would generate the same answer as treasure[["prices"]].
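A sketch of the three ways to pull out a column. The data frame name “treasure” and the column names come from the text above, but the values are made up:

```r
treasure <- data.frame(
  weights = c(300, 200, 100),
  prices  = c(9000, 5000, 12000),
  types   = c("gold", "silver", "gems")
)

treasure[[2]]            # second column, by position
treasure[["weights"]]    # by name, with quotes
treasure$prices          # shorthand, without quotes
```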

The read.csv function allows one to import a CSV file into a data table. CSV stands for Comma Separated Values. For files that use separators other than commas, such as slashes, one uses the read.table function. Using the sep argument, you can define the character used for separating items.

With read.table, the columns are named V1 and V2 when a header is not specified. One can specify a header by adding the argument “header=TRUE”, which replaces V1 and V2 with the names from the first line of the file.
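A sketch of both functions. To keep the example self-contained, the data is passed in through the “text” argument instead of a file on disk, and the contents are invented:

```r
# Comma separated values: read.csv treats the first line as a header.
csv_text <- "Port,Moon\nSunrise,Waxing\nNoon,Full"
ports <- read.csv(text = csv_text)

# Slash separated values need read.table with sep.
slash_text <- "Port/Moon\nSunrise/Waxing\nNoon/Full"

# Without header = TRUE the columns are named V1 and V2,
# and the header line is read as an ordinary row of data.
no_header <- read.table(text = slash_text, sep = "/")
names(no_header)      # "V1" "V2"

with_header <- read.table(text = slash_text, sep = "/", header = TRUE)
names(with_header)    # "Port" "Moon"
```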

The merge function joins two data tables on a column they have in common, which allows one to have a table with multiple Y values sharing the same X value.
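A sketch with two small, made-up tables that share a “time” column:

```r
flow <- data.frame(time = 1:3, airflow = c(2.1, 2.4, 2.2))
pressure <- data.frame(time = 1:3, pressure = c(101.2, 101.5, 101.3))

# Rows are matched on the shared column.
merged <- merge(flow, pressure, by = "time")
names(merged)    # "time" "airflow" "pressure"
```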

Day 5-8 (Learning R)- 6/26/2015-7/1/2015

These last few days have been spent learning the ins and outs of the R programming language. I tried multiple online tutorials, such as the free one offered by Udemy and a course on YouTube. After I felt comfortable enough with R, I began using the Shiny package, which is what Chuck’s program is built on. I worked through the tutorial provided by the developers of RStudio, and a video webinar provided by the same people. These provided the basic information that I felt I needed to attempt to work with Chuck’s script.

The above images are screencaps from the output of Chuck’s script. It creates 3 graphs of data that the user inputs, and a data table with it. So, I figured the first step of attempting to modify Chuck’s script was to simply get it to run. This turned into a quite grueling ordeal. At first, when I tried to run it, I realized that I needed to change the filepath, as his data was stored in a different directory than mine was. After this, I thought the script would run fine. However, whenever I ran it, it would return the least helpful error message I have ever seen.

“Replacement has 0 Rows, Data has 34257”.

I did not know how to fix it. I isolated the error message down to one line, but everything I would change in there would simply spit out different error messages. The next day, I was ready for another struggle to get that to work, but I was able to figure it out rather quickly. I opened up the data file I was using, and removed various parts of the header, and I also removed the spaces in the header. This was the whole problem. I was able to get the script to run flawlessly. The next step was to try to modify the script to achieve our goals in visualization:

· Different colors for each graph

· Multiple data sets on one axis (For instance, being able to graph airflow and air pressure on the same graph)

· 6 different data sets

· 2 graphs.

I started trying to change the color. This proved much easier than I imagined, as I simply had to add a “colour” argument to a small area of Chuck’s script.

I did that for each output, with a different color. The next step I attempted was to draw each graph as a readable line. I did this using the geom_line() function. In Chuck’s original script, he used geom_point(), which we thought looked more disorganized than a line. I then copied the above script and changed the inputs to create 6 different data sets, which inadvertently also created 6 different graphs.

I was then able to create a graph with multiple y outputs on one plot.

It is definitely much different from the original format, which only allowed one Y plot for the X plot.
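The general technique can be sketched as follows. This is not Chuck’s script or my modified version of it; the data frame, column names, and values are invented, and it assumes the ggplot2 package is installed. Each geom_line() layer draws one data set on the same axes, with its own “colour” argument:

```r
library(ggplot2)

# Hypothetical readings that share one time axis.
readings <- data.frame(
  time     = 1:5,
  airflow  = c(2.1, 2.4, 2.2, 2.6, 2.5),
  pressure = c(3.0, 2.9, 3.1, 3.2, 3.0)
)

# One geom_line() layer per data set, each with a fixed colour.
p <- ggplot(readings, aes(x = time)) +
  geom_line(aes(y = airflow), colour = "blue") +
  geom_line(aes(y = pressure), colour = "red") +
  ylab("reading")

print(p)
```

Because the colours are set outside of aes(), ggplot2 does not draw a legend for them.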

I then created a second merged graph, for analysis purposes. I also created a second set of inputs.

My next step will be to try and create a legend for the graphs, as it can be hard to differentiate between the various elements of them.

7/1-7/7 (Trying to Create a Legend)

This task was much more difficult than I anticipated, probably taking more time than any of the other individual components of the Shiny app. Every single approach I tried ended in unhelpful error messages. My frantic searching on the internet proved to be unhelpful. The only useful fact I found was that ggplot2 (the R package required to graph our data) was supposed to be creating legends automatically. After a week of searching and trying everything I found, I finally was able to create a legend. It is quite rudimentary, but it should at least give me a baseline to try and get one that suits our purpose better. The code that finally generated the legend is below:
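As a general illustration of how ggplot2 creates legends automatically (a sketch under my own assumptions, with invented data and names, and not necessarily the code used in this project): a legend appears when “colour” is mapped to a column inside aes(), which usually means stacking the data sets into one long table first.

```r
library(ggplot2)

# Hypothetical readings that share one time axis.
readings <- data.frame(
  time     = 1:5,
  airflow  = c(2.1, 2.4, 2.2, 2.6, 2.5),
  pressure = c(3.0, 2.9, 3.1, 3.2, 3.0)
)

# Stack the two data sets into one long table with a label column.
long <- rbind(
  data.frame(time = readings$time, measurement = "airflow",
             value = readings$airflow),
  data.frame(time = readings$time, measurement = "pressure",
             value = readings$pressure)
)

# Mapping colour inside aes() makes ggplot2 draw the legend itself.
p <- ggplot(long, aes(x = time, y = value, colour = measurement)) +
  geom_line() +
  scale_colour_manual(values = c(airflow = "blue", pressure = "red"))

print(p)
```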

A Dynamic Legend (or How I Learned Programming Can Be Like Beating Your Head Against a Wall. Repeatedly.)

7/7-7/10

These last three days have proven to be quite stressful. I have yet to make any true progress from my last entry, except for an incredible amount of trial and error. The only true accomplishment I had was changing the line size, and even that was somewhat difficult. I have been trying to do two things:

1. Create a legend that changes dynamically based upon user inputs

2. Give the graphs multiple Y axes, for analysis purposes.

The second one seems to be completely unsupported by the developers of ggplot2, the package we have been using to create the graphs. This is due to the fact that it goes against their philosophy, which seems incredibly arbitrary in my opinion.

My first goal also seems to be a feature that nobody else wants to use, as I was unable to find any documentation about it during the long hours that I spent researching these issues.

Eventually, we both gave up on trying to solve this, and emailed Chuck, the author of the original code, to see if he had any ideas. He pretty much reiterated what we learned in the last few days: both of our final goals are practically impossible in ggplot2.

7/13/15

We have shifted focus away from the dynamic legend, and have started working on separate projects. I have started looking for a way to allow the data to be imported from a webserver. My first step was getting my data onto the internet, using Google Drive and Dropbox. I then tried to load it back into R from those websites, and failed miserably. I continuously received a “duplicate row names not allowed” error. I take back what I said earlier; this is by far the least helpful error message in existence. After a Sisyphean effort, I finally was successful in isolating the error. It was a problem in the URLs themselves, which was remedied by the usage of the repmis package. This allowed me to read the file from Dropbox.

^The code I used to read the file

However, I cannot input velocity values now, as it generates another vague, unhelpful error message. I’m starting to think that programmers actively conspire to make error messages intentionally impossible to understand.

I was able to give the data a cursory analysis, and there are a few interesting things that keep cropping up. For instance, the Governors Corner and 17 Ledge graphs are incredibly similar.

In conclusion, we successfully created an interactive program for data analysis. During my time spent here, I learned how to:

· Use R

· Use ggplot2

· Use Shiny

As I was told by a professor at BHSU, many research opportunities require that one build their own equipment and create their own programs, as the required materials will likely not have been made yet. I believe this project was incredibly valuable in that aspect, as it introduced me to the concepts of computer programming.
