Archive | visualization

abstracting data: signal from noise

In keeping with the previous post about the need to incorporate creative thinking into science, here’s a novel approach that made the NYT headlines this morning.

Instance of a DataSet - Month 7, Floor 6, Pan

Brooklyn-based artist Daniel Kohn is working with geneticists on conceptual tools for analyzing their large data sets. Using intuitive, perceptual learning combined with his artistic approach to data reduction, Kohn is helping these scientists find new ways to pull signal out of the noise. A collaboration begun in 2003 with research groups in Boston led to a residency at the Broad Institute of MIT and Harvard, and Kohn’s interest in science expanded to include different media: painting, drawing, and computer modeling [1]. Since his pioneering collaboration with the Broad, Kohn and several other artists have participated in its artist-in-residence program; more information about the program can be found on the institute’s webpage.

Shown in the image above, his series “Instance of a Dataset” culminated in a unique mural for the Broad Institute and an ongoing collaboration between artists and scientists. More images from this installation may be found in his online gallery [2]. Kohn’s work speaks to a type of perceptual thinking and visual learning that we all use in our daily lives; the difference is that he is applying these tools in experimental approaches to real-world datasets. Call it what you will: creative, intuitive, perceptual, or visual learning. These methods are all part of a new approach to thinking about complex data in novel ways.

Here’s another example of sorting the signal from the noise, this time in a simple dataset from my first thesis [3].

Prochlorococcus pop4 (blue)

This is a flow cytometric histogram, or density plot, showing distinctly different populations of marine cyanobacteria from a station sampled off the coast of South America in 2008. A frozen vial of seawater was analyzed by running a small volume through a flow cytometer, and the output is literally a cloud of dots like this one. Each dot signifies a particle of a particular size with unique fluorescence properties. The goal is to quantify this mess and distinguish the background noise from the populations of interest. This is accomplished easily with out-of-the-box analysis software and careful knowledge of the properties inherent in the type of data you’re working with, but there is still an intuitive side to the analysis. It can be subjective and open-ended when you are hand-selecting groups of dots and making artificial cut-offs. There are no hard-and-fast rules for this type of data analysis, and you must be the kind of scientist who can work with imperfect data.
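As a rough illustration of what this hand-gating amounts to, here is a minimal sketch in base R. The data frame, channel names, and gate coordinates are all invented for the example; the actual analysis was done in dedicated flow cytometry software.

# minimal gating sketch (hypothetical particles and cut-offs)
set.seed(1)
# one row per particle: log-scaled scatter and red fluorescence channels
fcm <- data.frame(scatter = c(rnorm(5000, 2.0, 0.3), rnorm(1000, 3.0, 0.2)),
                  red.fl  = c(rnorm(5000, 1.5, 0.3), rnorm(1000, 2.8, 0.2)))
# the raw output: literally a cloud of dots
plot(fcm$scatter, fcm$red.fl, pch = ".",
     xlab = "forward scatter (log)", ylab = "red fluorescence (log)")
# a hand-drawn rectangular gate around the population of interest
gate <- with(fcm, scatter > 2.5 & scatter < 3.5 & red.fl > 2.2 & red.fl < 3.4)
rect(2.5, 2.2, 3.5, 3.4, border = "blue")
# quantify: how many particles fall inside the gate?
sum(gate)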

I recently finished a 3-year fellowship working to extract patterns from a unique time-series dataset. Most of my work involved the application and understanding of statistical models. There was a lot of time spent in front of messy data, a lot of visual tools, and a lot of head scratching. There were things like this simple heat-map, which took too much time to construct and holds far more data points than you can imagine, but which resulted in a visually interesting way to think about a dataset [4].

Heatmap plot showing annual abundance of selected plankton groups from a long-term data series. Plot by H.A. Wright using the statistical software R.
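For anyone curious about the mechanics, here is a minimal sketch of this kind of heat-map in base R. The plankton group names are borrowed from the dataset used later in this post, but the abundance values and year range are invented for illustration.

# minimal heat-map sketch (invented abundances; the real plot used the full series)
set.seed(42)
groups <- c("Acartia.clausi", "Acartia.danae", "Acartia.discaudata",
            "Acartia.longiremis", "Acartia.margalefi")
years  <- 1995:2009
# one row per group, one column per year of annual mean abundance
abund  <- matrix(rlnorm(length(groups) * length(years)),
                 nrow = length(groups), dimnames = list(groups, years))
# suppress clustering (Rowv/Colv = NA) so rows and columns keep their order
heatmap(abund, Rowv = NA, Colv = NA, scale = "row",
        xlab = "year", main = "Annual abundance of selected plankton groups")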

Using this type of approach leaves the data open to interpretation in a sometimes fuzzy manner, but with the vast range of data types and rapidly evolving software, there are new and beautiful ways to think about your science. I’ll feature innovative artists and scientists from time to time on this webpage. Feel free to comment or ask further questions about my previous work.

References:

  1. Kohn, Daniel. http://kohnworkshop.com/TextPage-GR-Broad.php
  2. Kohn, Daniel. “Instance of a Dataset,” online Flickr gallery, Commissions: Broad Institute, 2013. https://flic.kr/s/aHsjy5jxB8
  3. Wright, H.A. Biogeographical analysis of picoplankton populations across the Patagonian Shelf Break during austral summer. MS thesis, 2010.
  4. Wright, H.A. Long-term variability of plankton phenology in a coastal, Mediterranean time series (LTER-MC). MPhil thesis, 2013.

To pre-process or not to pre-process (this is the data)

A short working definition from Wikipedia [1]:

Data pre-processing is an often neglected but important step in the data mining process. The phrase “Garbage In, Garbage Out” is particularly applicable to data mining and machine learning projects. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis.

Let me preface this post with a working definition of process. When I refer to processing these days, it usually means data. If I’m out at sea, I’ll refer to processing samples, which means “running” samples through a machine to obtain output, or data in this case. Data analysis involves two levels of processing: the initial step is sample processing and the second is data processing. Together, these two components result in meaningful information. There is a third type of processing in data acquisition and analysis, termed pre-processing. By my own definition, pre-processing is a way of “combing” data: cleaning up a complex and often messy raw data set so that clean statistical analysis can occur. Raw environmental data can contain errors, missing data points, negative values, etc., and when you are trying to discern specific relationships within a dataset, these issues can lead to errors in both graphical and statistical interpretation.
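As a sketch of what this “combing” can look like in practice, here is a minimal R example; the file name and the abundance and date column names are hypothetical.

# minimal pre-processing sketch (hypothetical file and columns)
raw <- read.delim2("raw_counts.txt", na.strings = c("NA", ""))
# negative abundances are physically impossible: flag them as missing
raw$abundance[which(raw$abundance < 0)] <- NA
# rows without a date cannot be placed in the time series, so drop them
clean <- raw[!is.na(raw$date), ]
# check how many missing values remain in each column
colSums(is.na(clean))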

An example of pre-processing an original data file:

The need to pre-process data arises for a number of reasons. If any of you have worked with SAS input files, that’s a good example: data are combed and reformatted for the final purpose of feeding a SAS-compatible file into statistical analysis. My case is not quite as extreme, but just as functional. Readability is one issue: I have poor eyesight and need my data to show specific results in a clear fashion. And given my primitive knowledge of R, I have not yet constructed elaborate time-series data frames or tables that can quickly tabulate and spit out the numerical analysis. Pre-processing in this instance therefore serves two purposes: 1) it meets my immediate need of extracting specific information, or subsets of data, from a larger dataset, and 2) it provides a framework for future data input for this project. Since the data are historically maintained in the old method, it’s a bridge between that method and the new analysis approach. In the image I’ve shown the previous approach on the left, augmented with an extracted arithmetic mean, which I’ve then translated into the pre-processed file shown in the right-hand image.
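Here is a rough sketch of how that yearly-mean extraction might look in R; the file and the year, species, and abundance column names are hypothetical, since the actual file was translated by hand from the original layout.

# minimal sketch: from cleaned counts to a pre-processed file of yearly means
clean  <- read.delim2("mc_clean.txt")
# arithmetic mean abundance per year and species
yearly <- aggregate(abundance ~ year + species, data = clean,
                    FUN = mean, na.rm = TRUE)
# order by date and species so the layout matches the new file
yearly <- yearly[order(yearly$year, yearly$species), ]
# write the pre-processed subset out as a tab-delimited text file
write.table(yearly, "mc_yearly_means.txt", sep = "\t", row.names = FALSE)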

What’s the difference? Why bother? Is it necessary? To answer the first question, the difference is a clear layout of yearly mean values ordered by date and species, such that the information is more accessible. When I import this data set into another analysis program such as R, MATLAB, or gnuplot, it makes life easier. Why bother? If it makes your life easier, it’s always worth it. It may take a small amount of time to translate and set up a new data file, but it is worth every minute if it makes the analysis clear. If you run statistical tests on the data and have to hunt and peck through the original dataset, it’s a nightmare; if you use a pre-processed file that contains only the subset of data you wish to analyze, the results of tests between parameters become clear. (This is my biased opinion, of course!) And finally, is it necessary? Scroll down to the section of the gnuplot article [2] entitled “Dealing with real-life data sets.” I even love the title!
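To make the “tests between parameters” point concrete, here is a minimal sketch, assuming a hypothetical wide-format pre-processed file with one column of yearly means per species.

# with a tidy pre-processed file, a test between two parameters is one line
yearly <- read.delim2("mc_yearly_means_wide.txt")
# correlate the annual means of two species across years
with(yearly, cor.test(Acartia.clausi, Acartia.danae))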

References

1. “Data pre-processing,” Wikipedia, last modified 13 December 2010. http://en.wikipedia.org/wiki/Data_Pre-processing
2. Sastry, Nishanth R. “Visualize your data with gnuplot,” IBM developerWorks, accessed 3 Feb 2011. http://www.ibm.com/developerworks/library/l-gnuplot/

phytoplankton from space

Phytoplankton bloom off the Patagonian Shelf – December 21st

Image source

Resources for oral and poster presentations in science

I am scheduled to give my first department presentation next Monday, the 24th. Although it will be an informal presentation, it inevitably requires a good deal of fiddling with the outline, a great deal of reading, and finally a small bit of nerves in anticipation of standing in front of peers and colleagues within the field. Presenting an oral, written, or visual explanation of your research is a necessary and valuable part of the scientific process. As I experienced at the 2010 AGU conference, and through reading Randy Olson’s recent book, public speaking and communication in general are unique skills required of a scientist.

It is both an honor and a responsibility to be among a group of fellow scientists as an invited speaker. As a graduate student, I believe it is extremely important to take advantage of each opportunity to speak, whether formally or informally. I hope these resources will help me and others deliver succinct, well-designed presentations. Feel free to leave comments and suggestions if you would like to share related resources.

A few years ago I came across Dr. Purrington’s excellent guidelines for putting together oral and poster presentations.

This advice is applicable to many fields and different levels of research, and I really enjoy Dr. Purrington’s approach to the subject. There is also an extremely useful Flickr group that allows you to fearlessly post your poster image (in progress or finished) and receive constructive comments. I find that the hardest part of constructing either an oral or a poster presentation is the feedback! Sometimes I wish I had placed an anonymous comment envelope next to my poster, or allowed people to give feedback after a presentation.

  • I’d also like to highlight an invaluable resource in the field of oceanography, presented by The Oceanography Society: TOS Scientifically Speaking. This is an excellent reference for both early-career and established scientists, and I consider the points outlined in this guide the gold standard of expectations within my field. I have shared printed versions and PDFs of this booklet with students and colleagues over the years since it was produced, and I highly recommend it to students of all ages now that we are expected to deliver electronic presentations and materials.

R: read.delim2 a .txt file and calculate the mean of 5 classes

################################################################################
# MC TEST data set by H.Wright                                                 |
################################################################################
# set the working directory (note: backslashes must be escaped in R strings)
setwd("D:\\Projects\\R_analysis\\mc_30.12.2010")
# load the test marechiara dataset and assign the data table to object 'mc'
# (the path is relative to the working directory set above)
mc <- read.delim2(file = "mc_test1.txt",
          na.strings = "NA",
          nrows = -1,
          skip = 0,
          check.names = TRUE,
          strip.white = FALSE,
          blank.lines.skip = TRUE)
# list the data table column variables
names(mc)
#[1] "year" "Acartia.clausi" "Acartia.danae" "Acartia.discaudata" "Acartia.longiremis" "Acartia.margalefi"

# define variables in object mc
col2 <- mc$Acartia.clausi
col3 <- mc$Acartia.danae
col4 <- mc$Acartia.discaudata
col5 <- mc$Acartia.longiremis
col6 <- mc$Acartia.margalefi
# take the mean of each variable
mean(col2)
mean(col3)
mean(col4)
mean(col5)
mean(col6)

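For what it’s worth, the same five means can be obtained in a single call, assuming the same 'mc' data frame as above:

# compact alternative: means of columns 2 through 6 in one call
colMeans(mc[, 2:6], na.rm = TRUE)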

