
abstracting data: signal from noise

In keeping with the previous post about the need to incorporate creative thinking into science, here’s a novel approach that made the NYT headlines this morning.

Instance of a DataSet - Month 7, Floor 6, Pan

Brooklyn-based artist Daniel Kohn is working with geneticists on conceptual tools to analyze their large data sets. Using intuitive, perceptual learning combined with his artistic approach to data reduction, Kohn is helping these scientists find new ways to pick out signal amongst the noise. His interest in science began with a 2003 collaboration with research groups in Boston, which led to a residency at the Broad Institute for genetic research at MIT and Harvard, and it has since expanded across different mediums: painting, drawing, and computer modeling [1]. Since his pioneering collaboration with the Broad, Kohn and several other artists have participated in its artist-in-residence program. More information about the program can be found on their webpage.

Shown in the image above, his series “Instance of a Dataset” culminated in a unique mural for the Broad Institute and an ongoing collaboration between artists and scientists. More images from this installation may be found in his online gallery [2]. Kohn’s work speaks to a type of perceptual thinking and visual learning that we all use in our daily lives, with the difference that he is applying these tools to experimental approaches with real-world datasets. Call it what you will: creative, intuitive, perceptual, or visual learning. These methods are all part of a new approach to thinking about complex data.

Here’s another example of sorting out the signal from the noise, this time in a simple dataset from my first thesis [3].

Prochlorococcus pop4 (blue)

This is a flow cytometric histogram or density plot showing distinctly different populations of marine cyanobacteria from a station sampled off the coast of South America in 2008. The frozen vial of seawater was analyzed by running a small volume through a flow cytometer, and the output is literally a cloud of dots like this. Each dot represents a particle of a particular size with its own fluorescence properties. The goal is to quantify this mess and distinguish between the background noise and the populations of interest. This is accomplished easily with out-of-the-box image analysis software and careful knowledge of the properties inherent in the type of data you’re working with, but there’s still an intuitive side to this analysis. It can be subjective and open-ended when you are hand-selecting groups of dots and making artificial cut-offs. There are no hard-and-fast rules for this type of data analysis, and you must be the kind of scientist who can work with imperfect data.
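To make the idea of a hand-drawn cut-off concrete, here is a minimal sketch of a rectangular "gate" applied to a cloud of dots. Everything in it is an assumption for illustration: the `gate` function, the two axes (scatter and fluorescence), the synthetic numbers, and the gate limits are all made up, and this is not the software or the settings used for the plot above.

```python
import random

def gate(events, scatter_range, fluor_range):
    """Keep events whose (scatter, fluorescence) pair falls inside a rectangular gate."""
    lo_s, hi_s = scatter_range
    lo_f, hi_f = fluor_range
    return [(s, f) for s, f in events if lo_s <= s <= hi_s and lo_f <= f <= hi_f]

random.seed(0)  # fixed seed so the synthetic cloud is reproducible

# Hypothetical data: one tight population plus diffuse background noise
population = [(random.gauss(50, 3), random.gauss(200, 10)) for _ in range(500)]
noise = [(random.uniform(0, 100), random.uniform(0, 400)) for _ in range(500)]
events = population + noise

# The gate limits are the subjective part: someone picks them by eye
selected = gate(events, scatter_range=(40, 60), fluor_range=(160, 240))
print(len(selected), "of", len(events), "events fall inside the gate")
```

Moving the gate limits by a few units changes the count, which is exactly the subjectivity described above.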

I recently finished a 3-year fellowship working on a unique time-series dataset to extract patterns. Most of my work involved applying and understanding statistical models. There was a lot of time in front of messy data, and a lot of visual tools and head scratching. There were things like this simple heat map, which took too much time to construct and holds a lot more data points than you can imagine, but which resulted in a visually interesting way to think about a dataset [4].

Heatmap plot showing annual abundance of selected plankton groups from a Long-term data series. Plot by H.A.Wright using the statistical software R.

This type of approach leaves the data open to interpretation in a sometimes fuzzy manner, but with so many types of data and such rapidly evolving software, there are new and beautiful ways to think about your science. I’ll feature innovative artists and scientists from time to time on this webpage. Feel free to comment or ask further questions about my previous work.


  1. Kohn, Daniel,
  2. Kohn, Daniel, Online Flickr gallery Commissions Broad Institute 2013, “Instance of a Dataset”  url:
  3. Wright, H.A., Biogeographical analysis of picoplankton populations across the Patagonian Shelf Break during austral summer, MS Thesis, 2010.
  4. Wright, H.A. MPhil thesis: Long-term variability of plankton phenology in a coastal, Mediterranean time series (LTER-MC), 2013.

To pre-process or not to pre-process (this is the data)

A short wiki definition [1]:

Data pre-processing is an often neglected but important step in the data mining process. The phrase “Garbage In, Garbage Out” is particularly applicable to data mining and machine learning projects. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis.

Let me preface this post with a working definition of processing. When I refer to processing these days, it usually means data. If I’m out at sea, I’ll refer to processing samples, which means “running” samples through a machine to obtain output, or data in this case. Data analysis therefore involves two levels of processing: the first is sample processing and the second is data processing. Together, these two components result in meaningful information. There is a third type of processing in data acquisition and analysis, termed pre-processing. In my own definition, pre-processing is a way of “combing” or cleaning up a complex and often messy raw data set so that clean statistical analysis can occur. Raw environmental data can contain errors, missing data points, negative values, and so on, and when you are trying to discern specific relationships within a dataset, these issues can lead to errors in both graphical and statistical interpretation.
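As a rough sketch of what this kind of “combing” can look like, the snippet below sets aside rows whose value is missing, not a number, or negative. The `clean` function, the field names, and the sample rows are all hypothetical and are not taken from my data files; treating negative values as errors is an assumption that suits abundance counts but not every variable.

```python
import math

def clean(records, field):
    """Split records into usable rows and rejected rows for one numeric field."""
    kept, rejected = [], []
    for row in records:
        value = row.get(field)
        # Missing, non-numeric, NaN, or negative values are set aside for review
        if isinstance(value, (int, float)) and not math.isnan(value) and value >= 0:
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected

# Hypothetical raw rows with the usual problems
raw = [
    {"station": "A", "abundance": 1200.0},
    {"station": "B", "abundance": -5.0},          # negative count
    {"station": "C", "abundance": None},          # missing value
    {"station": "D", "abundance": float("nan")},  # not a number
]

kept, rejected = clean(raw, "abundance")
print(len(kept), "rows kept,", len(rejected), "rows set aside")
```

Keeping the rejected rows in their own list, rather than deleting them, makes it easy to go back and check whether a value was a real error or just an unusual sample.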

An example of pre-processing an original data file:

The need to pre-process the data arises for a number of reasons. If any of you have worked with SAS input files, that’s a good example of pre-processing: the data is combed and arranged for the final purpose of producing a SAS-compatible data file for statistical analysis. This is not quite as extreme, but just as functional. In this case, readability is one issue. I have poor eyesight and need my data to show my specific results in a clear fashion. Due to my primitive knowledge of R, I have not constructed elaborate time series data frames or tables that can quickly tabulate and spit out the numerical analysis. Thus, pre-processing in this instance serves two purposes: 1) it meets my immediate need to extract specific information or subsets of data from a larger dataset, and 2) it provides a framework for future data input for this project. Since the data has historically been maintained in the old layout, the new file is a bridge between the old method and the new analysis approach. In the image, I’ve shown the previous approach on the left, augmented with an extracted arithmetic mean, which I’ve then translated into the pre-processed file shown in the right-hand image.

What’s the difference? Why bother? Is it necessary? Well, to answer the first question, the difference is a clear layout of yearly mean values ordered by date and species, such that the information is more accessible. When I import this data set into another analysis program such as R, Matlab, or GnuPlot, it makes life easier. Why bother? If it makes your life easier, it’s always worth it. It may take a small amount of time to translate and set up a new data file, but it is worth every minute if it makes the analysis clear. If you run statistical tests on the data and have to hunt and peck through the original dataset, it’s a nightmare. If you use a pre-processed file that contains only the subset of data you wish to analyze, the results of tests between parameters become clear. (This is my biased opinion, of course!) And finally, is it necessary? Scroll down to the section in this post entitled: Dealing with real-life data sets. I even love the title!
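The layout described here, yearly mean values ordered by date and species, can be sketched in a few lines. This is only an illustration under stated assumptions: the `yearly_means` function, the species names, and the numbers are hypothetical, and the real files were laid out by hand rather than generated this way.

```python
from collections import defaultdict
from statistics import mean

def yearly_means(records):
    """Collapse (year, species, value) rows into one mean per year and species."""
    groups = defaultdict(list)
    for year, species, value in records:
        groups[(year, species)].append(value)
    # Sorting the keys orders the table by year first, then by species name
    return [(year, species, mean(values))
            for (year, species), values in sorted(groups.items())]

# Hypothetical observations, deliberately out of order
records = [
    (1985, "diatoms", 10.0),
    (1985, "diatoms", 30.0),
    (1984, "copepods", 4.0),
    (1984, "copepods", 6.0),
    (1985, "copepods", 8.0),
]

table = yearly_means(records)
for year, species, value in table:
    print(year, species, value)
```

The result is one row per year and species, which is the kind of compact subset that is easy to read by eye and easy to import into another program.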


1. “Data pre-processing” last modified 13 December 2010,
2.  Sastry, Nishanth R. in “Visualize your data with gnuplot” accessed 3 Feb 2011,

concept map (rough)

This is what I started with:


First successful script in R!

What better way to end the week than with a successful run in R.
