Archive | software

To pre-process or not to pre-process (this is the data)

A short wiki definition [1]:

Data pre-processing is an often neglected but important step in the data mining process. The phrase “Garbage In, Garbage Out” is particularly applicable to data mining and machine learning projects. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis.

Let me preface this post with a working definition of process. When I refer to processing these days, it usually means data. If I’m out at sea, I’ll refer to processing (samples), which means “running” samples through a machine to obtain output – or data, in this case. Data analysis involves two levels of processing – the initial step being sample processing and the second being data processing. Together, these two components result in meaningful information. There is a third type of processing in data acquisition and analysis termed pre-processing. By my own definition, pre-processing is a way of “combing” data, or cleaning up a complex and often messy raw data set so that clean statistical analysis can occur. Raw environmental data can contain errors, missing data points, negative values, etc., and when trying to discern specific relationships within a dataset, these issues can lead to errors in both graphical and statistical interpretation.
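
For anyone curious what this “combing” can look like in practice, here is a minimal sketch in R using a small made-up raw table; the column names and values are purely illustrative.

# Hypothetical raw data with the kinds of problems mentioned above:
# an impossible negative abundance and a missing value.
raw <- data.frame(
  year      = c(2008, 2008, 2009, 2009, 2010),
  abundance = c(12.4, -100, 8.7, NA, 15.2)
)

# Flag the out-of-range (negative) value as missing rather than letting it
# distort later means and plots.
raw$abundance[!is.na(raw$abundance) & raw$abundance < 0] <- NA

# Keep only the complete rows for analyses that cannot handle NAs.
clean <- raw[!is.na(raw$abundance), ]
clean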

An example of pre-processing an original data file:

The need to pre-process the data arises for a number of reasons. If any of you have worked with SAS input files, that’s a good example: data are combed and processed for the final purpose of producing a SAS-compatible input file for statistical analysis. My case is not quite as extreme, but it serves the same function. Readability is one issue: I have poor eyesight and need my data to show my specific results in a clear fashion. Due to my primitive knowledge of R, I have not constructed elaborate time series data frames or tables that can quickly tabulate and spit out the numerical analysis. Thus, pre-processing in this instance serves a few purposes: 1) it meets my immediate need of extracting specific information or subsets of data from a larger dataset, and 2) it provides a framework for future data input for this project. Since the data has historically been maintained in this format, it’s a bridge between the old method and the new analysis approach. In the image I’ve shown the previous approach on the left, augmented with an extracted arithmetic mean, which I’ve then translated into the pre-processed file shown on the right.

What’s the difference? Why bother? Is it necessary? Well, to answer the first question, the difference is a clear layout of yearly mean values ordered by date and species, such that the information is more accessible. When I import this data set into another analysis program such as R, Matlab, or gnuplot, it makes life easier. Why bother? If it makes your life easier, it’s always worth it. It may take a small amount of time to translate and set up a new data file, but it is worth every minute if it makes the analysis clear. If you run statistical tests on the data and have to hunt and peck through the original dataset, it’s a nightmare. If you use a pre-processed file that contains only the subset of data you wish to analyze, the results of tests between parameters become clear. (This is my biased opinion, of course!) And finally, is it necessary? Scroll down to the section entitled “Dealing with real-life data sets” in the gnuplot article [2]. I even love the title!
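
As a rough illustration of how a pre-processed file like the one on the right could be generated, here is a short R sketch that collapses a wide table into yearly means per species and writes them to a tab-delimited text file; the data frame, column names, and output file name are hypothetical.

# Hypothetical wide table: one row per sampling date, one column per species.
wide <- data.frame(
  year           = c(2009, 2009, 2010, 2010),
  Acartia.clausi = c(10, 14, 9, 11),
  Acartia.danae  = c(2, 4, 3, 5)
)

# Yearly mean per species, ordered by year.
yearly_means <- aggregate(cbind(Acartia.clausi, Acartia.danae) ~ year,
                          data = wide, FUN = mean)

# A tab-delimited file is easy to import into R, Matlab, or gnuplot later.
write.table(yearly_means, file = "mc_yearly_means.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)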

References

1. “Data pre-processing,” Wikipedia, last modified 13 December 2010, http://en.wikipedia.org/wiki/Data_Pre-processing
2. Sastry, Nishanth R. “Visualize your data with gnuplot,” IBM developerWorks, accessed 3 February 2011, http://www.ibm.com/developerworks/library/l-gnuplot/

R: read.delim2 a .txt file and calculate the mean of 5 classes

################################################################################
# MC TEST data set by H.Wright                                                 |
#                                                                              |
#                                                                              | 
################################################################################
# set working directory file path
setwd("D:\\Projects\\R_analysis\\mc_30.12.2010\\")
# preview the test marechiara dataset (this call only prints the table)
read.delim2(file = "D:\\Projects\\R_analysis\\mc_30.12.2010\\mc_test1.txt",
            na.strings = "NA",
            nrows = -1,
            skip = 0,
            check.names = TRUE,
            strip.white = FALSE,
            blank.lines.skip = TRUE)
# read the file again and assign the data table to object 'mc'
mc <- read.delim2(file = "D:\\Projects\\R_analysis\\mc_30.12.2010\\mc_test1.txt",
                  na.strings = "NA",
                  nrows = -1,
                  skip = 0,
                  check.names = TRUE,
                  strip.white = FALSE,
                  blank.lines.skip = TRUE)
# obtain the data table column variables
names(mc)
#[1] "year"    "Acartia.clausi"     "Acartia.danae"      "Acartia.discaudata" "Acartia.longiremis" "Acartia.margalefi"      
 
# define variables for the species columns in object 'mc'
col2 <- mc$Acartia.clausi
col3 <- mc$Acartia.danae
col4 <- mc$Acartia.discaudata
col5 <- mc$Acartia.longiremis
col6 <- mc$Acartia.margalefi
# take the mean of each variable
mean(col2)
mean(col3)
mean(col4)
mean(col5)
mean(col6)

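For what it’s worth, the same five means can be obtained more compactly; this is only a sketch, assuming the same mc_test1.txt layout (year in the first column, the five Acartia species in columns 2–6) and that those columns are read as numeric.

# Re-read the file and take the mean of each of the five species columns
# in one step. na.rm = TRUE makes mean() skip missing values instead of
# returning NA.
mc <- read.delim2("D:\\Projects\\R_analysis\\mc_30.12.2010\\mc_test1.txt")
sapply(mc[, 2:6], mean, na.rm = TRUE)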


literate writing

From a compelling post by Dataists: What’s the use of sharing code nobody can read?

Who’s got the time to write a whole PDF every time you want to draw a bar chart? We’ve got press deadlines, conference deadlines, and a public attention span measurable in hours. Writing detailed comments is not only time consuming, it’s often a seriously complicated affair akin to writing an academic paper. And, unless we’re actually writing an academic paper, it’s mostly a thankless task.

I agree with this statement from my current perspective as a graduate student. Documentation and comments throughout the code of a program or script (in R, for example) serve an extremely important purpose at this early stage of the game. I am learning, and I am documenting the steps involved in the data analysis as I go. Perhaps I am in danger of what these authors describe as “stream of consciousness” writing.

Dataists also suggest that ProjectTemplate for R may be an efficient way to appease both hardline coders and those who adhere to style conventions.  In my blogging prose, I confess that I employ a stream-of-consciousness approach that allows me to delve into topics without much emphasis on correct grammar or punctuation.  I apologize to those readers who have to suffer through the terrible writing! E.B. White would be appalled! Should we self-correct so strictly that it prevents us from putting our thoughts down? No, I believe (especially in brainstorming) that writing requires a certain level of nonconformity to inspire our inner muse.  However, in the business of scientific data analysis, precise terminology and adherence to standards are essential.
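
For anyone wanting to try ProjectTemplate, a minimal sketch might look like the following; the project name is just an example, and the package’s documentation describes the directory layout and conventions it sets up.

install.packages("ProjectTemplate")   # one-time install from CRAN
library(ProjectTemplate)

create.project("mc_analysis")         # scaffold a standard project structure
setwd("mc_analysis")
load.project()                        # load data/ and run the munge/ scripts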

As I start to compose scripts to process data,  I am thinking about documenting the work quite carefully.  I hope this will provide a reference for myself and others in the future.  I encourage you to read through the blog post again if you haven’t taken the hint yet so that you can begin to incorporate this approach in your projects.  The topic is worthy of further discussion.

References:

Strunk, William Jr., E.B. White, and Maira Kalman (illustrator). The Elements of Style Illustrated. 2005. ISBN 1-59420-069-6.

Parnas, David, and Paul Clements. “A Rational Design Process: How and Why to Fake It,” in Software State-of-the-Art: Selected Papers. Dorset House, 1990, pp. 353-355.

First successful script in R!

What better way to end the week than with a successful run in R.  Read More…

