To pre-process or not to pre-process (this is the data)
A short wiki def 1
Data pre-processing is an often neglected but important step in the data mining process. The phrase “Garbage In, Garbage Out” is particularly applicable to data mining and machine learning projects. Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis.
Let me preface this post with a working definition of processing. When I refer to processing these days, it usually means data. If I’m out at sea, I’ll refer to processing (samples), which means “running” samples through a machine to obtain output, or data in this case. Data analysis involves two levels of processing: the first is sample processing and the second is data processing. Together, these two components result in meaningful information. There is a third type of processing in data acquisition and analysis, termed pre-processing. In my own definition, pre-processing is a way of “combing” or cleaning up a complex and often messy raw data set so that clean statistical analysis can occur. Raw environmental data can contain errors, missing data points, negative values, etc., and when trying to discern specific relationships within a dataset, these issues can lead to errors in both graphical and statistical interpretation.
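To make the “combing” idea concrete, here is a minimal Python sketch of a first cleaning pass. Everything in it is an assumption for illustration: the field names (`year`, `species`, `value`), the made-up records, and the rule that a negative measurement counts as an error. In a real data set a negative value may be perfectly valid, so the check would need to match the variable being measured.

```python
import math


def clean_records(records, value_key="value"):
    """Split records into (kept, rejected) using simple validity checks.

    Assumption for this sketch: a record is usable only if its value is a
    real number that is not NaN and not negative.
    """
    kept, rejected = [], []
    for rec in records:
        value = rec.get(value_key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and not math.isnan(value) and value >= 0:
            kept.append(rec)
        else:
            # Keep the rejects so they can be inspected instead of silently lost.
            rejected.append(rec)
    return kept, rejected


# Made-up records standing in for a messy raw data set.
raw = [
    {"year": 2001, "species": "species_a", "value": 12.5},
    {"year": 2001, "species": "species_b", "value": None},          # missing
    {"year": 2002, "species": "species_a", "value": -3.0},          # negative
    {"year": 2002, "species": "species_b", "value": float("nan")},  # not a number
    {"year": 2003, "species": "species_a", "value": 0.0},
]

kept, rejected = clean_records(raw)
print(f"kept {len(kept)} records, rejected {len(rejected)}")
```

Returning the rejected records alongside the clean ones is a deliberate choice here: it leaves a trail of what was combed out, which helps when a "bad" value later turns out to be real.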
An example of pre-processing an original data file:
The need to pre-process the data arises for a number of reasons. If any of you have worked with SAS input files, that’s a good example of pre-processing combed and analyzed data for the final purpose of inputting/sending a SAS-compatible data file for statistical analysis. This is not quite as extreme, but just as functional. In this case, readability is one issue. I have poor eyesight and need my data to show my specific results in a clear fashion. Due to my primitive knowledge of R, I have not constructed elaborate time series data frames or tables that can quickly tabulate and spit out the numerical analysis. Thus, pre-processing in this instance serves two purposes: 1) it meets my immediate need to extract specific information or subsets of data from a larger dataset, and 2) it will provide a framework for future data input for this project. Since the data is historically maintained in this method, it’s a bridge between the old method and the new analysis approach. In the image I’ve shown the previous approach on the left, augmented with an extracted arithmetic mean, which I’ve then translated into the pre-processed file shown in the right-hand image.
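For anyone who would rather script this step than do it by hand, here is a rough Python sketch of the same idea: take the raw rows, compute an arithmetic mean per year and species, and write a small, tidy file. It is not the workflow used for the files in the image. The column names (`year`, `month`, `species`, `count`) and all of the values are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def yearly_means(rows):
    """Return one record per (year, species) with the arithmetic mean count."""
    groups = defaultdict(list)
    for row in rows:
        groups[(int(row["year"]), row["species"])].append(float(row["count"]))
    # Sorting the (year, species) keys orders the output by date, then species.
    return [
        {"year": year, "species": species, "mean_count": mean(values)}
        for (year, species), values in sorted(groups.items())
    ]


# Invented raw data; a real file would be opened with open(path, newline="").
RAW_CSV = """year,month,species,count
2001,1,species_a,10
2001,6,species_a,14
2001,1,species_b,3
2002,1,species_a,8
2002,6,species_b,5
2002,9,species_b,7
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
summary = yearly_means(rows)

# Write the pre-processed table as CSV text (to memory here, to keep it self-contained).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["year", "species", "mean_count"])
writer.writeheader()
writer.writerows(summary)
print(out.getvalue())
```

The output is a plain CSV with one row per year and species, which is the kind of layout that most analysis programs can read in directly.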
What’s the difference? Why bother? Is it necessary? Well, to answer the first question, the difference is a clear layout of yearly mean values ordered by date and species, such that the information is more accessible. When I import this data set into another analysis program such as R, Matlab, or GnuPlot, it makes life easier. Why bother? If it makes your life easier, it’s always worth it. It may take a small amount of time to translate and set up a new data file, but it is worth every minute if it makes the analysis clear. If you run statistical tests on the data and have to hunt and peck through the original dataset, it’s a nightmare. If you use a pre-processed file that contains only the subset of data you wish to analyze, the results of tests between parameters become clear. (This is my biased opinion, of course!) And finally, is it necessary? Scroll down to the section in this post entitled: Dealing with real-life data sets. I even love the title!