Statistical analysis of large datasets
Do you want to learn how to create awesome looking charts and get a statistical understanding of your dataset using plots like the one above? Then this guide is for you!
This tutorial will walk you through how to do statistical analysis of a dataset. This dataset is a collection of data from a Computer Fan where each reading has five measurements (observable variables): Accelerometer (X, Y, Z), Tachometer, and PWM.
This is a walk through of the work flow from a raw dataset to understanding the data. The first two parts will walk through the first steps of converting the raw data to a streamed format and looking at it. Below is an overview of the steps:
If you are running in SA Studio Community Edition or do not want to wait for the processing of all histograms you can go straight to step 5 since this model comes with the statistics pre-loaded.
- Creating a meta data model for the dataset
- Converting JSON-object dataset into a stream of Vectors.
- Looking at the raw dataset - can we see some obvious patterns?
- Calculating statistics over the dataset for further analysis.
- Global stats (mean, standard deviation, min, max, count) for each measurement.
- Histograms of all variables and classes.
- Windowed statistics to compare variables.
- Making sense of the data.
- Using Parallel coordinates and Scatter Plots
- Using Scatter plot Matrices
- Comparing histograms
- Final words.
The dataset
The dataset can be downloaded here. It has already been transformed into a streaming format as section 1 in Pre processing describes.
- Accelerometer - a regular accelerometer with x, y, and z measurements.
- Tachometer measures how often the fan does a full Revolution.
- PWM - Pulse width modifier, control signal to the motor.
Classes:
- Low voltage - low voltage supply to the fan (10V istead of 12V).
- Normal.
- Object - small electrical cord stuck in fan.
- Obstructed - (cloth put in front of fan to simulate dust build up.