How to get summary/str/boxplot/scatterplot-style information for a dataset in AzureML

Govind Kanshi
Aug 5, 2014

We will use the fires dataset to see how to explore the distribution of the data and its unique and missing values.

By default, the “visualize” option on an AzureML dataset (reached by clicking on the output port) gives a general idea of the dataset. It shows the distribution of each feature across 10 bins in a simple histogram, along with the unique-value count, missing-value count, min/max/sd/mean/median and the data type of the feature.

Simple data analysis is available as “visualize” off the dataset’s output port.

max/min/median/mean/sd/unique values/missing values/Data Type of the feature information
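
For readers who want to check these numbers outside the tool, here is a minimal sketch of the same per-feature summary in plain Python. The `summarize` helper and the `temp` values are made up for illustration (they are not taken from the fires dataset), and the sample standard deviation is an assumption; AzureML may use a different definition.

```python
import statistics

def summarize(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "sd": statistics.stdev(present),      # sample standard deviation (assumption)
        "unique": len(set(present)),          # distinct non-missing values
        "missing": len(values) - len(present),
    }

# Toy values for illustration only (not from the fires dataset).
temp = [8.2, 18.0, 14.6, None, 11.4, 18.0]
summary = summarize(temp)
for name, value in summary.items():
    print(f"{name}: {value}")
```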

The view above also gives an at-a-glance picture of each feature’s distribution as a histogram. As you would expect, clicking on “view as” switches it to a box plot.

Boxplot of the data
Scatter Plot comparing two items
Multiboxplot comparing two features

MultiBoxPlot is unique in that the tool picks the plot type automatically. We have a couple of categorical features, so let us do a multi-box plot.
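
A multi-box plot is essentially one box plot per category. As a rough sketch of the numbers behind it, the code below groups a numeric feature by a categorical one and computes a five-number summary per group. The helper names and the toy data are hypothetical, and the quartile method ("inclusive") is an assumption; the tool may compute quartiles and whiskers differently.

```python
import statistics
from collections import defaultdict

def box_stats(values):
    """Five-number summary for one group (needs at least two values)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

def grouped_box_stats(categories, values):
    """One five-number summary per category, as in a multi-box plot."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    return {category: box_stats(vals) for category, vals in groups.items()}

# Toy data for illustration only (not from the fires dataset).
month = ["aug", "aug", "aug", "aug", "aug", "sep", "sep", "sep"]
area = [1, 2, 3, 4, 5, 10, 20, 30]
stats = grouped_box_stats(month, area)
for category, summary in stats.items():
    print(category, summary)
```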

Taking it further, there is also Cross Tab.

CrossTab — two categorical data points
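
A cross tab simply counts how often each pair of categories occurs together. A minimal sketch, with a hypothetical `cross_tab` helper and made-up month/day values:

```python
from collections import Counter

def cross_tab(rows, cols):
    """Count co-occurrences of two categorical features."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    # Counter returns 0 for pairs that never occur together.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

# Toy data for illustration only (not from the fires dataset).
month = ["aug", "aug", "sep", "sep", "sep"]
day = ["fri", "sat", "fri", "fri", "sun"]
table = cross_tab(month, day)
for row_label, row in table.items():
    print(row_label, row)
```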

Histograms can also show the cumulative distribution and the probability density. All plots allow scaling of features, and histograms let you change the number of bins to get more granular information.

Histogram with probability density
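
To make the relationship between bin counts, density and cumulative distribution concrete, here is a small sketch using equal-width bins. The `histogram` helper and the values are made up, and the density here is the simple normalized bin height; the curve AzureML draws may well be computed differently.

```python
def histogram(values, bins=10):
    """Equal-width histogram with density and cumulative fraction per bin.

    Assumes max(values) > min(values) so that the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    n = len(values)
    density = [c / (n * width) for c in counts]  # bin areas sum to 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running / n)
    return counts, density, cumulative

# Toy values for illustration only; fewer bins give a coarser picture.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10]
counts, density, cumulative = histogram(values, bins=5)
print(counts)
print(density)
print(cumulative)
```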

The visualization view can also take snapshots of a plot and remove them later.

Snapshots on the right of the MultiBoxPlot are persistent

More detailed descriptive analysis is available from the “Descriptive Statistics” module and the “Correlation” module.

More descriptive analysis, including skewness, percentile distribution and range

Negative skewness usually implies that the mean of the data values is less than the median; positive skewness implies the reverse.

In the detailed analysis, kurtosis is a measure of the peakedness of the data distribution. Negative kurtosis indicates a flat distribution, also known as platykurtic. Positive kurtosis indicates a peaked distribution, also known as leptokurtic. The normal distribution has zero kurtosis (measured as excess kurtosis, that is, kurtosis minus 3) and is said to be mesokurtic.
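
Skewness and kurtosis as described above can be sketched from the central moments of the data. This uses the plain population-moment formulas and the excess-kurtosis convention; the function name and the toy values are made up, and the module may apply a bias correction that this sketch does not.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis

# Toy values with a long left tail: the mean sits below the median.
left_tailed = [1, 8, 9, 10, 10]
skew, kurt = skewness_and_excess_kurtosis(left_tailed)
print(skew, statistics.fmean(left_tailed), statistics.median(left_tailed))

# Evenly spread toy values: symmetric and flatter than a normal curve.
flat = [1, 2, 3, 4, 5]
flat_skew, flat_kurt = skewness_and_excess_kurtosis(flat)
print(flat_skew, flat_kurt)
```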

Correlation of data

Why is NaN present for Month/Day? Because they are categorical data points, and a correlation coefficient is only defined for numeric values.
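
The sketch below computes a Pearson correlation by hand and returns NaN when a column is not numeric, which mimics the result seen for Month/Day. It is only an illustration of the idea, not AzureML's implementation; the `pearson` helper and the toy columns are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; NaN if a column is non-numeric or constant."""
    if not all(isinstance(v, (int, float)) for v in list(xs) + list(ys)):
        return math.nan  # categorical values have no numeric distance
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return math.nan  # a constant column has no defined correlation
    return cov / (spread_x * spread_y)

# Toy columns for illustration only (not from the fires dataset).
temp = [10, 15, 20, 25]
humidity = [60, 50, 40, 30]
month = ["aug", "aug", "sep", "sep"]

numeric_r = pearson(temp, humidity)
categorical_r = pearson(month, temp)
print(numeric_r, categorical_r)
```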

Here is how the modules are connected to get the detailed analysis and the correlation. The first data analysis is available off the main dataset itself as “visualize”. The rest are also available by clicking “visualize” on the respective module’s output port.

Modules for data exploration
