Statistical operations allow data analysts and Python developers to get an idea of the range or dispersion of a given dataset. Variance and standard deviation are two common statistics used to measure data dispersion, both for a dataset as a whole and for individual observations relative to the mean. In this tutorial, you will learn different approaches to calculating the variance and the standard deviation in Python.
What are Variance and Standard Deviation?
Variance measures how far the values of a dataset are from the mean (average) value. It is calculated as the average of the squared deviations from the mean, so it quantifies the spread or dispersion of a series of data. The term 'spread' describes how much variation there is in the data. When the variance is high, the dataset values are far from their average. When the variance is low, the dataset values are close to the mean.
Standard deviation, on the other hand, is the square root of the variance, and it measures the extent of variation or dispersion in your dataset in the same units as the data itself. It summarizes how far the data points typically deviate from the mean. A lower standard deviation indicates that the values are closer to the mean value. A higher standard deviation indicates that the data are dispersed over a wide range.
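To make the idea of spread concrete, the short sketch below compares two made-up datasets (the names tight and wide and their values are illustrative assumptions) that share the same mean but differ in dispersion. It uses functions from the standard statistics module, which is covered later in this tutorial.

```python
import statistics

# Two hypothetical datasets with the same mean (10) but different spread
tight = [9, 10, 11]
wide = [0, 10, 20]

for name, data in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(data)
    var = statistics.pvariance(data)  # population variance
    std = statistics.pstdev(data)     # square root of the population variance
    print(f"{name}: mean={mean}, variance={var:.2f}, std dev={std:.2f}")
```

The wide dataset should report a much larger variance and standard deviation than the tight one, even though both have the same mean.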
Variance in Python:
There are different ways to calculate the variance of a data set in Python. The methods are described below.
Method 1: The general Approach:
In this method, you will use the predefined functions (sum() and len()) of Python to create a variance function that will take a series of data as input parameters. This technique does not require any external library or module to import.
def variance(val):
    numb = len(val)
    # m will have the mean value
    m = sum(val) / numb
    # Square deviations
    devi = [(x - m) ** 2 for x in val]
    # Variance
    variance = sum(devi) / numb
    return variance

print(variance([6, 6, 3, 9, 4, 3, 6, 9, 7, 8]))
Here we have created a user-defined function named variance() that takes the data set as a single parameter. First, we create a variable numb that holds the length of the data set. Next, we calculate the mean m and use the list comprehension [(x - m) ** 2 for x in val] to find the squared deviation of each value from the mean. Lastly, we calculate the variance by passing the squared deviations to the sum() function and dividing the result by numb, like this: sum(devi) / numb, and return the calculated value. Because the sum is divided by the number of values, this function returns the population variance.
Method 2: Using numpy.var() Method:
We can use the NumPy (Numerical Python) library that contains the var() method to find the variance of a data set.
Its syntax is:
numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
where the parameters are:
- a: The array (or array-like object) that holds the data whose variance is required.
- axis: The axis or axes along which the variance is computed. By default, the variance of the flattened array is computed.
- dtype: The data type to use when computing the variance.
- out: An alternative output array in which to place the result.
- ddof: The delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements. The default is 0.
- keepdims: If set to True, the axes that are reduced are left in the result as dimensions of size one.
import numpy as np

# assigning the list of elements to li
li = [6, 6, 3, 9, 4, 3, 6, 9, 7, 8]
print(np.var(li))
Here we first have to install and then import the numpy module. In the import statement, we have aliased it as 'np'. Then we have created a list named li holding a set of values. Lastly, we have called np.var(), which calculates the variance of the given data set, and the print() function prints its value.
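By default, np.var() divides by the number of elements, so it returns the population variance. The sketch below, which assumes NumPy is installed, shows how the ddof and axis parameters change the result. The 2-D array grid is a made-up example and is not part of the original tutorial data.

```python
import numpy as np

li = [6, 6, 3, 9, 4, 3, 6, 9, 7, 8]

pop_var = np.var(li)             # divides by N (population variance)
sample_var = np.var(li, ddof=1)  # divides by N - 1 (sample variance)
print(pop_var, sample_var)

# Hypothetical 2-D data: each row is one series of observations
grid = np.array([[1, 2, 3],
                 [2, 4, 6]])
print(np.var(grid, axis=0))  # variance of each column
print(np.var(grid, axis=1))  # variance of each row
```

Passing ddof=1 is the usual choice when the data is a sample drawn from a larger population rather than the whole population.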
Method 3: Using the Statistics Module:
statistics is a module in the Python standard library that contains various functions for basic statistical operations on data. It has two functions for variance: statistics.pvariance() and statistics.variance(), which calculate the variance of a population and of a sample respectively.
import statistics

print(statistics.pvariance([6, 6, 3, 9, 4, 3, 6, 9, 7, 8]))
print(statistics.variance([6, 6, 3, 9, 4, 3, 6, 9, 7, 8]))
In this program, we have imported the statistics module. Then, we have called the statistics.pvariance() and statistics.variance() functions, passing the data to each as a list, and printed the results.
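The two functions differ only in the divisor: the population variance divides the sum of squared deviations by n, while the sample variance divides it by n - 1. The following minimal sketch checks that relationship on the same data; the variable names are illustrative.

```python
import math
import statistics

data = [6, 6, 3, 9, 4, 3, 6, 9, 7, 8]
n = len(data)

pop_var = statistics.pvariance(data)    # sum of squared deviations / n
sample_var = statistics.variance(data)  # sum of squared deviations / (n - 1)

# Rescaling by n / (n - 1) converts the population value to the sample value
print(math.isclose(sample_var, pop_var * n / (n - 1)))
```

Because n - 1 is smaller than n, the sample variance is always slightly larger than the population variance for the same data.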
Standard Deviation in Python:
There are different ways to find the standard deviation of a set of data in Python. The methods are described below.
Method 1: Using the Math Module:
In this method, you will use the built-in functions sum() and len() to create a variance function and then take the square root of the variance (using the math.sqrt() function) to get the standard deviation.
import math

# Finding the variance is essential before calculating the standard deviation
def varinc(val, ddof=0):
    n = len(val)
    m = sum(val) / n
    return sum((x - m) ** 2 for x in val) / (n - ddof)

# finding the standard deviation
def stddev(val):
    vari = varinc(val)
    stdev = math.sqrt(vari)
    return stdev

print(stddev([5, 9, 6, 2, 6, 3, 7, 4, 8, 6]))
Here we have imported the math module. Then we create a user-defined function named varinc(). This function takes two parameters: the data and the delta degrees of freedom (ddof) value, which defaults to 0. We then calculate the variance using the formula sum((x - m) ** 2 for x in val) / (n - ddof). Next, we create another user-defined function named stddev(). This function takes only one parameter, the data set whose standard deviation needs to be calculated, and returns the square root of the value returned by varinc(). Since stddev() calls varinc() with the default ddof of 0, it returns the population standard deviation. Finally, we print the calculated standard deviation by calling print(stddev(...)) with the data.
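The ddof parameter of varinc() also makes it possible to compute the sample standard deviation. The sketch below adds a hypothetical helper named stddev_ddof() (not part of the original example) that passes ddof through, and cross-checks both results against the statistics module.

```python
import math
import statistics

def varinc(val, ddof=0):
    n = len(val)
    m = sum(val) / n
    return sum((x - m) ** 2 for x in val) / (n - ddof)

def stddev_ddof(val, ddof=0):
    # Hypothetical variant of stddev() that forwards ddof to varinc()
    return math.sqrt(varinc(val, ddof))

data = [5, 9, 6, 2, 6, 3, 7, 4, 8, 6]
pop_std = stddev_ddof(data)             # divides by n
sample_std = stddev_ddof(data, ddof=1)  # divides by n - 1
print(pop_std, sample_std)

# Cross-check against the standard library
print(math.isclose(pop_std, statistics.pstdev(data)))
print(math.isclose(sample_std, statistics.stdev(data)))
```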
Method 2: Using the Statistics Module:
The statistics module of Python also provides two functions to calculate the standard deviation. pstdev() returns the standard deviation of an entire population, while stdev() returns the standard deviation of a sample.
import statistics

populated = statistics.pstdev([5, 9, 6, 2, 6, 3, 7, 4, 8, 6])
sample = statistics.stdev([5, 9, 6, 2, 6, 3, 7, 4, 8, 6])
print(populated)
print(sample)
Here we first import the statistics module. Then, we create a variable named populated that holds the value returned by statistics.pstdev(). Next, the variable sample holds the value returned by statistics.stdev(). Both functions receive the data as a list. Finally, we print both variables containing the calculated values.
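Since the standard deviation is the square root of the variance, each standard deviation function should agree with the square root of its variance counterpart. The short sketch below checks this on the same data, using math.isclose() to allow for floating-point rounding.

```python
import math
import statistics

data = [5, 9, 6, 2, 6, 3, 7, 4, 8, 6]

# Population: pstdev() versus the square root of pvariance()
pop_match = math.isclose(statistics.pstdev(data),
                         math.sqrt(statistics.pvariance(data)))

# Sample: stdev() versus the square root of variance()
sample_match = math.isclose(statistics.stdev(data),
                            math.sqrt(statistics.variance(data)))

print(pop_match, sample_match)
```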
All of these methods are worth remembering. The approach based on the built-in len() and sum() functions needs no imports beyond math.sqrt() and works well for small datasets. If you need a wider range of statistical functions, the statistics module is the more convenient choice. NumPy is best suited to large numeric arrays, so if you do not need it for anything else in your project, it is simpler to use one of the other techniques to find the variance.