Statistics is one of the most important subjects for a data scientist. It provides methods that help solve complex real-world problems, and it is almost everywhere: data scientists and analysts use it to spot meaningful trends in the world and to derive meaningful insight from data.
Statistics offers a variety of functions, principles, and algorithms that help analyze raw data, build a statistical model, and infer or predict results.
Math and Statistics for Data Science are essential because these disciplines form the foundation of all Machine Learning algorithms. In fact, mathematics is behind everything around us, from shapes, patterns, and colors to the count of petals in a flower; it is embedded in each and every aspect of our lives.
Although a good understanding of programming languages, Machine Learning algorithms, and a data-driven approach is necessary to become a Data Scientist, Data Science isn't only about these fields. In this blog post, you will understand the importance of Math and Statistics for Data Science and how they can be used to build Machine Learning models.
Numerical: Numerical data is expressed with digits and is measurable. It comes in two major types: discrete (countable values) and continuous (values on a scale).
Categorical: Categorical data is qualitative and is classified into categories. It comes in two major types: nominal (no inherent order) and ordinal (ordered categories).
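The four data types above can be sketched in plain Python; the variable names and sample values here are illustrative, not from the post:

```python
discrete = [2, 0, 3, 1]                 # numerical, discrete: counts
continuous = [1.72, 1.80, 1.65]         # numerical, continuous: measurements
nominal = ["red", "blue", "green"]      # categorical, nominal: no natural order
ordinal = ["medium", "low", "high"]     # categorical, ordinal: ordered categories

# Numerical data supports arithmetic, e.g. an average of discrete counts:
print(sum(discrete) / len(discrete))

# Ordinal data supports ordering once ranks are assigned to the categories:
ranks = {"low": 0, "medium": 1, "high": 2}
print(sorted(ordinal, key=ranks.get))
```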
Z-score: The z-score gives the number of standard deviations a data point lies from the mean: z = (x − μ) / σ.
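As a minimal sketch, the z-score of a point can be computed with the standard library; the sample data is made up for illustration:

```python
from statistics import mean, pstdev

data = [10, 12, 14, 16, 18]  # illustrative sample
x = 16                       # the data point of interest

# z = (x - mean) / standard deviation (population std. dev. here)
z = (x - mean(data)) / pstdev(data)
print(round(z, 3))  # x lies about 0.707 standard deviations above the mean
```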
R-Squared: R-squared (R²) is a statistical measure of fit. It indicates how much of the variation in a dependent variable is explained by the independent variable(s). The unadjusted version is best suited to simple linear regression with a single predictor.
Adjusted R-squared: This is a modified version of R-squared that has been adjusted for the number of predictors in the model. It increases only when a new term improves the model more than would be expected by chance, and decreases when it does not.
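Both measures can be sketched directly from their definitions. The observations and fitted values below are hypothetical, standing in for the output of some regression fit:

```python
n = 5   # number of observations
k = 1   # number of predictors
y = [2.0, 4.1, 6.0, 7.9, 10.2]        # observed values (illustrative)
y_hat = [2.0, 4.05, 6.1, 8.15, 10.2]  # hypothetical fitted values

y_bar = sum(y) / n
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalizes extra predictors
print(r2, adj_r2)
```

Note that adjusted R² is never larger than R²; the gap widens as more predictors are added without improving the fit.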
Measurements of Relationships between Variables
Covariance: Covariance measures how two variables vary together. If it is positive, they tend to move in the same direction; if it is negative, they tend to move in opposite directions; and if it is zero, there is no linear relationship between them.
Correlation: Correlation measures the strength of the relationship between two variables. It is the normalized version of covariance and ranges from -1 to 1. Most of the time a correlation of about ±0.7 or beyond represents a strong relationship between two variables, while a correlation between -0.3 and 0.3 indicates a weak or negligible relationship.
Bayes’ theorem is one of the best-known formulas in probability. It is used to determine a conditional probability: the probability of A given B is equal to the probability of B given A times the probability of A, divided by the probability of B.
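The rule reads P(A|B) = P(B|A) · P(A) / P(B). As a sketch, here it is applied to a made-up diagnostic-test example (the probabilities are assumptions, not from the post):

```python
# A = has the condition, B = tests positive
p_a = 0.01              # P(A): prior prevalence of the condition
p_b_given_a = 0.95      # P(B|A): test sensitivity

# P(B) via the law of total probability, assuming a 5% false-positive rate
p_b = p_b_given_a * p_a + 0.05 * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

Even with an accurate test, the posterior probability stays modest here because the prior P(A) is so small.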