Data Science Interview Questions and Answers
Are you preparing for a data science interview? The following list of top-rated data science interview questions and answers is useful for both freshers and experienced candidates. Work through the questions below to prepare for any data science interview you face.
Q1. What is inferential statistics?
Inferential statistics draws conclusions about a larger population from a sample of data, using probability theory.
Q2. What is the mean value in statistics?
Mean is the average value of the data set.
Q3. What is Mode value in statistics?
The most frequently occurring value in the data set.
Q4. What is the median value in statistics?
The middle value of the data set when the values are sorted.
Q5. What is the Variance in statistics?
Variance measures how far each number in the set is from the mean.
Q6. What is Standard Deviation in statistics?
It is the square root of the variance
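The five measures from Q2–Q6 can be sketched with Python's built-in statistics module (the data set here is made up for illustration):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data set

mean = statistics.mean(data)           # average value (Q2)
mode = statistics.mode(data)           # most repeated value (Q3)
median = statistics.median(data)       # middle value of the sorted data (Q4)
variance = statistics.pvariance(data)  # average squared distance from the mean (Q5)
std_dev = statistics.pstdev(data)      # square root of the variance (Q6)

print(mean, mode, median, variance, std_dev)
```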
Q7. How many types of variables are there in statistics?
1. Categorical variable
2. Confounding variable
3. Continuous variable
4. Control variable
5. Dependent variable
6. Discrete variable
7. Independent variable
8. Nominal variable
9. Ordinal variable
10. Qualitative variable
11. Quantitative variable
12. Random variables
13. Ratio variables
14. Ranked variable
Q8. How many types of distributions are there?
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution
Q9. What is normal distribution?
A) It is a bell-shaped curve distribution in which the mean, median, and mode are all equal. Many quantities in statistics are approximately normally distributed.
Q10. What is the standard normal distribution?
If the mean is 0 and the standard deviation is 1, the distribution is called the standard normal distribution.
Q11. What is Binomial Distribution?
A distribution with a fixed number of independent trials, where only two outcomes are possible (such as success or failure) and the probability of success is the same for every trial, is called a Binomial Distribution.
Q12. What is the Bernoulli distribution?
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial.
Q13. What is the Poisson distribution?
A distribution is called a Poisson distribution when the following assumptions are true:
1. The occurrence of one event does not influence the occurrence of another event (events are independent).
2. The probability of an event in an interval is proportional to the length of the interval.
3. The probability of more than one event in a very small interval approaches zero as the interval becomes smaller.
Q14. What is the central limit theorem?
a) The mean of the sample means is close to the mean of the population.
b) The standard deviation of the sampling distribution equals the population standard deviation divided by the square root of the sample size N; it is also known as the standard error of the mean.
c) Even if the population is not normally distributed, when the sample size is greater than 30 the sampling distribution of the sample means approximates a normal distribution.
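The statements above can be checked with a small simulation; this sketch draws repeated samples from a uniform (non-normal) population using only the standard library:

```python
import random
import statistics

random.seed(0)
n = 50    # sample size, greater than 30

# uniform(0, 1) has mean 0.5 and standard deviation ≈ 0.2887;
# draw 2000 samples and record each sample's mean
sample_means = [
    statistics.mean(random.random() for _ in range(n))
    for _ in range(2000)
]

# (a) the mean of the sample means is close to the population mean
print(statistics.mean(sample_means))   # approximately 0.5
# (b) their spread is close to the standard error 0.2887 / sqrt(50) ≈ 0.041
print(statistics.stdev(sample_means))
```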
Q15. What is P-Value, How it’s useful?
The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained in the test.
If the p-value is less than or equal to 0.05 (p <= 0.05), it indicates strong evidence against the null hypothesis, so you can reject the null hypothesis.
If the p-value is greater than 0.05 (p > 0.05), it indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
Q16. What is Z value or Z score (Standard Score), How it’s useful?
A Z score indicates how many standard deviations an element is from the mean. It is also called the standard score.
Z score Formula:
z = (X – μ) / σ
It is useful in statistical testing.
In a normal distribution, about 99.7% of Z values fall between -3 and 3.
It is useful for finding outliers in large data sets.
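A sketch of using z-scores to flag outliers, using the |z| > 3 cutoff mentioned above (the data values are made up):

```python
import statistics

# mostly values near 10-13, plus one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
        12, 10, 11, 12, 13, 11, 12, 10, 13, 95]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)

# z = (X - mu) / sigma for each element
z_scores = [(x - mu) / sigma for x in data]
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]
print(outliers)   # [95]
```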
Q17. What is T-Score, What is the use of it?
It is the ratio of the difference between two groups to the variability within the groups. The larger the t-score, the more difference there is between the groups; a smaller t-score means more similarity between the groups.
We use the t-score when the sample size is less than 30 or the population standard deviation is unknown. It is used in statistical testing.
Q18. What is IQR ( Interquartile Range ) and Usage?
It is the difference between the 75th and 25th percentiles, i.e. between the upper and lower quartiles.
It is also called the midspread or middle 50%.
It is mainly used to find outliers in data: observations that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.
Formula: IQR = Q3 − Q1
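The IQR outlier rule above, sketched with numpy (illustrative data):

```python
import numpy as np

data = np.array([3, 5, 4, 4, 6, 5, 7, 6, 5, 40])   # 40 is a suspect value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                     # IQR = Q3 - Q1

lower = q1 - 1.5 * iqr            # Q1 - 1.5 * IQR
upper = q3 + 1.5 * iqr            # Q3 + 1.5 * IQR
outliers = data[(data < lower) | (data > upper)]
print(outliers)                   # [40]
```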
Q19. What is Hypothesis Testing?
Hypothesis testing is a statistical method used to make decisions about a population using experimental (sample) data. A hypothesis is basically an assumption that we make about a population parameter.
How many types of hypotheses are there?
Two: the null hypothesis and the alternative hypothesis.
Q20. What is a Type 1 Error?
FP – False Positive ( In statistics it is the rejection of a true null hypothesis)
Q21. What is a Type 2 Error?
FN – False Negative ( In statistics it is failing to reject a false null hypothesis)
Q22. What is Univariate, Bivariate, Multivariate Analysis ?
Univariate means a single variable – analysis of a single variable.
Bivariate means two variables – analysis of two variables to find the relationship between them.
Multivariate means multiple variables – analysis of more than two variables.
Q23. Explain the difference between Type I error & Type II error.
Ans. Type I and type II errors are part of the process of hypothesis testing.
Type I errors happen when we reject a true null hypothesis.
Type II errors happen when we fail to reject a false null hypothesis.
Q24. What is Accuracy?
Ans. Accuracy is a metric used to examine how good a machine learning model is. It is computed from the confusion matrix:
Accuracy is the ratio of correctly predicted classes to the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
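A minimal sketch with assumed confusion-matrix counts:

```python
# assumed counts from a 2x2 confusion matrix
tp, fn = 50, 10   # actual positives predicted positive / negative
fp, tn = 5, 35    # actual negatives predicted positive / negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)   # 85 correct out of 100 -> 0.85
```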
Q25. What is Z-test?
Ans. A z-test determines whether a sample mean differs significantly from a known population mean, measured in standard errors. For example:
Principal at a certain school claims that the students in his school are above average intelligence. A random sample of thirty students has a mean IQ score of 112. The mean population IQ is 100 with a standard deviation of 15. Is there sufficient evidence to support the principal’s claim?
So we can make use of a z-test to test the claims made by the principal. Steps to perform z-test:
Stating the null hypothesis and alternative hypothesis.
State the alpha level. If you don’t have an alpha level, use 5% (0.05).
Find the rejection region area (given by your alpha level above) from the z-table. An area of .05 is equal to a z-score of 1.645.
Find the test statistic using this formula:
z = (x̄ – μ) / (σ / √n)
where:
x̄ is the sample mean
σ is the population standard deviation
n is the sample size
μ is the population mean
If the test statistic is greater than the z-score of the rejection area, reject the null hypothesis. If it’s less than that z-score, you cannot reject the null hypothesis.
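The principal example above, worked through as a sketch (one-tailed test at alpha = 0.05):

```python
import math

x_bar = 112   # sample mean IQ
mu = 100      # population mean
sigma = 15    # population standard deviation
n = 30        # sample size

# test statistic: z = (x_bar - mu) / (sigma / sqrt(n))
z = (x_bar - mu) / (sigma / math.sqrt(n))
print(round(z, 2))        # 4.38

# 4.38 > 1.645 (the z-score for alpha = 0.05), so reject the null hypothesis
reject_null = z > 1.645
print(reject_null)        # True: the principal's claim is supported
```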
Q26. What is Ordinal Variable?
Ans. Ordinal variables are variables that take discrete values with a meaningful order involved, for example ratings such as low, medium, and high.
Q27. What is Continuous Variable?
Ans. Continuous variables are those variables that can have an infinite number of values but only in a specific range. For example, height is a continuous variable.
Q28. What is the Correlation?
Ans. Correlation is the ratio of the covariance of two variables to the product of their standard deviations. It takes a value between -1 and +1. An extreme value on either side means the variables are strongly correlated with each other. A value of zero indicates no linear correlation, but not necessarily independence.
The most widely used correlation coefficient is the Pearson coefficient:
r = cov(X, Y) / (σX · σY)
Q29. What is Covariance?
Ans. Covariance is a measure of the joint variability of two random variables. It is similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together. The formula for sample covariance is:
cov(x, y) = Σ (xᵢ – x̄)(yᵢ – ȳ) / (n – 1)
where:
x = the independent variable
y = the dependent variable
n = the number of data points in the sample
x̄ = the mean of the independent variable x
ȳ = the mean of the dependent variable y
A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related
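Covariance and correlation can be sketched with numpy (the data are made up; y is an exact multiple of x):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])    # y = 2x, a perfect positive relationship

cov = np.cov(x, y)[0, 1]          # sample covariance (divides by n - 1)
corr = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient

print(cov)    # 5.0
print(corr)   # approximately 1.0
```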
Q30. What is Multivariate Analysis?
Ans. Multivariate analysis is the process of comparing and analyzing the dependency of more than two variables on each other.
For example, we can analyze a combination of several continuous features and find the relationships among them.
Q31. What is Multivariate Regression?
Ans. Multivariate, as the word suggests, refers to ‘multiple dependent variables’. A regression model designed to deal with multiple dependent variables is called a multivariate regression model.
Consider the example – for a given set of details about a student’s interests, previous subject-wise score, etc, you want to predict the GPA for all the semesters (GPA1, GPA2, …. ). This problem statement can be addressed using multivariate regression since we have more than one dependent variable.
Q32. What is the Frequentist Statistics?
Ans. Frequentist Statistics tests whether an event (hypothesis) occurs or not. It calculates the probability of an event in the long run of the experiment (i.e the experiment is repeated under the same conditions to obtain the outcome).
Here, the sampling distributions of fixed size are taken. Then, the experiment is theoretically repeated an infinite number of times but practically done with a stopping intention. For example, I perform an experiment with a stopping intention in mind that I will stop the experiment when it is repeated 1000 times or I see a minimum of 300 heads in a coin toss.
Q33. What is Descriptive Statistics?
Ans. Descriptive statistics are comprised of those values which explain the spread and central tendency of data. For example, mean is a way to represent the central tendency of the data, whereas IQR is a way to represent the spread of the data.
Q34. What is the Dependent Variable?
Ans. A dependent variable is what you measure and which is affected by the independent/input variable(s). It is called dependent because it “depends” on the independent variable. For example, let’s say we want to predict the smoking habits of people. Then the person smokes “yes” or “no” is the dependent variable.
Q35. What is the Confusion Matrix?
Ans. A confusion matrix is a table that is often used to describe the performance of a classification model. It is an N × N matrix, where N is the number of classes, formed between the model's predicted classes and the actual classes. The cell of actual positives predicted as negative holds the False Negatives (type II error), whereas the cell of actual negatives predicted as positive holds the False Positives (type I error).
Q36. What is Convex Function?
Ans. A real value function is called convex if the line segment between any two points on the graph of the function lies above or on the graph.
Convex functions play an important role in many areas of mathematics. They are especially important in the study of optimization problems where they are distinguished by a number of convenient properties.
Q37. What is the Cost Function?
Ans. The cost function is used to define and measure the error of the model. A common choice for regression is the squared-error cost:
J = (1 / 2m) Σ (h(xᵢ) – yᵢ)²
where:
h(x) is the prediction
y is the actual value
m is the number of rows in the training set
Let us understand it with an example:
So let's say you increase the size of a particular shop, predicting that its sales will be higher. But despite the larger size, sales do not increase much, so the cost of expanding the shop gives a poor return. The cost function quantifies this kind of error, and training the model means minimizing it.
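The squared-error cost described above can be sketched as (the predictions and actual values are made up):

```python
import numpy as np

def cost(h_x, y):
    """J = (1 / 2m) * sum((h(x) - y)^2): squared-error cost."""
    m = len(y)                        # number of rows in the training set
    return np.sum((h_x - y) ** 2) / (2 * m)

h_x = np.array([3.0, 5.0, 7.0])       # predictions
y = np.array([2.0, 5.0, 9.0])         # actual values

print(cost(h_x, y))                   # (1 + 0 + 4) / 6 ≈ 0.8333
```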
Q38. What is Cross-Entropy?
Ans. In information theory, the cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme optimized for the "unnatural" distribution q is used rather than the "true" distribution p: H(p, q) = −Σ p(x) log q(x). Cross-entropy is commonly used to define the loss function in machine learning and optimization.
Q39. What is Cross-Validation?
Ans. Cross-Validation is a technique that involves reserving a particular sample of a dataset that is not used to train the model. Later, the model is tested on this sample to evaluate the performance. There are various methods of performing cross-validation such as:
1. Leave one out cross-validation (LOOCV)
2. k-fold cross-validation
3. Stratified k-fold cross-validation
4. Adversarial validation
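A minimal k-fold cross-validation sketch using only numpy, with a least-squares fit standing in for the model (the data and model are illustrative; libraries such as scikit-learn provide this out of the box):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, size=100)

k = 5
indices = rng.permutation(len(y))
folds = np.array_split(indices, k)    # k disjoint held-out samples

errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # train on the other k-1 folds only
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # evaluate on the reserved fold
    errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))

print(np.mean(errors))   # average held-out mean squared error
```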
Q40. What is Data Mining?
Ans. Data mining is the study of extracting useful information from structured/unstructured data taken from various sources. This is usually done for:
Mining for frequent patterns
Mining for associations
Mining for correlations
Mining for clusters
Mining for predictive analysis
Data Mining is done for purposes like Market Analysis, determining customer purchase patterns, financial planning, fraud detection, etc
Q41. What is Data Science?
Ans. Data science is a combination of data analysis, algorithmic development, and technology in order to solve analytical problems. The main goal is the use of data to generate business value.
Q42. What is Data Transformation?
Ans. Data transformation is the process of converting data from one form to another. This is usually done as a preprocessing step.
For instance, replacing a variable x by the square root of x.
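The square-root example above as a one-line transform (illustrative values):

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])
x_transformed = np.sqrt(x)     # replace x by the square root of x
print(x_transformed)           # [1. 2. 3. 4.]
```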
Q43. What is a DataFrame?
Ans. DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. DataFrame accepts many different kinds of input:
1. Dict of 1D ndarrays, lists, dicts, or Series
2. 2-D numpy.ndarray
3. Structured or record ndarray
4. A Series
5. Another DataFrame
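A sketch of constructing a DataFrame from a few of the input types above (column names and values are made up):

```python
import numpy as np
import pandas as pd

# from a dict of lists
df1 = pd.DataFrame({"name": ["a", "b"], "score": [90, 85]})

# from a 2-D numpy.ndarray
df2 = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# from a Series
df3 = pd.DataFrame(pd.Series([10, 20], name="values"))

print(df1.shape, df2.shape, df3.shape)   # (2, 2) (2, 2) (2, 1)
```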
Q44. What is Dataset?
Ans. A dataset (or data set) is a collection of data. A dataset is organized into some type of data structure. In a database, for example, a dataset might contain a collection of business data (names, salaries, contact information, sales figures, and so forth). Several characteristics define a dataset’s structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.
Q45. What is Decision Boundary?
Ans. In a statistical-classification problem with two or more classes, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two or more sets, one for each class. How well the classifier works depends upon how closely the input patterns to be classified resemble the decision boundary.
In a plotted example, the lines separating each class are the decision boundaries.
Q46. What is a Decision Tree?
Ans. The decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input & output variables. In this technique, we split the population (or sample) into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator in input variables.
Q47. What is Dimensionality Reduction?
Ans. Dimensionality Reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. Dimension Reduction refers to the process of converting a set of data having vast dimensions into data with lesser dimensions ensuring that it conveys similar information concisely. Some of the benefits of dimensionality reduction:
It helps in compressing data and reduces the required storage space
It reduces the time required for performing the same computations
It takes care of multicollinearity, which improves model performance, and removes redundant features
Reducing the dimensions of data to 2D or 3D allows us to plot and visualize it
It also helps with noise removal, which can improve the performance of models
Q48. What is Dummy Variable?
Ans. A dummy variable is another name for a Boolean (indicator) variable. It takes the value 0 or 1 to indicate whether a condition holds: for example, 1 if age < 25 and 0 if age >= 25.
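A sketch of creating such a dummy variable with pandas (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 30, 19, 41]})
df["under_25"] = (df["age"] < 25).astype(int)   # 1 if age < 25, else 0
print(list(df["under_25"]))                     # [1, 0, 1, 0]
```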
Q49. What is Deep Learning?
Ans. Deep Learning is associated with machine learning algorithms (Artificial Neural Networks, ANN) that use the concept of the human brain to facilitate the modeling of arbitrary functions. ANNs require a vast amount of data, and these algorithms are highly flexible when it comes to modeling multiple outputs simultaneously.
Q50. What is Early Stopping?
Ans. Early stopping is a technique for avoiding overfitting when training a machine learning model with iterative methods. We set the early stopping in such a way that when the performance has stopped improving on the held-out validation set, the model training stops.
For example, in XGBoost, as you train more and more trees, you will overfit your training dataset. Early stopping enables you to specify a validation dataset and the number of iterations after which the algorithm should stop if the score on your validation dataset didn’t increase.
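The idea can be sketched as a generic patience loop, independent of any particular library (the validation scores below are made up):

```python
# validation score per training iteration (illustrative)
val_scores = [0.60, 0.68, 0.72, 0.74, 0.74, 0.73, 0.73, 0.72]

patience = 2                      # stop after this many non-improving rounds
best_score = float("-inf")
rounds_without_improvement = 0
stopped_at = len(val_scores)

for i, score in enumerate(val_scores):
    if score > best_score:
        best_score = score
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement >= patience:
            stopped_at = i + 1    # stop training here
            break

print(stopped_at, best_score)     # 6 0.74
```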