Top 50 Machine Learning Interview Questions and Answers

Q1. You are given a training data set with 1000 columns and 1 million rows. The data set is for a classification problem. Your manager has asked you to reduce the dimensionality of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do?

Answer: Processing high-dimensional data on a machine with limited memory is a strenuous task, and your interviewer will be fully aware of that. The following are the methods you can use to tackle such a situation: Since we have low RAM, we should close all other applications on the machine, including the web browser, so that most of the memory can be put to use. We can randomly sample the data set. This means we can create a smaller data set, let’s say, having 1000 variables and 300000 rows, and do the computations on it. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we’ll use correlation. For categorical variables, we’ll use the chi-square test. Also, we can use PCA and pick the components which explain the maximum variance in the data set. Using an online learning tool like Vowpal Wabbit (which has Python bindings) is a possible option. Building a linear model using Stochastic Gradient Descent is also helpful. We can also apply our business understanding to estimate which predictors can impact the response variable. But this is an intuitive approach, and failing to identify useful predictors might result in a significant loss of information.
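As a rough illustration of two of these steps (random sampling, then removing correlated numerical variables), here is a minimal sketch. It assumes NumPy is available; the function name drop_correlated, the 0.9 threshold and the synthetic data are made up for this example and are not a prescribed recipe.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return indices of columns to keep, greedily skipping any column whose
    absolute correlation with an already kept column reaches the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < threshold for i in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: column 3 is a near copy of column 0.
full = rng.normal(size=(10_000, 4))
full[:, 3] = 2.0 * full[:, 0] + 0.01 * rng.normal(size=10_000)

# Step 1: work on a random sample of rows (30% here) to save memory.
sample_idx = rng.choice(full.shape[0], size=3_000, replace=False)
sample = full[sample_idx]

# Step 2: drop the redundant numerical column.
kept = drop_correlated(sample)
print(kept)
```

The threshold is subjective, so in practice it is worth trying a few values and checking model performance.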

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components. This makes the components easier to interpret. Not to forget, that’s the motive of doing PCA, where we aim to select fewer components (than features) which can explain the maximum variance in the data set. Rotation doesn’t change the relative location of the points; it only changes their actual coordinates. If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select a larger number of components to explain the variance in the data set.

Q3. You are given a data set. The data set has missing values that spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

Answer: This question has enough hints for you to start thinking! Since the data is spread around the median, let’s assume it’s a normal distribution. We know that in a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which coincides with the median and mode). Since the missing values fall in that range, the remaining ~32% of the data lies outside it. Therefore, ~32% of the data would remain unaffected by missing values.
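The ~68% figure can be checked with the standard library. This is only a quick sanity check under the normality assumption made above; the variable names are ours.

```python
import math

# For a standard normal Z, P(|Z| < 1) = erf(1 / sqrt(2)).
within_one_sd = math.erf(1 / math.sqrt(2))
outside_one_sd = 1 - within_one_sd

print(round(within_one_sd, 4), round(outside_one_sd, 4))
```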

Q4. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

Answer: If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because the 96% (as given) might only reflect correct predictions of the majority class, whereas our class of interest is the minority class (4%), i.e. the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate) and the F measure to determine the class-wise performance of the classifier. If the minority class performance is found to be poor, we can undertake the following steps: 1. We can use undersampling, oversampling or SMOTE to make the data balanced. 2. We can alter the prediction threshold and find an optimal value using the ROC curve. 3. We can assign weights to the classes such that the minority class gets a larger weight. 4. We can also use anomaly detection.
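To see why 96% accuracy can be misleading, consider a hypothetical model that predicts "no cancer" for everyone. The counts below (1000 patients, 40 of them positive) are invented to match the 4% minority class; the helper name rates is ours.

```python
def rates(tp, fn, tn, fp):
    """Accuracy, sensitivity (TPR) and specificity (TNR) from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# A model that always predicts the majority class: every cancer case is missed.
accuracy, sensitivity, specificity = rates(tp=0, fn=40, tn=960, fp=0)
print(accuracy, sensitivity, specificity)
```

The accuracy looks excellent, yet the sensitivity shows that not a single patient with cancer was detected.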

Q5. Why is naive Bayes so ‘naive’?

Answer: Naive Bayes is so ‘naive’ because it assumes that all of the features in a data set are equally important and independent of each other. As we know, these assumptions are rarely true in a real-world scenario.

Q6. Explain prior probability, likelihood and marginal likelihood in the context of the naive Bayes algorithm.

Answer: Prior probability is nothing but the proportion of each class of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: In a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email would be classified as spam. The likelihood is the probability of observing a given feature value when the class is known. For example, the probability that the word ‘FREE’ is used in a message that is known to be spam is a likelihood. The marginal likelihood is the probability that the word ‘FREE’ is used in any message.
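The three quantities fit together through Bayes’ rule. The sketch below reuses the 70% / 30% prior from the example; the two likelihood values (0.5 and 0.1) are assumptions invented purely for illustration.

```python
# Prior: class proportions from the example above.
p_spam = 0.7
p_not_spam = 0.3

# Likelihoods: hypothetical values, not taken from any real data set.
p_free_given_spam = 0.5      # P('FREE' appears | spam)
p_free_given_not_spam = 0.1  # P('FREE' appears | not spam)

# Marginal likelihood: P('FREE' appears in any message).
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * p_not_spam

# Posterior via Bayes' rule: P(spam | 'FREE' appears).
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_free, 2), round(p_spam_given_free, 3))
```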

Q7. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?

Answer: Yes, it can. Time series data often exhibits linearity. On the other hand, a decision tree algorithm is known to work best at detecting non-linear interactions. The decision tree failed to provide robust predictions because it couldn’t map the linear relationship as well as a regression model did. Therefore, we learned that a linear regression model can provide robust predictions, given that the data set satisfies its linearity assumptions.

Q8. You are assigned a new project which involves helping a food delivery company to save more money. The problem is, the company’s delivery team isn’t able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, they end up delivering food for free. Which machine learning algorithm can save them?

Answer: You might have started hopping through the list of ML algorithms in your mind. But wait! Such questions are asked to test your machine learning fundamentals. This is not a machine learning problem. This is a route optimization problem. A machine learning problem consists of three things: 1. There exists a pattern. 2. You cannot solve it mathematically (even by writing exponential equations). 3. You have data on it. Always look for these three factors to decide if machine learning is the right tool to solve a particular problem.

Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

Answer: Low bias occurs when the model’s predicted values are close to the actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While that sounds like a great achievement, remember that an overly flexible model generalizes poorly. That is, when the model is tested on unseen data, it gives disappointing results. In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging algorithms draw multiple samples from the data set using repeated randomized sampling (with replacement). Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression). Also, to combat high variance, we can: 1. Use regularization techniques, where higher model coefficients get penalized, hence lowering model complexity. 2. Use the top n features from the variable importance chart. Maybe, with all the variables in the data set, the algorithm is having difficulty in finding a meaningful signal.
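The bagging procedure described above can be sketched in a few lines. This is a toy version, not how random forest is implemented: the base learner is a 1-nearest-neighbour regressor (chosen here only because it is a simple high-variance model), and the function names and data are made up for the example.

```python
import random

def one_nn_predict(train, x):
    """High-variance base learner: return the y of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def bagged_predict(train, x, n_models=25, seed=0):
    """Average the predictions of base learners fitted on bootstrap samples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(train) for _ in train]
        predictions.append(one_nn_predict(sample, x))
    # Averaging is used for regression; voting would be used for classification.
    return sum(predictions) / len(predictions)

# Hypothetical (x, y) pairs, roughly y = x squared plus noise.
train = [(0, 0.1), (1, 0.9), (2, 4.2), (3, 8.8), (4, 16.1)]
bagged = bagged_predict(train, 2.4)
print(bagged)
```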

Q10. You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

Answer: Chances are, you might be tempted to say No, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated. For example, you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance that it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.
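The "twice the variance" claim can be checked directly on a correlation matrix. The sketch below assumes standardized variables (so PCA works on the correlation matrix) and takes perfect correlation between the first two variables as the limiting case; it assumes NumPy is available and the function name is ours.

```python
import numpy as np

def first_pc_variance(r):
    """Variance of the first principal component for 3 standardized variables,
    where the first two have correlation r and the third is uncorrelated."""
    corr = np.array([
        [1.0, r,   0.0],
        [r,   1.0, 0.0],
        [0.0, 0.0, 1.0],
    ])
    # The principal component variances are the eigenvalues of this matrix.
    return float(np.linalg.eigvalsh(corr).max())

uncorrelated_var = first_pc_variance(0.0)
correlated_var = first_pc_variance(1.0)
print(uncorrelated_var, correlated_var)
```

For intermediate correlations the largest eigenvalue of this matrix is 1 + |r|, so the inflation grows steadily with the strength of the correlation.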

Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models could perform better than the benchmark score. Finally, you decided to combine those models. Though ensembled models are known to return high accuracy, you are unfortunate. Where did you go wrong?

Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But these learners provide superior results only when the combined models are uncorrelated. Since we have used 5 GBM models and got no accuracy improvement, it suggests that the models are correlated. The problem with correlated models is that all the models provide the same information. For example: If model 1 has classified User1122 as 1, there are high chances that model 2 and model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built over the premise of combining weak uncorrelated models to obtain better predictions.
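A tiny, hand-built example makes the point. The labels and predictions below are invented so that every individual model has the same accuracy (4 out of 6); only the overlap of their mistakes differs.

```python
def majority_vote(*model_preds):
    """Predict 1 for an observation if more than half of the models say 1."""
    return [1 if sum(votes) * 2 > len(votes) else 0 for votes in zip(*model_preds)]

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

truth = [1, 0, 1, 1, 0, 1]

# Correlated models: all three make the same two mistakes (positions 0 and 1).
correlated = [[0, 1, 1, 1, 0, 1]] * 3

# Uncorrelated models: each one is wrong on a different pair of observations.
uncorrelated = [
    [0, 1, 1, 1, 0, 1],  # wrong at positions 0 and 1
    [1, 0, 0, 0, 0, 1],  # wrong at positions 2 and 3
    [1, 0, 1, 1, 1, 0],  # wrong at positions 4 and 5
]

acc_correlated = accuracy(majority_vote(*correlated), truth)
acc_uncorrelated = accuracy(majority_vote(*uncorrelated), truth)
print(acc_correlated, acc_uncorrelated)
```

With identical mistakes, voting changes nothing; with non-overlapping mistakes, each error is outvoted by the other two models.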

Q12. How is kNN different from k-means clustering?

Answer: Don’t get misled by the ‘k’ in their names. You should know that the fundamental difference between these algorithms is that k-means is unsupervised in nature and kNN is supervised in nature. k-means is a clustering algorithm. kNN is a classification (or regression) algorithm. The k-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separability between these clusters. Due to its unsupervised nature, the clusters have no labels. The kNN algorithm tries to classify an unlabeled observation based on its k (can be any number) surrounding neighbors. It is also known as a lazy learner because it involves minimal training of the model. Hence, it doesn’t build a generalization from the training data in advance; it simply stores the data and consults it at prediction time.
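The contrast shows up clearly in code: kNN needs labeled training points, while the assignment step of k-means only needs cluster centers. Both functions below are minimal sketches with made-up data; nearest_centroid is just one step of k-means, not the full iterative algorithm.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_classify(train, query, k=3):
    """Supervised: vote among the labels of the k closest training points."""
    nearest = sorted(train, key=lambda item: squared_distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def nearest_centroid(centroids, point):
    """Unsupervised (one k-means step): index of the closest cluster center."""
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i], point))

# Labeled points for kNN (hypothetical data).
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
label = knn_classify(train, (2, 2))

# Cluster centers for k-means carry no labels, only an index.
cluster = nearest_centroid([(1.3, 1.3), (8.3, 8.3)], (7, 9))
print(label, cluster)
```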

Q13. How are True Positive Rate and Recall related? Write the equation.

Answer: True Positive Rate = Recall. Yes, they are equal, both having the formula TP / (TP + FN).

Q14. You have built a multiple regression model. Your model R² isn’t as good as you wanted. For improvement, you remove the intercept term, and your model R² goes from 0.3 to 0.8. Is it possible? How?

Answer: Yes, it is possible. We need to understand the significance of the intercept term in a regression model. The intercept term shows the model prediction without any independent variable, i.e. the mean prediction. The formula is R² = 1 – Σ(y – y´)²/Σ(y – ymean)², where y´ is the predicted value. When the intercept term is present, R² evaluates your model with respect to the mean model. In the absence of the intercept term, the model can make no such evaluation: ymean drops out and the formula becomes R² = 1 – Σ(y – y´)²/Σ(y)². With this larger denominator, the ratio Σ(y – y´)²/Σ(y)² becomes smaller than before, resulting in a higher R².
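A small worked example shows the effect. The five data points are invented, and the regression has a single predictor to keep the arithmetic short; the variable names are ours.

```python
x = [1, 2, 3, 4, 5]
y = [10, 9, 11, 10, 12]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Model with intercept: ordinary least squares slope and intercept.
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean
sse_with = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
# R² measured against the mean model.
r2_with_intercept = 1 - sse_with / sum((yi - y_mean) ** 2 for yi in y)

# Model without intercept: least squares through the origin.
slope_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
sse_without = sum((yi - slope_origin * xi) ** 2 for xi, yi in zip(x, y))
# R² with ymean dropped from the denominator.
r2_no_intercept = 1 - sse_without / sum(yi ** 2 for yi in y)

print(round(r2_with_intercept, 3), round(r2_no_intercept, 3))
```

Note that the no-intercept model actually fits worse (its squared error is much larger); R² only looks better because the denominator grew.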

Q15. After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether he is right? Without losing any information, can you still build a better model?

Answer: To check for multicollinearity, we can create a correlation matrix to identify and remove variables having a correlation above 75% (deciding a threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity. But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise to the correlated variables so that the variables become different from each other. But adding noise might affect the prediction accuracy, hence this approach should be used carefully.
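One compact way to compute the VIF is to read it off the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) for each predictor regressed on the others. The sketch below assumes NumPy is available; the function name vif and the synthetic predictors are made up for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X, taken from the
    diagonal of the inverse of the predictors' correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)             # unrelated predictor

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Here x1 and x2 land far above the threshold of 10, while x3 stays close to 1.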

Q16. When is Ridge regression favorable over Lasso regression?

Answer: You can quote ISLR’s authors Hastie and Tibshirani, who asserted that, in the presence of a few variables with medium / large sized effects, use lasso regression. In the presence of many variables with small / medium sized effects, use ridge regression. Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have high variance. Therefore, it depends on our model objective.
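The difference between "shrinkage only" and "shrinkage plus selection" is easiest to see in the simplified special case where each coefficient can be solved separately (one observation per coefficient, identity design, no intercept). In that case ridge, minimizing Σ(y – b)² + λΣb², and lasso, minimizing Σ(y – b)² + λΣ|b|, have closed forms. The sketch below assumes that special case; the starting coefficients and λ are invented.

```python
def ridge_shrink(beta, lam):
    """Ridge solution in the special case: proportional shrinkage, never exactly 0."""
    return beta / (1.0 + lam)

def lasso_shrink(beta, lam):
    """Lasso solution in the special case: soft-thresholding at lam / 2."""
    if beta > lam / 2:
        return beta - lam / 2
    if beta < -lam / 2:
        return beta + lam / 2
    return 0.0

least_squares = [3.0, 0.4, -0.2]  # hypothetical unpenalized estimates
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in least_squares]
lasso = [lasso_shrink(b, lam) for b in least_squares]
print(ridge)
print(lasso)
```

Ridge keeps all three coefficients at reduced size, while lasso sets the two small ones exactly to zero.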

Q17. The rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean that a decrease in the number of pirates caused climate change?

Answer: After reading this question, you should have understood that this is a classic case of “causation and correlation”. No, we can’t conclude that the decrease in the number of pirates caused climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon. Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can’t say that pirates died out because of the rise in global average temperature.

Q18. While working on a data set, how do you select important variables? Explain your methods?

Answer: Following are the methods of variable selection you can use: 1. Remove the correlated variables prior to selecting important variables. 2. Use linear regression and select variables based on p values. 3. Use Forward Selection, Backward Selection or Stepwise Selection. 4. Use Random Forest or XGBoost and plot a variable importance chart. 5. Use Lasso Regression. 6. Measure information gain for the available set of features and select the top n features accordingly.
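As an illustration of the information gain method, here is a minimal sketch for categorical features and a categorical target. The feature names and values are invented; information gain is computed as the entropy of the labels minus the weighted entropy after splitting on the feature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = [1, 1, 0, 0]
features = {
    "plan":   ["a", "a", "b", "b"],  # separates the classes perfectly
    "region": ["x", "y", "x", "y"],  # tells us nothing about the class
}

gains = {name: information_gain(values, labels) for name, values in features.items()}
# Rank features by gain and keep the top n (n = 1 here).
top_features = sorted(gains, key=gains.get, reverse=True)[:1]
print(gains, top_features)
```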

Q19. What is the difference between covariance and correlation?

Answer: Correlation is the standardized form of covariance. Covariances are difficult to compare because their values depend on the units of the variables. For example: if we calculate the covariance of salary ($) and age (years), the result changes as soon as we measure salary on a different scale, so covariances from differently scaled variables can’t be compared. To combat such a situation, we calculate the correlation to get a value between -1 and 1, irrespective of the respective scales.
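A quick check with made-up age and salary figures shows the scale dependence: expressing salary in thousands of dollars changes the covariance by a factor of 1000 but leaves the correlation untouched.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance standardized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

age = [25, 35, 45, 55]                     # years (hypothetical)
salary_usd = [40000, 52000, 61000, 75000]  # dollars (hypothetical)
salary_k = [s / 1000 for s in salary_usd]  # same salaries in thousands

print(covariance(age, salary_usd), covariance(age, salary_k))
print(correlation(age, salary_usd), correlation(age, salary_k))
```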

Q20. Is it possible to capture the correlation between continuous and categorical variables? If yes, how?

Answer: Yes, we can use ANCOVA (analysis of covariance) technique to capture the association between continuous and categorical variables.

Q21. Both being tree-based algorithms, how is random forest different from the gradient boosting algorithm (GBM)?

Answer: The fundamental difference is that random forest uses bagging techniques to make predictions, while GBM uses boosting techniques. In the bagging technique, a data set is divided into n samples using randomized sampling. Then, using a single learning algorithm, a model is built on each sample. Later, the resultant predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, such that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached. Random forest improves model accuracy by reducing variance (mainly). The trees grown are uncorrelated to maximize the decrease in variance. On the other hand, GBM improves accuracy by reducing both bias and variance in a model.
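The "weigh misclassified predictions higher" step can be sketched with one round of an AdaBoost-style weight update. Note the assumption: this is AdaBoost’s reweighting rule, used here because it matches the description above most directly; gradient boosting implementations typically fit each new tree to the errors of the previous ones instead. The function name and the five observations are made up.

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost-style round: raise the weights of misclassified
    observations, lower the rest, then renormalize to sum to 1.
    Assumes the weighted error is strictly between 0 and 1."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]

weights = [0.2] * 5                                # start with equal weights
misclassified = [True, False, False, False, False]  # only the first one is wrong

new_weights = reweight(weights, misclassified)
print(new_weights)
```

After the update, the single misclassified observation carries as much weight as the other four combined, so the next model is pushed to get it right.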