Machine learning mock interview

In this article at OpenGenus, we will see what a machine learning interview for a data science job looks like.

Machine learning is one of the areas whose knowledge is extensively tested when we appear for a data science interview, and it is often kept as a separate round within the technical interviews. We will consider an imaginary candidate here for ease of understanding. Like every other round, this one usually starts with an introduction, which we are already familiar with.

Companies hire machine learning developers globally, and as a candidate you should be well prepared for the interview format to secure a job as a Machine Learning Engineer and get to work in this fast-growing field.


Interviewer : Let's start from the basics. What is the difference between a parametric and a non-parametric learning algorithm?

In a parametric machine learning algorithm, we have a fixed, finite number of parameters regardless of the sample size. On the other hand, the number of parameters in a non-parametric algorithm is potentially infinite: the complexity of a non-parametric model grows with the number of records in the training data, whereas a parametric model keeps a fixed number of parameters. Statistically, a parametric model assumes that the data follows a particular distribution, whereas a non-parametric model is 'distribution-free'.
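
To make the distinction concrete, here is a minimal sketch on made-up data contrasting scikit-learn's LinearRegression, a parametric model summarised by a fixed set of coefficients, with KNeighborsRegressor, a non-parametric model whose "size" is the stored training set itself:

# Minimal sketch contrasting a parametric and a non-parametric regressor (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = 3 * X.ravel() + rng.randn(200) * 0.1

# Parametric: the fitted model is summarised by a fixed number of parameters
# (one coefficient and one intercept), no matter how many rows we train on.
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)

# Non-parametric: predictions come from the stored training samples themselves,
# so the model effectively grows with the training data.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))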

Interviewer : What is the difference between inductive machine learning and deductive machine learning?

Inductive learning:

  • Draws conclusions after observing and learning from a set of instances
  • Uses a bottom-up approach to reasoning
  • Moves from specific observations to a generalization
  • The conclusions are probabilistic

Deductive learning:

  • Derives a conclusion first and then works on it based on that previous decision
  • Uses a top-down approach to reasoning
  • Moves from a generalized statement to a valid conclusion
  • The conclusions are certain

Interviewer : Suppose you're building a model to predict booking prices on Airbnb. Which model would perform better, linear regression or random forest regression?

For a question like this, we need to explain both algorithms and then justify our choice. Suppose our candidate chose random forest regression; it can be justified as follows:

I would go with the random forest regression model as it tends to perform better with categorical predictors. Compared to linear regression, random forest handles high cardinality and missing values well and is also not significantly impacted by outliers.
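
To back up such a choice in an interview, one could sketch a quick cross-validated comparison. The DataFrame and column names below are invented for illustration and only stand in for an Airbnb-style listings table:

# Hypothetical comparison sketch: cross-validated error of linear regression vs.
# random forest. The tiny synthetic DataFrame below stands in for Airbnb listings.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 300
listings = pd.DataFrame({
    'room_type': rng.choice(['entire_home', 'private_room', 'shared_room'], n),
    'bedrooms': rng.randint(1, 5, n),
    'distance_to_centre_km': rng.rand(n) * 10,
})
price = 40 * listings['bedrooms'] - 5 * listings['distance_to_centre_km'] + rng.randn(n) * 20

X = pd.get_dummies(listings)   # one-hot encode the categorical column
for name, model in [('linear regression', LinearRegression()),
                    ('random forest', RandomForestRegressor(n_estimators=200, random_state=0))]:
    scores = cross_val_score(model, X, price, scoring='neg_mean_absolute_error', cv=5)
    print(name, round(-scores.mean(), 2))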

Interviewer : How do you interpret linear regression coefficients?

A regression coefficient in linear regression tells us how much the mean of the dependent variable changes for a one-unit shift in the corresponding independent variable, holding all the other variables constant.
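
For instance, a small synthetic sketch (all numbers here are made up):

# Toy illustration: the fitted coefficient is the expected change in y per unit change in x.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.rand(100, 1) * 10                                 # e.g. years of experience
y = 2500 * x.ravel() + 30000 + rng.randn(100) * 1000      # e.g. salary

model = LinearRegression().fit(x, y)
# The coefficient comes out near 2500: each extra unit of x shifts the predicted
# mean of y by roughly 2500 (here there are no other variables to hold constant).
print(model.coef_[0], model.intercept_)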

Interviewer : We don’t necessarily apply feature scaling in our simple linear regression model. Why is that?

In simple words, since y (the dependent variable) is a linear combination of the independent variable(s), the coefficients can adapt their scale to put everything on the same scale. We get an equivalent solution whether or not we apply feature scaling, so we don't explicitly need to perform feature scaling in simple linear regression.
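
A quick sketch of this claim: fitting ordinary least squares on raw features and on standardized features gives numerically identical predictions (synthetic data below).

# Sketch: linear regression predictions are unchanged by feature scaling,
# because the coefficients simply rescale to compensate.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 3) * [1, 100, 10000]          # features on very different scales
y = X @ [3.0, 0.5, 0.002] + rng.randn(100)

pred_raw = LinearRegression().fit(X, y).predict(X)

X_scaled = StandardScaler().fit_transform(X)
pred_scaled = LinearRegression().fit(X_scaled, y).predict(X_scaled)

print(np.allclose(pred_raw, pred_scaled))       # True (up to floating point error)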

Interviewer : Can you build a simple linear regression model?

# Simple linear regression with scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the data: the first column is the feature, the second column is the target.
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# Hold out one third of the data for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Fit the model on the training split.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on the held-out test split.
y_pred = regressor.predict(X_test)
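
As a quick follow-up, the candidate might also show how to check the fit on the held-out split; this short sketch continues from the variables defined above:

# Quick check of the fit on the held-out test set (continuing from the variables above).
from sklearn.metrics import r2_score, mean_absolute_error

print('R^2 :', r2_score(y_test, y_pred))
print('MAE :', mean_absolute_error(y_test, y_pred))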

Interviewer : Would additional features necessarily improve a Logistic Regression model?

Additional features do not necessarily improve the performance of a logistic regression model. If we only increase the number of features without a corresponding increase in the number of observations, we end up with data that has many features but comparatively few observations, which makes the model prone to overfitting.
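
One way to illustrate the effect (a synthetic sketch): append pure-noise features to a small dataset and watch the gap between training and test accuracy widen.

# Sketch: adding uninformative features to logistic regression can widen the gap
# between training and test accuracy (overfitting) when observations are few.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
rng = np.random.RandomState(0)

for extra in [0, 50, 200]:
    noise = rng.randn(X.shape[0], extra)            # pure-noise features
    X_aug = np.hstack([X, noise])
    X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
    clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    print(extra, clf.score(X_tr, y_tr), clf.score(X_te, y_te))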

Interviewer : In what scenario would you prefer a decision tree over random forest?

Decision trees are much easier to interpret and understand than random forests. Since a random forest combines multiple decision trees, it becomes harder to interpret and is used when the need to interpret the model is not a major concern for us. Ultimately, it depends on our goal: the trade-off is between interpretability and accuracy. If we care about communicating the reasons behind our predictions, we should prefer a decision tree. Another, less obvious, case where a decision tree is preferred over a random forest is when we use the tree not to predict out of sample but to understand the data we already have.
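
For instance, a single tree can be dumped as human-readable if/else rules, something that is impractical for a forest of hundreds of trees. A small sketch:

# Sketch: a single decision tree can be printed as readable rules,
# which is what makes it attractive when interpretability matters.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))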

Interviewer : Let’s say you want to predict the probability of a flight delay, but there are flights with delays of up to 12 hours that are really messing up your model. How would you fix this issue?

One way to fix the issue is to categorize the output class into various groups based on the hours of flight delays. These might be:

  • Delays less than 3 hours
  • Delays between 3-10 hours
  • Delays greater than 10 hours

This turns the problem into a classification problem and stops the extreme delays from skewing a regression fit (see the bucketing sketch below). Another effective way is to identify and filter out the outliers through analysis.
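
A minimal sketch of the bucketing idea, using a hypothetical delay_hours column:

# Sketch: turn a heavy-tailed continuous target into classes with pandas.cut.
# The DataFrame and its 'delay_hours' column are hypothetical.
import pandas as pd

flights = pd.DataFrame({'delay_hours': [0.2, 1.5, 4.0, 11.5, 0.0, 7.3]})
flights['delay_class'] = pd.cut(flights['delay_hours'],
                                bins=[-0.01, 3, 10, float('inf')],
                                labels=['< 3h', '3-10h', '> 10h'])
print(flights)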

Interviewer : Suppose we have 1 million app rider journey trips in the city of Berlin. We want to build a model to predict ETA after a rider makes a ride request. How would we know if we have enough data to create an accurate enough model?

This question is asked to assess the candidate's ability to approach a machine learning problem by weighing all the necessary factors. Our first point should be that collecting data can be costly, and the next question should be whether this is the first version of the model. If it is, we should consider some factors before coming to a conclusion about the amount of data available:

  • Check whether we have a reasonable ratio of feature-set size to training-data size. If we have a large number of features and, in comparison, a small sample size, the model is prone to overfitting and inaccurate predictions.
  • To get a basic idea, hold out some of the data while training the model and use it to check the model's performance, as sketched below.
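
One common way to make the hold-out idea concrete is a learning curve: if the validation error is still improving as the training size grows, more data is likely to help. The sketch below uses synthetic data in place of the real ETA features:

# Sketch: learning curve to judge whether more training data would help.
# Synthetic X and y stand in for the real ETA feature matrix and target.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3,
    scoring='neg_mean_absolute_error')

# If the validation error is still improving at the largest training size,
# collecting more data is probably worthwhile; if it has flattened, it is not.
print(train_sizes)
print(-val_scores.mean(axis=1))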

Interviewer : Given some inputs, write a gradient descent function.

def cal_cost(theta, X, y):
    '''
    Calculates the cost for the given X and y. The following shows an example for a single-dimensional X:
    theta = vector of thetas
    X     = rows of X, np.zeros((2, j))
    y     = actual y values, np.zeros((2, 1))

    where:
        j is the number of features

    This function is called internally by the gradient descent function.
    '''
    m = len(y)
    predictions = X.dot(theta)
    # Mean squared error cost; note the 1/(2*m) factor.
    cost = (1 / (2 * m)) * np.sum(np.square(predictions - y))
    return cost


def gradient_descent(X, y, theta, learning_rate=0.01, iterations=100):
    '''
    X     = matrix of X with an added bias column
    y     = vector of y
    theta = vector of thetas, np.random.randn(j, 1)
    learning_rate
    iterations = number of iterations

    Returns the final theta vector plus the cost and theta history over the iterations.
    '''
    m = len(y)
    cost_history = np.zeros(iterations)
    theta_history = np.zeros((iterations, 2))   # assumes two parameters (bias + one feature)
    for it in range(iterations):
        prediction = np.dot(X, theta)
        # Move theta against the gradient of the cost, scaled by the learning rate.
        theta = theta - (1 / m) * learning_rate * (X.T.dot(prediction - y))
        theta_history[it, :] = theta.T
        cost_history[it] = cal_cost(theta, X, y)

    return theta, cost_history, theta_history
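
A quick usage sketch of the functions above on synthetic data generated from y = 4 + 3x:

# Usage sketch for the functions above, on synthetic y = 4 + 3x data.
import numpy as np

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]        # add the bias column
theta = np.random.randn(2, 1)
theta, cost_history, theta_history = gradient_descent(X_b, y, theta,
                                                      learning_rate=0.05,
                                                      iterations=1000)
print(theta)   # should be close to [[4], [3]]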
        

Interviewer : What is gradient descent?

A gradient can be thought of as the slope of a function: it measures how much the output of the function changes when the inputs are changed a little. Gradient descent is an optimization algorithm that iteratively adjusts a model's parameters in the direction that minimizes a cost (loss) function.

Interviewer : What is the difference between stochastic gradient descent (SGD) and gradient descent?

Gradient descent is an iterative method where, at each step, the gradient is calculated exactly from all the data points. In stochastic gradient descent, by contrast, the gradient of the loss is estimated from a single randomly chosen data point (or a small batch) at each step. This is the main difference between gradient descent and stochastic gradient descent.
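
The difference can be shown as a small modification of the batch loop above: instead of using all m rows in every update, each update uses one randomly chosen row. A sketch (assuming, as above, that y is a column vector and X already contains a bias column):

# Sketch: stochastic gradient descent for linear regression; compare with the
# batch update in gradient_descent above, which uses all rows every iteration.
import numpy as np

def stochastic_gradient_descent(X, y, theta, learning_rate=0.01, epochs=50):
    m = len(y)
    for epoch in range(epochs):
        for _ in range(m):
            i = np.random.randint(m)                 # one randomly chosen sample
            xi, yi = X[i:i+1], y[i:i+1]
            gradient = xi.T.dot(xi.dot(theta) - yi)  # gradient from that single point
            theta = theta - learning_rate * gradient
    return theta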

Interviewer : What is KL divergence, and how would you define its use case in ML?

Kullback-Leibler (KL) divergence measures how one probability distribution differs from another. It is non-symmetric and is an entropy-based quantity. KL divergence is the expectation of the log difference between the probability of the data under the original distribution and under the approximating distribution, and it essentially tells us how many bits of information we can expect to lose. It is also closely related to cross-entropy: minimizing cross-entropy with respect to the approximating distribution is the same as minimizing KL divergence. KL divergence is given by the formula:
D_KL(p || q) = Σ_{i=1}^{N} p(x_i) · (log p(x_i) − log q(x_i))
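
A small numeric sketch of this formula for two discrete distributions:

# Numeric sketch of the KL divergence formula for two discrete distributions.
import numpy as np

p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])

kl_pq = np.sum(p * (np.log(p) - np.log(q)))
kl_qp = np.sum(q * (np.log(q) - np.log(p)))
print(kl_pq, kl_qp)   # note the asymmetry: KL(p||q) != KL(q||p)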

Interviewer : Can you implement L1 regularization in any ML model of your choice?

# L1-regularized linear SVM on the iris dataset, then use the (sparse)
# coefficients for feature selection.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X = iris.data
y = iris.target

# penalty="l1" applies L1 regularization; a small C means stronger regularization.
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)

# Keep only the features whose coefficients were not driven to zero.
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
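
Continuing from the snippet above, we can check which features survived the L1 penalty (SelectFromModel drops the features whose coefficients were driven to zero):

# Which iris features were kept after the L1 penalty zeroed some coefficients?
import numpy as np

print(np.array(iris.feature_names)[model.get_support()])
print(X.shape, X_new.shape)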

With this mock interview at OpenGenus, you should now have a good sense of what going through a Machine Learning interview is like.
