OpenSource Internship opportunity by OpenGenus for programmers. Apply now.
This is the most complete Data Science Cheatsheet which you should follow to revise all Data Science concepts within 30 minutes and get ready for Interviews and stay in form.
Data science lifecycle  

Phase  Description  
Discovery  
Understanding data  Involves describing what data is needed, how relevant are they and finally extracting the required data.  
Data preparation  
Data analysis  
Model planning  Decide on our machine learning model based on the business problem.  
Model building and deployment  Create and evaluate the ML model and finally deploy it in the preferred environment.  
Communication of results 
Machine Learning  

Supervised Learning  Type of machine learning technique where models are trained using labeled data as inputs. Commonly used fore regression and classification tasks.  
Unsupervised Learning  Type of machine learning technique where models are trained using unalbeled data as inputs. Used for extracting information from large amounts of data.  
Semisupervised Learning  Combination of supervised and unsupervised learning where a small amount of inputs are labeled and large portions of them are unlabeled.  
Reinforcement Learning  This is a machine learning technique concerned with teaching agents to take decisions in environment to maximize the reward.  
Regression  
Classification  These algorithms are used to categorize the given test data accurately, such as telling apart a cat from a dog.  
Ensemble Learning  Ensemble methods helps improve the performance of a machine learning model by combining several ML base models to produce one single predictive model.  
Recommender Systems 
Supervised Learning  

Algorithm  Description  Advantages  Disadvantages 
Logistic Regression  An algorithm that models linear relationship between inputs and outputs a categorical variable.  
Linear Regression  An algorithm that models linear relationship between inputs and produces continuous outputs.  
Support Vector Machines  An algorithm that aims to create the best decision boundary to group ndimensional space into different classes.  
Random Forest  It is a combination of many decision trees and is an ensemble learning method.  
Decision Tree  An algorithm that can be used for both regression and classification where models make decision rules on features to obtain predictions.  
KNearest Neighbors  An algorithm that uses feature similarity to predict values of new data points. 
Unsupervised Learning  

Algorithm  Description  Advantages  Disadvantages 
KMeans Clustering  A clustering algorithm that determines K clusters based on euclidean distances.  
Hierarchical Clustering  
DBSCAN  
Apriori Algorithm  
Principal Component Analysis  This algorithm is widely used for dimensionality reduction.  
Manifold Learning  It is used for nonlinear dimensionality reduction and aims to describe datasets as lowdimensional manifolds embedded in highdimensional spaces. 
Deep Learning  

Neural Network  A neural network takes an input, passes it through multiple layers of hidden neurons and outputs a prediction representing the combined input of all the neurons.  
Architectures 
 
LSTM  LSTM is a variant of RNN that is used for learning long term dependencies. It has a memory cell to record additional information.  
Back propagation  A back propagation algorithm consists of two main steps:
 
Gradient descent  
Activation function  
Loss function  
Optimizers   
Regularization  It is a technique for combating overfitting and improving training. Some of them are early stopping, data augmentation and ensembling.  
Layers 

Python Basics  

Concept  Code  Description 
NumPy  
Creating arrays  
Inspecting the array  
Arithmetic Operations  
Aggregate functions  
Subsetting and Slicing  
Array manipulation  
Pandas  
Series  s = pd.Series([3, 5, 7, 4], index=['a', 'b', 'c', 'd'])  A onedimensional labeled array a capable of holding any data type 
DataFrame  data = {'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'BrasÃlia'],
'Population': [11190846, 1303171035, 207847528]} df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])  A twodimensional labeled data structure with columns of potentially different types 
Reading csv files  pd.read_csv('file.csv', header=None, nrows=5)  
Selecting and setting  
Sorting and dropping  
Retrieving basic dataframe information  
Summary of dataframe information 
Statistics and Probability  

Concept  Description  Formula/Graph 
Mean  The mean denotes the average of the group of finite numbers.  
Median  The median denotes the middle of an ordered set of data.  
Mode  Is only relevant for discrete data and is the the most common value occurring in a dataset.  
Variance  Variance gives a measure of the degree to which each value in the population/sample differs from the mean value.  
Standard deviation  The standard deviation tells us how much the values in the sample/ population is spread out from the mean value.  
Covariance  Covariance is used to identify how they both change together and also the relationship between them.  
Correlation  Correlation is dimensionless and is used to quantify the relationship between two variables. It has its range as [1,1].  
Central limit theorem  It states that "As the sample size becomes larger, the distribution of sample means approximates to a normal distribution curve."  
Law of large numbers  The law of large numbers states that As the number of trials or observations increases, the actual or observed average approaches the theoretical or expected average.  
Bayes theorem  
Hypothesis testing  
A/B testing  A/B testing is a famous testing technique used to compare two variants to determine the best of the two based on user experience.  
Confidence intervals  A Confidence interval expresses a range of values within which we are pretty sure that the population parameter lies  
Normal distribution  
Poisson distribution  Distribution that expresses the probability of a given number of events occurring within a fixed time period 
Data visualization  

Chart  Description  Image 
Capturing trends  
Line chart  
Area chart  
Capturing distributions  
Histogram  
Boxplot  Shows the distribution of a variable using 5 key summary statistics.  
Violinplot  
Part towhole charts  
Pie chart  
Donut chart  
Heatmap  
Stacked chart  Compare subcategories within categorical data.  
Visualising relationships  
Bar/column chart  
Scatter plot  
Bubble chart 
Time series analysis  

Concept  Description  Code 
ACF plot  The autocorrelation function (ACF) plot shows the autocorrelation coefficients as a function of the lag.  import statsmodels.api as sm sm.graphics.tsa.plot_acf(data) 
PACF plot  The partial autocorrelation function (PACF) plot shows the partial autocorrelation coefficients as a function of the lag.  import statsmodels.api as sm sm.graphics.tsa.plot_pacf(data) 
ADF test  If a series is stationary, its mean, variance, and autocorrelation are constant over time. We can test for stationarity with augmented DickeyFuller (ADF) test.  from statsmodels.tsa.stattools import adfuller p_value= adfuller(data) 
Time series decomposition  Separate the series into 3 components: trend,seasonality, and residuals  from statsmodels.tsa.seasonal import STL decomp=STL(data,period=m).fit() plt.plot(decomp.observed) plt.plot(decomp.trend) plt.plot(decomp.seasonal) plt.plot(decomp.resid) 
Moving average model â€“ MA(q)  The moving average model: the current value depends on the mean of the series, the current error term, and past error terms.  from statsmodels.tsa.statespace.sarimax import SARIMAX model=SARIMAX(data,order=(0,0,q)) 
Autoregressive model â€“ AR(p)  The autoregressive model is a regression against itself. This means that the present value depends on past values.
 from statsmodels.tsa.statespace.sarimax import SARIMAX model=SARIMAX(data,order=(p,0,0)) 
ARMA(p,q)  The autoregressive moving average model (ARMA) is the combination of the autoregressive model AR(p), and the moving average model MA(q).  from statsmodels.tsa.statespace.sarimax import SARIMAX model=SARIMAX(data,order=(p,0,q)) 
ARIMA(p,d,q)  The autoregressive integrated moving average (ARIMA) model is the combination of the autoregressive model AR(p), and the moving average model MA(q), but in terms of the differenced series.  from statsmodels.tsa.statespace.sarimax import SARIMAX model=SARIMAX(data,order=(p,d,q)) 
SARIMA(p,d,q)(P,D,Q)_{m}  The seasonal autoregressive integrated moving average (SARIMA) model includes a seasonal component on top of the ARIMA model.  from statsmodels.tsa.statespace.sarimax import SARIMAX model=SARIMAX(data,order=(p,d,q),seasonal_order=(P,D,Q,m)) 