The process of determining the independent, actual effect of a particular phenomenon that is a component of a larger system is called causal inference. Common statistical measures such as correlation do not ensure causality, which is why a more rigorous, scientific method is needed to establish causation.
Table of Contents:
- Causal reasoning
- Methodology
- Experimental Method
- Quasi-experimental Method
- Implementation in Python
- Real-life Applications
Causal reasoning:
Causal reasoning is the process of identifying causality: the relationship between a cause and its effect. The study of causality originated in ancient philosophy; the first systematic treatment of cause and effect appears in Aristotle's Physics. It has also been discussed in neuropsychology, where changes in the brain are correlated with behavioral outcomes. Assumptions about the nature of causality may be framed as an earlier event producing a later one. Causal inference is an example of causal reasoning.
Methodology
In causal inference, the measure of one variable is suspected to affect the measure of another variable in a system. The first step is to formulate a falsifiable null hypothesis, which is then tested with statistical methods. The probability that the null hypothesis is true must be calculated, and Bayesian inference can be used to determine the effect of an independent variable.
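As a toy illustration of that Bayesian step, the posterior probability of a hypothesis can be computed from a prior belief and the likelihood of the observed data under each hypothesis. The prior and likelihood values below are made-up numbers used purely for illustration.
# Toy Bayesian update: P(H | data) = P(data | H) * P(H) / P(data)
# All numbers are illustrative assumptions, not results from a real study.
prior_h = 0.5               # prior belief that the treatment has an effect
p_data_given_h = 0.30       # likelihood of the observed data if the effect is real
p_data_given_not_h = 0.10   # likelihood of the same data if there is no effect

evidence = p_data_given_h * prior_h + p_data_given_not_h * (1 - prior_h)
posterior_h = p_data_given_h * prior_h / evidence
print(f"P(effect is real | data) = {posterior_h:.2f}")   # prints 0.75 here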
Suppose a chemist invents a new drug to cure a disease. There are then four categories of patients:
- The patient consumed the drug and was cured.
- The patient consumed the drug but was not cured.
- The patient did not consume the drug but was cured.
- The patient did not consume the drug and was not cured.
There are two Uplift Modelling meta-learning techniques:
1. Two-Model Approach
The Individual Treatment Effect (ITE) for individual i is then

ITE = probability of being cured with the drug - probability of being cured without the drug
    = P(Cured | Drug consumed) - P(Cured | Drug not consumed)
    = P(Y_i = 1 | X_i, W_i = 1) - P(Y_i = 1 | X_i, W_i = 0)

where
- Y_i indicates whether individual i is cured,
- X_i is the lead (feature) vector encoding the characteristics of individual i,
- W_i indicates whether individual i consumed the drug.
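Below is a minimal sketch of the Two-Model (also called T-learner) approach using scikit-learn. The simulated data, the column names, and the choice of logistic regression are assumptions made for illustration only.
# Two-Model (T-learner) sketch: fit one outcome model per treatment arm,
# then take the difference of predicted cure probabilities as the ITE.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({"age": rng.integers(20, 80, n), "severity": rng.random(n)})
W = rng.integers(0, 2, n)                               # 1 = consumed the drug
# Simulated outcome: the drug raises the cure probability (illustrative only)
Y = (rng.random(n) < 0.3 + 0.2 * W - 0.1 * X["severity"]).astype(int)

model_treated = LogisticRegression(max_iter=1000).fit(X[W == 1], Y[W == 1])
model_control = LogisticRegression(max_iter=1000).fit(X[W == 0], Y[W == 0])

# ITE_i = P(Y_i = 1 | X_i, W_i = 1) - P(Y_i = 1 | X_i, W_i = 0)
ite = model_treated.predict_proba(X)[:, 1] - model_control.predict_proba(X)[:, 1]
print("Average estimated treatment effect:", ite.mean())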
2. Class Transformation Approach
The Individual Treatment Effect can instead be written as

ITE = probability of being cured with the drug + probability of not being cured without the drug - 1
    = P(Y_i = 1 | X_i, W_i = 1) + P(Y_i = 0 | X_i, W_i = 0) - 1
    = 2 * P(Z_i = 1 | X_i) - 1

where Z_i = Y_i * W_i + (1 - Y_i) * (1 - W_i), assuming the treatment and control groups are (roughly) equally likely.
- Y_i indicates whether the patient is cured.
- W_i indicates whether the patient consumed the drug.
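A minimal sketch of the Class Transformation approach is shown below, using the same kind of simulated data as the previous sketch. It relies on the treatment being assigned to roughly half of the individuals, which is an assumption of this illustration.
# Class Transformation sketch: train a single model on the transformed label
# Z_i = Y_i*W_i + (1 - Y_i)*(1 - W_i); the uplift is 2*P(Z_i = 1 | X_i) - 1.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({"age": rng.integers(20, 80, n), "severity": rng.random(n)})
W = rng.integers(0, 2, n)          # 1 = consumed the drug (roughly 50/50 split)
Y = (rng.random(n) < 0.3 + 0.2 * W - 0.1 * X["severity"]).astype(int)

Z = Y * W + (1 - Y) * (1 - W)      # 1 when the outcome "agrees" with the treatment status
model = LogisticRegression(max_iter=1000).fit(X, Z)
ite = 2 * model.predict_proba(X)[:, 1] - 1
print("Average estimated treatment effect:", ite.mean())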
A randomized controlled trial can be used to answer such questions by assigning the treatment (here, the drug) at random.
Experimental Method
There are experimental methods to verify causal mechanisms. Suppose that, keeping the other experimental variables constant, we manipulate the variable of interest A and find that another variable B also changes; then A is called the independent variable and B the dependent variable. If A has a statistically significant effect on B, the effect is considered causal.
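As a rough illustration, a common way to check whether manipulating A has a statistically significant effect on B is to compare the outcome across the manipulated and control groups, for example with a two-sample t-test. The simulated measurements below are purely illustrative.
# Illustrative significance test for an experiment: manipulate A for one group,
# hold it constant for the other, then compare the outcome B across groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
b_control = rng.normal(loc=50, scale=10, size=200)       # A held constant
b_manipulated = rng.normal(loc=55, scale=10, size=200)   # A changed

t_stat, p_value = stats.ttest_ind(b_manipulated, b_control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests that A has a statistically significant effect on B.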
Quasi-experimental Method
When traditional experimental methods are unavailable, infeasible, or illegal, quasi-experimental verification is used. In this method, researchers collect data before and after a change in A (the independent variable) and work with the collected data to verify causality.
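One widely used quasi-experimental technique (one possible choice among several, not named above) is difference-in-differences: the before/after change in the group affected by A is compared with the before/after change in an unaffected group. The numbers below are made up for illustration.
# Difference-in-differences sketch: causal effect estimated as
# (after - before) in the affected group minus (after - before) in the control group.
import pandas as pd

data = pd.DataFrame({
    "group":  ["affected", "affected", "control", "control"],
    "period": ["before", "after", "before", "after"],
    "mean_outcome": [40.0, 55.0, 42.0, 47.0],   # illustrative values only
})

pivot = data.pivot(index="group", columns="period", values="mean_outcome")
did = (pivot.loc["affected", "after"] - pivot.loc["affected", "before"]) \
    - (pivot.loc["control", "after"] - pivot.loc["control", "before"])
print("Difference-in-differences estimate:", did)   # (55 - 40) - (47 - 42) = 10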
Implementation in Python
In Python, the DoWhy library can be used to perform causal inference. DoWhy does this in four steps: Modeling, Identification, Estimation, and Refutation. Here is an example of evaluating the impact of a signup program on customer spending behavior over time.
Creating a dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
!pip install dowhy
import dowhy
from dowhy import CausalModel
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
num_users = 10000
num_months = 12
signup_months = np.random.choice(np.arange(1, num_months), num_users) * np.random.randint(0,2, size=num_users)
# signup_months == 0 means the customer did not sign up
df = pd.DataFrame({
'user_id': np.repeat(np.arange(num_users), num_months),
'signup_month': np.repeat(signup_months, num_months),
'month': np.tile(np.arange(1, num_months+1), num_users), # months are from 1 to 12
'spend': np.random.poisson(500, num_users*num_months) #np.random.beta(a=2, b=5, size=num_users * num_months)*1000 # centered at 500
})
# A customer is in the treatment group if and only if they signed up
df["treatment"] = df["signup_month"]>0
# Simulate an effect of month (monotonically decreasing: customers buy less later in the year)
df["spend"] = df["spend"] - df["month"]*10
# Simulate a simple treatment effect of 100
after_signup = (df["signup_month"] < df["month"]) & (df["treatment"])
df.loc[after_signup,"spend"] = df[after_signup]["spend"] + 100
df
This code simulates a panel dataset of 10,000 customers over 12 months: each customer may sign up in some month (signup_month == 0 means no signup), monthly spending trends downward over the year, and customers who signed up receive an extra 100 in spending in the months after signup.
I. Model a causal problem
i = 3
causal_graph = """digraph {
treatment[label="Program Signup in month i"];
pre_spends;
post_spends;
Z->treatment;
pre_spends -> treatment;
treatment->post_spends;
signup_month->post_spends;
signup_month->treatment;
}"""
# Post-process the data based on the graph and the month of the treatment (signup).
# For each customer, determine their average monthly spend before and after month i.
df_i_signupmonth = (
df[df.signup_month.isin([0, i])]
.groupby(["user_id", "signup_month", "treatment"])
.apply(
lambda x: pd.Series(
{
"pre_spends": x.loc[x.month < i, "spend"].mean(),
"post_spends": x.loc[x.month > i, "spend"].mean(),
}
)
)
.reset_index()
)
print(df_i_signupmonth)
model = dowhy.CausalModel(data=df_i_signupmonth,
graph=causal_graph.replace("\n", " "),
treatment="treatment",
outcome="post_spends")
model.view_model()
II. Identify a target estimand under the model
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
III. Estimate causal effect based on the identified estimand
estimate = model.estimate_effect(identified_estimand,
method_name='backdoor.propensity_score_matching',
target_units='att')
print(estimate)
IV. Refute the obtained estimate
refutation = model.refute_estimate(identified_estimand, estimate, method_name='placebo_treatment_refuter',
placebo_type='permute', num_simulations=20)
print(refutation)
This code simulates a dataset to analyze the causal impact of a signup program on customer spending over time. By defining a causal graph, it models the relationships between treatment (signup), pre-signup spends, and post-signup spends. Using DoWhy, the code identifies and estimates the causal effect, showing that signing up increases spending. A placebo test confirms the robustness of this effect, indicating the observed increase in spending is likely due to the program rather than random chance. This approach effectively demonstrates how to use causal inference techniques to evaluate program impacts.
Real-life Applications
In etiology, it is used to find the correct cause of a disease among multiple factors such as a pathogen, a particular gene trait, or chemical substrates. Award-winning computer scientist and philosopher Judea Pearl first discussed the concept of causal AI and the limits of machine learning in his 2018 book "The Book of Why: The New Science of Cause and Effect". In 2020, Columbia University established a Causal AI Lab, directed by Elias Bareinboim, for research on causal and counterfactual inference applied to data-driven fields such as the health sector, social sciences, and consulting. On a conceptual level, the idea is to factorize the joint distribution P(Cause, Effect) into P(Cause) * P(Effect | Cause) rather than into P(Effect) * P(Cause | Effect), to reduce its complexity. A different family of methods is used to discover causal "footprints" from large amounts of labeled data, allowing the prediction of more flexible causal relations. One practical use of causal AI in organisations is to explain decision-making and the causes behind a decision.