# Sentiment Analysis with Logistic Regression

This notebook gives a simple example of explaining a linear logistic regression sentiment analysis model using shap. Note that for a linear model, the SHAP value of feature $i$ for the prediction $f(x)$ (assuming feature independence) is just $\phi_i = \beta_i \cdot (x_i - E[x_i])$. Since we are explaining a logistic regression model, the SHAP values are in units of log-odds.
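
As a minimal sketch of the formula above, the following computes the SHAP values of a toy linear model by hand and checks that they add up to the model output. The coefficients, intercept, and background data are invented for illustration and are not taken from the IMDB model; `linear_shap` and `log_odds` are hypothetical helpers, not part of the shap API.

```python
# Toy illustration of phi_i = beta_i * (x_i - E[x_i]) for a linear model.
# All numbers below are made up for the example.

def linear_shap(beta, x, background):
    """Per-feature SHAP values of a linear model, assuming independent features."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(beta))]
    return [b * (xi - m) for b, xi, m in zip(beta, x, means)]

def log_odds(beta, intercept, x):
    """Output of the linear model in log-odds space."""
    return intercept + sum(b * xi for b, xi in zip(beta, x))

beta = [2.0, -1.5, 0.5]
intercept = -0.25
background = [
    [0.0, 0.2, 0.4],
    [0.4, 0.0, 0.2],
    [0.2, 0.4, 0.0],
]
x = [0.5, 0.0, 0.2]

phi = linear_shap(beta, x, background)

# The base value is the average model output over the background data.
base_value = sum(log_odds(beta, intercept, row) for row in background) / len(background)

# Additivity: the base value plus all SHAP values recovers f(x).
assert abs(base_value + sum(phi) - log_odds(beta, intercept, x)) < 1e-12
print(phi)
```

The third feature sits exactly at its background mean, so its SHAP value is zero even though its coefficient is not.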

The dataset we use is the classic IMDB dataset from this paper. When explaining the model, it is interesting to see that words absent from the text are sometimes just as important as those that are present.

In [1]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import numpy as np
import shap

shap.initjs()


In [2]:
corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)

vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)
X_test = vectorizer.transform(corpus_test)
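
For intuition about what the vectorizer produces, here is a rough pure-Python sketch of TF-IDF with a `min_df` cutoff. It assumes scikit-learn's documented defaults (smoothed IDF, L2 row normalization) but uses plain whitespace tokenization instead of the library's regex tokenizer, so it is an approximation rather than a reimplementation. The three-document corpus and the name `tfidf_row` are made up for the example.

```python
import math
from collections import Counter

# Tiny invented corpus; the notebook uses the IMDB reviews instead.
docs = [
    "the film was great and moving",
    "the film was bad",
    "great acting and great pacing",
]
min_df = 2  # keep terms found in at least 2 documents (the notebook uses 10)

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
vocab = sorted(term for term, count in doc_freq.items() if count >= min_df)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    row = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else row

X = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)
```

Words that do not occur in a document get a feature value of exactly 0, which is why absent words can still receive a SHAP value later on.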


## Fit a linear logistic regression model

In [3]:
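# liblinear supports the L1 penalty; set it explicitly because newer scikit-learn versions default to lbfgs
model = sklearn.linear_model.LogisticRegression(penalty="l1", C=0.1, solver="liblinear")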
model.fit(X_train, y_train)

Out[3]:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

## Explain the linear model

In [4]:
# note: newer shap releases rename this option to feature_perturbation="interventional"
explainer = shap.LinearExplainer(model, X_train, feature_dependence="independent")
shap_values = explainer.shap_values(X_test)
X_test_array = X_test.toarray() # we need to pass a dense version for the plotting functions


### Summarize the effect of all the features

In [5]:
shap.summary_plot(shap_values, X_test_array, feature_names=vectorizer.get_feature_names())


### Explain the first review's sentiment prediction

Remember that a higher model output means the review is more likely to be positive, so in the plots below the "red" features raise the chance of a positive review, while the "blue" features lower it. What is not present in the text (like bad=0 below) is often just as important as what is in the text. Remember that the feature values are TF-IDF values. Interestingly, "and" is the most important feature of this text, perhaps because it captures some high-level notion of the text structure (having lots of "and"s apparently indicates a more positive review).
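
To see why an absent word such as bad=0 can carry weight, here is a small sketch that applies the linear SHAP formula $\phi_i = \beta_i \cdot (x_i - E[x_i])$ to a single feature. The coefficient, the average TF-IDF value, and the base value are invented for illustration and are not read from the fitted model.

```python
import math

# Hypothetical values: a negative coefficient for "bad" and its average
# TF-IDF value across the background (training) reviews.
beta_bad = -4.0
mean_tfidf_bad = 0.05

# The word does not occur in this review, so its TF-IDF value is 0.
x_bad = 0.0

# Linear SHAP value in log-odds: beta * (x - E[x]).
# It is positive here: lacking a negatively weighted word raises the model output.
phi_bad = beta_bad * (x_bad - mean_tfidf_bad)

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Effect on the predicted probability, starting from a hypothetical
# base value of 0 log-odds (probability 0.5).
base_value = 0.0
p_base = sigmoid(base_value)
p_shifted = sigmoid(base_value + phi_bad)
print(phi_bad, p_base, p_shifted)
```

Because the contribution is measured against the average review, a word that usually appears with some weight and is missing here shifts the prediction just as a present word would.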

In [6]:
ind = 0
shap.force_plot(
    explainer.expected_value, shap_values[ind,:], X_test_array[ind,:],
    feature_names=vectorizer.get_feature_names()
)

In [7]:
print("Positive" if y_test[ind] else "Negative", "Review:")
print(corpus_test[ind])

Positive Review:



### Explain the second review's sentiment prediction

In [8]:
ind = 1
shap.force_plot(
    explainer.expected_value, shap_values[ind,:], X_test_array[ind,:],
    feature_names=vectorizer.get_feature_names()
)
