SHAP and LIME: ML Model Explainability
An XGBoost model achieves AUC 0.91 on validation. In production, unexpected predictions appear — high scores for clearly irrelevant objects. Feature importance from the boosting itself shows top-10 features, but doesn't explain a specific prediction. Why did this specific object get a score of 0.87?
SHAP and LIME answer different versions of this question. It's important to understand when to apply each method and where they break down.
SHAP: Theory and Practice
SHAP (SHapley Additive exPlanations, Lundberg & Lee, 2017) is based on Shapley values from cooperative game theory. The idea: a feature's contribution to a prediction is its average marginal contribution across all possible feature coalitions.
Key property: additivity. The sum of SHAP values for all features + base value (average model prediction) = the specific prediction. This is a mathematically exact decomposition, not an approximation.
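This additivity can be checked by hand on a toy model. The sketch below (pure Python; the linear model, weights, and baseline are illustrative choices, not the SHAP library's internals) enumerates every coalition for a 3-feature model and verifies that base value + sum of Shapley values reproduces the prediction exactly:

```python
from itertools import combinations
from math import factorial

# Toy linear "model": f(x) = 2*x0 + 3*x1 - 1*x2 (weights are illustrative)
weights = [2.0, 3.0, -1.0]
baseline = [0.5, 0.5, 0.5]   # background means used for "missing" features
x = [1.0, 0.0, 2.0]          # the instance we explain

def f(values):
    return sum(w * v for w, v in zip(weights, values))

def value(coalition):
    # Features in the coalition keep their real value; the rest fall back to baseline
    return f([x[i] if i in coalition else baseline[i] for i in range(len(x))])

def shapley(i, n):
    # Average marginal contribution of feature i over all coalitions of the others
    phi = 0.0
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi += weight * (value(set(S) | {i}) - value(set(S)))
    return phi

n = len(x)
phis = [shapley(i, n) for i in range(n)]
base_value = f(baseline)

# Additivity: base value + sum of Shapley values equals the prediction exactly
assert abs(base_value + sum(phis) - f(x)) < 1e-9
```

For a linear model each Shapley value collapses to weight × (feature value − baseline mean), which is why the decomposition here is exact rather than approximate.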
TreeSHAP — Why Architectural Specialization Matters
For tree-based models (XGBoost, LightGBM, CatBoost, sklearn RandomForest), there's TreeSHAP — an algorithm with polynomial complexity O(TLD²), where T is the number of trees, L is the maximum number of leaves, D is depth. This is orders of magnitude faster than naive KernelSHAP.
```python
import shap
import xgboost as xgb

model = xgb.XGBClassifier()
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Waterfall plot for a specific prediction
shap.plots.waterfall(explainer(X_test)[0])

# Summary plot — global importance
shap.summary_plot(shap_values, X_test)
```
In practice, TreeSHAP on a LightGBM model with 500 trees processes 10,000 examples in 2-3 seconds on CPU. Perfectly acceptable for batch inference.
DeepSHAP and GradientSHAP for Neural Networks
For neural networks, TreeSHAP is inapplicable. We use:
- DeepSHAP (DeepLIFT + SHAP): propagates contributions layer by layer using DeepLIFT's backpropagation rules. Fast, but limited to supported layer types.
- GradientSHAP: averages gradients along paths from randomly sampled baselines to the input (expected gradients). Faster than KernelSHAP but requires a differentiable model.
- KernelSHAP: model-agnostic, works for any black-box model. Slow — an exact explanation of one object requires 2^n - 2 model evaluations over feature coalitions. In practice, sampling is used (nsamples=100-1000).
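The sampling idea behind coalition-based attribution can be illustrated with a simplified Monte-Carlo permutation sketch. Note this is not the library's algorithm — KernelSHAP proper fits a weighted linear regression over sampled coalitions — and the toy model and baseline here are illustrative assumptions:

```python
import random

# Toy non-additive "black box": f(x) = x0*x1 + x2 (illustrative)
def f(v):
    return v[0] * v[1] + v[2]

baseline = [0.0, 0.0, 0.0]   # "missing" features fall back to this reference point
x = [1.0, 2.0, 3.0]          # the instance we explain
n = len(x)

def value(coalition):
    return f([x[i] if i in coalition else baseline[i] for i in range(n)])

random.seed(0)               # fix the seed: sampling-based attribution is stochastic
num_samples = 2000
phi = [0.0] * n
for _ in range(num_samples):
    order = random.sample(range(n), n)   # random feature ordering
    S = set()
    for i in order:
        before = value(S)
        S.add(i)
        phi[i] += value(S) - before      # marginal contribution of i given S
phi = [p / num_samples for p in phi]

# The telescoping sum guarantees attributions add up to f(x) - f(baseline)
assert abs(sum(phi) - (value(set(range(n))) - value(set()))) < 1e-9
```

Per-feature estimates converge to the exact Shapley values as the sample count grows, which is why production setups trade nsamples against latency.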
For BERT and transformers — a separate story. SHAP on transformers via partition explainer works, but latency for explaining one text can reach 30-60 seconds with 512 tokens. For production, a trade-off is usually needed: explanations are generated asynchronously on request.
LIME: Local Approximation
LIME (Local Interpretable Model-agnostic Explanations, Ribeiro et al., 2016) works differently: it generates a cloud of random perturbations around the object, queries the black-box model for a prediction on each, and fits a simple interpretable model (sparse linear regression or a shallow decision tree) to this local dataset, weighting samples by proximity to the original object.
When LIME is better than SHAP:
- Model isn't supported by TreeSHAP and is too slow for KernelSHAP
- Need explanations in terms of "super-pixels" for images (LIME for CV)
- Need text explainability with word highlighting (LIME for NLP)
Stability problem: LIME is a stochastic algorithm. With different random_state values, explanations for the same object can differ significantly. In production, we fix the seed and use a large number of perturbations (num_samples=5000+).
```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=feature_names,
    class_names=['negative', 'positive'],
    mode='classification',
    random_state=42  # fix the seed: LIME's perturbations are stochastic
)

explanation = explainer.explain_instance(
    X_test.values[0],
    model.predict_proba,
    num_features=10,
    num_samples=5000
)
explanation.show_in_notebook()
Method Comparison
| Characteristic | TreeSHAP | KernelSHAP | LIME |
|---|---|---|---|
| Applicability | Trees only | Any model | Any model |
| Mathematical accuracy | Exact | Exact only with full enumeration; approximation with sampling | Approximation |
| Stability | Deterministic | Stochastic when sampling | Stochastic |
| Speed (10k objects) | Seconds | Hours | Minutes |
| Text/image support | No | No natively | Yes |
Integration into Production ML Pipeline
Explanations are needed not only for auditing — they're part of the operational pipeline.
Real case: client is an insurance company, premium calculation model (LightGBM, 120 features). Requirement: agent must be able to explain to customer over the phone why the premium is high.
Solution: TreeSHAP in inference API. For each prediction, top-3 features with highest SHAP values are returned + automatic text template. "Your premium is above average for the following reasons: vehicle age (impact +12%), registration region (impact +8%), payment history (impact +6%)".
Latency overhead: 35ms for TreeSHAP with average inference of 18ms — acceptable.
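A minimal sketch of the top-3 extraction step. The helper name, feature names, and the mapping of SHAP values to "+12%"-style percentages are illustrative assumptions, not the client's actual code:

```python
# Hypothetical helper: turn one row of per-feature SHAP values into
# the agent-facing message. Treating a SHAP value of 0.12 as "+12%"
# is an assumption for illustration.
def explain_premium(shap_row, feature_names, top_k=3):
    contribs = sorted(zip(feature_names, shap_row),
                      key=lambda t: t[1], reverse=True)
    reasons = [f"{name} (impact {val:+.0%})"
               for name, val in contribs[:top_k] if val > 0]
    return ("Your premium is above average for the following reasons: "
            + ", ".join(reasons))

msg = explain_premium(
    shap_row=[0.12, -0.03, 0.08, 0.06],
    feature_names=["vehicle age", "no-claims history",
                   "registration region", "payment history"],
)
```

Only positive contributions are surfaced, since the message explains why the premium is above average; features pushing the premium down would need a separate template.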
Monitoring: SHAP values are logged to ClickHouse. Once a week we aggregate — drift in SHAP value distribution signals feature drift earlier than AUC drop.
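One way to quantify that drift is a population stability index (PSI) over each feature's SHAP-value distribution between a reference week and the current week. A stdlib sketch — the binning scheme is an assumption, and 0.2 is a common rule-of-thumb threshold rather than anything from the case above:

```python
from math import log

def psi(reference, current, n_bins=10):
    """Population stability index between two samples of one feature's SHAP values."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * k / n_bins for k in range(1, n_bins)]

    def hist(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # out-of-range values land in edge bins
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)

    r, c = hist(reference), hist(current)
    return sum((ci - ri) * log(ci / ri) for ri, ci in zip(r, c))

last_week = [i / 100 for i in range(100)]   # illustrative SHAP values for one feature
this_week = [v + 0.5 for v in last_week]    # shifted distribution
# psi(last_week, last_week) is 0.0; psi(last_week, this_week) is well above 0.2
```

Running this per feature on the weekly aggregates gives a single drift score per feature, which is easy to alert on.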
Limitations to Know About
SHAP ≠ causality. High SHAP value for a feature means correlation with the prediction, not causation. "Feature X impacts prediction" ≠ "changing X will change the result in reality".
Multicollinearity breaks interpretation. If two features are strongly correlated (r > 0.8), SHAP splits their joint influence between them in a way determined by the model's internals, not by any causal structure. Run a correlation analysis before interpreting per-feature results.
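A quick pre-interpretation check is to flag highly correlated pairs before reading per-feature SHAP values. A stdlib sketch — the data and feature names are illustrative, and the 0.8 threshold mirrors the rule of thumb above:

```python
def pearson(a, b):
    # Pearson correlation coefficient between two equal-length samples
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b)

def correlated_pairs(columns, names, threshold=0.8):
    """Pairs whose |r| exceeds the threshold: interpret their SHAP values jointly."""
    flagged = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = pearson(columns[i], columns[j])
            if abs(r) > threshold:
                flagged.append((names[i], names[j], r))
    return flagged

cols = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]  # illustrative feature columns
pairs = correlated_pairs(cols, ["age", "mileage", "noise"])
```

For flagged pairs, it is safer to report the sum of their SHAP values as one combined factor than to rank them individually.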
For LLMs, both methods give only rough estimates. Attention weights can be more informative for generation tasks, but they are not a strict proxy for importance either.
Timeline: implementing SHAP/LIME in an existing pipeline — 1-2 weeks. Building a monitoring pipeline with SHAP-based drift detection — 3-4 weeks.