[Paper Review] Statistical Modeling: The Two Cultures, Leo Breiman

Yoonseul Choi
5 min read · Jan 16, 2023


My team considers ‘Paper Review’ a project for developing one’s competency, so each week I’m going to upload at least one paper review on Medium with a post title starting with ‘[Paper Review]’.

The format will be as follows:

  • Paper Title
  • Author
  • Publication
  • Citation
  • Proposed
  • Analysis
  • Review
  • References

Paper Title | Statistical Modeling: The Two Cultures

Author | Leo Breiman (Statistician @ Univ. of California, Berkeley)

Publication | Statistical Science 16(3), Institute of Mathematical Statistics (2001)

Citation | 4,874 (Google)

“Statistical Modeling: The Two Cultures” is a paper by Leo Breiman, a statistician at the University of California, Berkeley, in which he argues that there are two cultures in statistical modeling: the data modeling culture, which assumes the data are generated by a given stochastic model, and the algorithmic modeling culture, which treats the data mechanism as unknown and judges models by predictive accuracy. Breiman argues that the statistical community’s almost exclusive commitment to data models has led to questionable conclusions and irrelevant theory, and that algorithmic modeling deserves far more attention from statisticians and data scientists. The paper is considered a classic in the fields of statistics and machine learning.

Proposed | The paper introduces two positions (cultures) on modeling. Drawing on his own experience with both, the author explains the disadvantages of probabilistic data models and, in comparison, the advantages of machine learning (algorithmic) models.

Analysis | The author divides approaches to modeling real-life data into probability (data) models and algorithmic models.

(Figure from the paper: the data modeling and algorithmic modeling diagrams)

The upper box represents probability models such as linear regression, logistic regression, and the Cox model. The lower box represents algorithmic models such as decision trees and neural nets, where the inside of the box is treated as complex and unknown.

Problems in Current Data Modeling

In the data modeling culture, model validation is done by goodness-of-fit tests and residual examination, while in the algorithmic culture it is done by predictive accuracy. Goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power and will not reject until the lack of fit is extreme.

“The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data.”

“If all a man has is a hammer, then every problem looks like a nail.”

The Multiplicity of Data Models

The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and the response. Multiplicity arises because goodness-of-fit tests and residual analysis lack the power to distinguish among the many different models that fit the data about equally well.

More complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature’s mechanism.

Algorithmic Modeling

Algorithmic models assume that nature produces data in a black box whose insides are complex, mysterious, and partly unknowable, and that the data are drawn i.i.d. from an unknown multivariate distribution.

Rashomon and the Multiplicity of Good Models (A Rashomon effect)

The Rashomon effect is named after the film in which witnesses give contradictory accounts of the same incident, making the case impossible to resolve. In this paper it refers to the multitude of different models that achieve similar error rates.

Such models are close to each other in terms of error, yet distant in terms of the form of the model.

  • A slight perturbation of the data, or deleting a few unimportant variables, can produce a quite different model
  • Aggregating over a large set of competing models can be a solution (a rough sketch of this instability follows below)
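As a rough illustration (my own sketch, not from the paper), the following Python snippet fits a linear regression on bootstrap-perturbed versions of a synthetic training set; the coefficients change noticeably from trial to trial while the test error barely moves. The data-generating setup and the use of scikit-learn are assumptions for illustration only.

```python
# Hypothetical illustration of the Rashomon effect: perturbing the training
# set gives models that differ in form but have nearly the same test error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)           # correlated predictors
y = X[:, 0] + X[:, 2] + rng.normal(scale=0.5, size=n)

X_test = rng.normal(size=(1000, p))
X_test[:, 1] = X_test[:, 0] + 0.1 * rng.normal(size=1000)
y_test = X_test[:, 0] + X_test[:, 2] + rng.normal(scale=0.5, size=1000)

for trial in range(3):
    idx = rng.choice(n, size=n, replace=True)           # perturb the training set
    model = LinearRegression().fit(X[idx], y[idx])
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"trial {trial}: test MSE = {mse:.3f}, "
          f"coefs on x0/x1 = {model.coef_[:2].round(2)}")
```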

Occam and Simplicity vs. Accuracy

It refers to the idea that, for the same effect, a less complex model is the better model. However, the author argues that accuracy and simplicity are at odds with each other, so prioritizing simplicity can reduce the accuracy of the model.

Growing Forests for Prediction

Among the author’s projects, modeling of court sentence times with a simple tree model is given as an example.
The tree was easy to interpret, but its prediction accuracy was poor. As a remedy, the forest technique was proposed: growing trees on slightly perturbed training sets and ensembling them can increase the prediction accuracy of the model.

  • Growing a forest by perturbing the training set
  • Growing a tree on the perturbed training set, perturbing the training set again, growing another tree, and so on
  • Examples: bagging, boosting, arcing, additive logistic regression
(Figure: Bagging and Boosting. Spreadsheet, robot and idea icons by Freepik on Flaticon)

As the table below shows, the forest technique can reduce the test error to between one half and one third of that of a single tree. Hence the author’s point: “We need complex prediction models.”

(Table from the paper: test-set error of a single tree vs. a forest)
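A minimal bagging sketch (my own example, not the author’s code or data), assuming scikit-learn and a synthetic classification problem: trees are grown on bootstrap-perturbed training sets and their votes are aggregated, which typically cuts the test error well below that of a single tree.

```python
# Sketch: grow a forest by perturbing (bootstrapping) the training set and
# aggregating the trees; compare test error with a single tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           random_state=0).fit(X_tr, y_tr)

print("single tree test error:", round(1 - single_tree.score(X_te, y_te), 3))
print("bagged forest test error:", round(1 - forest.score(X_te, y_te), 3))
```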

Bellman and the Curse of Dimensionality

The curse of dimensionality means that learning becomes harder as the dimension of the data increases, so high dimensionality is usually regarded as a danger in data analysis. The author, however, regards it as a blessing.

Reducing the dimension is equivalent to reducing the amount of information available for prediction. Instead of reducing the dimension, the author increases it by adding many functions of the predictor variables.
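As a hedged sketch of this idea (my own example, not the author’s), one way to add functions of the predictors is a polynomial feature expansion before fitting a regularized linear model; the dataset, the PolynomialFeatures/Ridge choices, and the degree are assumptions for illustration.

```python
# Sketch: instead of reducing the dimension, increase it by adding functions
# of the predictors (here, all degree-2 polynomial terms).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 5))
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2 + 0.1 * rng.normal(size=500)  # nonlinear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

plain = Ridge(alpha=1.0).fit(X_tr, y_tr)                        # original predictors
expanded = make_pipeline(PolynomialFeatures(degree=2),
                         Ridge(alpha=1.0)).fit(X_tr, y_tr)      # expanded predictors

print("R^2, original predictors:", round(plain.score(X_te, y_te), 3))
print("R^2, added functions    :", round(expanded.score(X_te, y_te), 3))
```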

The Shape Recognition Forest

  1. Shallow trees are grown
  2. At each node, 100 features are chosen at random from the appropriate level of the hierarchy
  3. The optimal split of the node based on the selected features is found (a rough scikit-learn analogue follows below)
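A loose analogue of this recipe (my own assumption, not the paper’s implementation): a random forest with shallow trees where only a random subset of the features is considered at each node. The digits dataset stands in for the shape-recognition data purely for illustration.

```python
# Rough analogue: shallow trees, and at each node only a random subset of the
# features is considered when searching for the optimal split.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)   # stand-in for the shape-recognition data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    max_depth=4,       # shallow trees
    max_features=8,    # features chosen at random at each node
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", round(forest.score(X_te, y_te), 3))
```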

Support Vector Machine (SVM)

(Figure from the paper: an optimal separating hyperplane)

Optimal is defined as meaning that the distance of the hyperplane to any prediction vector is maximal. Adding functions of the predictors allows the SVM to work well in high dimensions, increasing prediction accuracy and lowering error rates.
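A minimal SVM sketch (my own illustration on an assumed synthetic dataset, not the author’s experiment): the RBF kernel implicitly adds functions of the predictors, so a maximal-margin hyperplane in that higher-dimensional space can separate data that no hyperplane in the original space can.

```python
# Sketch: maximal-margin classifiers; the RBF kernel implicitly adds functions
# of the predictors, letting the hyperplane separate non-linear structure.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)  # hyperplane in the original 2-D space
kernel_svm = SVC(kernel="rbf").fit(X_tr, y_tr)     # hyperplane after adding functions of predictors

print("linear SVM test accuracy:", round(linear_svm.score(X_te, y_te), 3))
print("kernel SVM test accuracy:", round(kernel_svm.score(X_te, y_te), 3))
```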

Information from a Black Box

  • Higher predictive accuracy is associated with more reliable information about the underlying data mechanism.
  • Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism.

The focus should be on solving the problem rather than on asking what data model can be created.

Review | It was surprising that the methods the author proposed in 2001 are now used in so many ways. At the end, the author says not to ask where the data came from, but to look for a methodology that solves the problem. I agree with this; we should make an effort to apply various techniques without being buried in a single statistical perspective.

References |

Breiman, Leo. “Statistical modeling: The two cultures (with comments and a rejoinder by the author).” Statistical science 16.3 (2001): 199–231.

[Paper Review] Statistical Modeling : The Two Cultures (https://www.youtube.com/watch?v=lS6KqOqx6bc)


Written by Yoonseul Choi

Data Scientist, AI/DX Team, Mediplus Solution Co., Ltd. Master's degree in Statistics from Hanyang University. R / Python. Based in Seoul.
