Data availability is exploding. This article explores the question: “Do we have the right tools in our decision-making toolbox to effectively use that data?” Having a working understanding of statistics will help the business decision-maker anticipate risks, manage uncertainty, and improve decision quality. This article addresses business decision-making through a no-nonsense statistics, risk, and uncertainty lens. We provide examples along the way of how and when to use and NOT to use statistics.
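This article is presented in the following sections: Statistical Moments Background, The Tyranny of the Average, Risk and Uncertainty, Goodhart's law, Statistical and Simulation-based Decision Making, and a Conclusion, followed by Notes.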
1. Statistical Moments Background
Many business activities produce data that is then rendered into summary statistics to help make decisions. So, if data is a representation of the business system, then summary statistics are like the all-important Rosetta Stone of that business system. Decision-makers tend to focus on a potentially biased subset of the summary statistics - the data averages. These same decision-makers are more likely to discount or ignore the other statistical “moments.” This may create biasing decision challenges, especially concerning risk management.
Next are the 5 moments associated with probability distributions. We often focus on the average (1st), sometimes the variance (2nd), and rarely the skewness (3rd) or kurtosis (4th).
0th moment - total probability (i.e. one or unity),
1st moment - the expected value (like an average),
2nd moment - the variance,
3rd moment - the skewness,
4th moment - the kurtosis.
(Please follow the links to brush up on the statistical moments. Knowledge of statistics is helpful but not necessary to appreciate this article.)
All moments are important, with the 3rd and 4th being particularly important to understand the difference between RISK and UNCERTAINTY. I will start with examples highlighting variance challenges. Later, we will explore risk, uncertainty, and the practical implications of Goodhart's law. The article concludes with a few decision-making rules of thumb, including the use of simulations to help manage uncertainty.
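To make the moments concrete, here is a minimal sketch that computes the 1st through 4th moments for two small data sets. The data (hypothetical daily order counts), the function name `moments`, and the choice of population formulas are assumptions made for illustration only.

```python
import math

def moments(data):
    """Return mean, variance, skewness, and excess kurtosis (population formulas)."""
    n = len(data)
    mean = sum(data) / n
    # Central moments: the average of (x - mean) raised to the k-th power
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    sd = math.sqrt(m2)
    skew = m3 / sd ** 3 if sd > 0 else 0.0
    # Subtracting 3 makes a normal distribution score roughly 0
    excess_kurt = m4 / sd ** 4 - 3.0 if sd > 0 else 0.0
    return {"mean": mean, "variance": m2, "skewness": skew, "excess_kurtosis": excess_kurt}

# Hypothetical daily order counts: same average, very different shapes
steady = [48, 49, 50, 50, 50, 51, 52]
lumpy = [40, 41, 42, 43, 44, 45, 95]

for name, data in (("steady", steady), ("lumpy", lumpy)):
    print(name, {k: round(v, 2) for k, v in moments(data).items()})
```

Both data sets share the same average, so a report that shows only the 1st moment cannot tell them apart. The variance and skewness are what reveal the single outsized day in the second data set.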
2. The Tyranny of the Average
The following are examples of when the average obscures the truth as revealed by the variance and other statistical moments. The examples demonstrate how an average may be deceiving, obscuring risk, and causing misinformed decisions.
Credit scoring and loan pooling
There is a practice of using weighted averages (or “WA”) to estimate the credit risk of loan pools. This is common in loan pooling and securitization sales, such as Residential Mortgage-Backed Securities (RMBS) or Asset-Backed Securities (ABS). In terms of scale and impact, the U.S. RMBS market is one of the largest securitized asset pools on the planet. As of the writing of this article, it maintains an unpaid balance of about $11 Trillion. Understanding credit risk is critical to pricing loan pools, anticipating needed loan servicing operations, reserving for credit losses, and other factors. Here is a quote from the Asset Securitization Report (1):
With the stronger borrower credit profile, COLT 2018-4 features {a big Mortgage Bank's} highest-ever weighted average FICO (724), highest borrower income ($344,782), lowest WA coupon (6.059%) and highest average loan balance ($570,842) than in any of its prior sponsored asset-backed deals.
The weights are for different loan balance levels, with the idea being loss and other economic drivers are based on balance at risk for a given loan unit vs. the individual loan unit alone. This means that a lower risk, high balance loan could contribute less risk to the aggregate pool than a higher risk, low balance loan. FICO score is an industry-standard credit score. The score is a good indicator of individual loan creditworthiness. However, WA FICO may be deceiving. This occurs because the FICO score, like life, is non-linear. The FICO score doesn’t scale consistently from the individual loan borrower to a pool of many loans. PLEASE NOTE: the FICO score rank orders credit risk from high to low. Meaning, a lower FICO score predicts higher credit risk, as compared to a higher score predicting lower credit risk.
To help explain, here is a simple math example of 2 loan pools with 2 loans each, with the same balance (i.e., weighting factor) and different component credit scores:
Pool1 - credit scores are 600 and 800: WA FICO = 700
Pool2 - credit scores are 700 and 700: WA FICO = 700
Since both these pools have the same WA FICO, the 2 pools have the same credit risk, correct? Using your "Statistical Moments" knowledge, the answer is "Not at all!"
There is a tremendous variance difference between the two pools' scores: pool1 has a standard deviation of 100, while pool2 has a standard deviation of 0. Beyond the impact of the score variance difference, pool1 has substantially higher credit risk because low scores have disproportionately higher loss odds than high scores. (2) That is, loss odds have a non-linear (roughly exponential) relationship to the score itself. As pointed out in note 2, pool1 has almost double the credit risk as compared to pool2, even though they have the same weighted average FICO. It is possible to renormalize credit risk understanding from the individual to the pool level, but it takes more work, with judgment and alternative techniques and measures.
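The two-pool example can be sketched in a few lines. The score-to-risk curve below is purely hypothetical: it assumes good-to-bad odds of 20:1 at a score of 700 and that the odds double every 40 points. These parameters are illustrative assumptions, not FICO's published odds, so the size of the gap will differ from the figure cited in note 2. Only the direction of the result is the point.

```python
import statistics

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def bad_probability(score, base_score=700, base_odds=20.0, points_to_double=40):
    """Hypothetical curve: good:bad odds double every `points_to_double` points."""
    good_odds = base_odds * 2 ** ((score - base_score) / points_to_double)
    return 1.0 / (1.0 + good_odds)

balances = [1.0, 1.0]  # equal balances, so equal weights
pools = {"pool1": [600, 800], "pool2": [700, 700]}

results = {}
for name, scores in pools.items():
    results[name] = {
        "wa_fico": weighted_average(scores, balances),
        "score_sd": statistics.pstdev(scores),  # population standard deviation
        "wa_bad_rate": weighted_average([bad_probability(s) for s in scores], balances),
    }
    print(name, results[name])
```

Under these assumed parameters, both pools report a WA FICO of 700, yet the pool holding the 600 and 800 scores carries a noticeably higher balance-weighted bad rate. The risk added by the 600 score is far larger than the risk removed by the 800 score.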
A potentially malicious practice of “pool stuffing” could occur where high-risk loans would be “stuffed” into loan pools that are otherwise lower risk. This occurs because the seller/stuffer knows the WA approach will obscure the higher-risk loans. It could be a way to sell loans that would be otherwise harder to market on their own.
Investment performance benchmarks and manager risk-taking
A related example is provided by economist and former Governor of the Reserve Bank of India Raghuram Rajan (3) in a supporting paper for his book Fault Lines. “The emphasis on relative performance evaluation in compensation creates further perverse incentives. Since additional risks will generally imply higher returns, (investment) managers may take risks that are typically not in their comparison benchmark (and hidden from investors) so as to generate the higher returns to distinguish themselves.”
Similar to the WA FICO example, Rajan suggests investment managers may take non-linear risks and hide them within the comparative performance benchmark. Like WA FICO, the comparative benchmark is like a weighted average. A single higher risk security may maintain the overall investment pool within the benchmark risk profile while providing a non-linear increase in return.
The tyranny of the average highlights an additional risk concern, known as the agency dilemma:
Agency problems are often manifest via the tyranny of the average. In the case of a bank, if it is keeping the loans in its portfolio, then its incentives should be aligned. Since the agent and principal are the same, this results in little agency impact. If a bank is selling its loans in securitization structures, the incentives may become misaligned because the seller (the agent) does not hold the credit risk. Thus the buyer (the principal) may be taking risks born of agency misalignment. The agency dilemma was a core driver of the mortgage crisis that started in 2007. As an example, please see this quote from a mortgage crisis legal settlement brief:
"{A Big Bank} employees even referred to some loans they securitized as 'bad loans,' 'complete crap' and ‘[u]tter complete garbage.’"
- U.S. Department of Justice, bolding added
3. Risk and Uncertainty
There is a subtle but high impact difference between risk and uncertainty. This difference is mostly found in the different statistical moments and the degree to which the future is probabilistically available.
The normal distribution is generally associated with a well-behaved variance, no skewness, and very little kurtosis (thickness of tails). This is used to measure RISK. Risk is sometimes described as the “Known Unknown.” That is, while the future may not be certain, the future may be found on a probability distribution.
In the real world, especially when there is turbulence (like an erratic movement in the stock market or a dynamic credit environment), the distributions are rarely normal. In fact, the normal distribution should be renamed the abnormal distribution! Calm, reasonably consistent economic environments are the domain of risk. The fast transitioning, recursive, turbulent environments are the domain of UNCERTAINTY. Uncertainty is sometimes described as the “Unknown Unknown.” That is, the future is not certain and, on top of that, it is difficult to confidently measure a probabilistic future.
Hope for the best (calm), plan for the worst (turbulence).
By definition, turbulent environments create inertia and dependence between observations. Distributions starting as calm, normal distributions may quickly transition to turbulence. As such, kurtosis is usually high and the distribution is significantly skewed. The normal distribution assumption may be one of the biggest misnomers in the history of decision-making. In a turbulent environment, it is anything but normal! Turbulent environments are a descriptive hallmark of UNCERTAINTY. Through the use of Monte Carlo simulations, the business decision maker may overcome the challenges of turbulence and uncertainty. In the next section, we will cover how averages and distributions may be gamed. Then, we will cover rules of thumb, including the use of simulations, to help the decision-maker in both calm and turbulent environments.
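To see why thick tails matter, the sketch below compares how often extreme moves show up in a "calm" normal world versus a "turbulent" thick-tailed one with the same variance. The Student-t distribution with 3 degrees of freedom is only an assumed stand-in for turbulence. Real uncertainty is harder, because the distribution itself may shift. The sample size, seed, and 4-standard-deviation threshold are also illustrative assumptions.

```python
import math
import random

def standardized_student_t(rng, df):
    """Draw from a Student-t distribution rescaled to variance 1 (requires df > 2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with `df` degrees of freedom
    t = z / math.sqrt(chi2 / df)
    return t / math.sqrt(df / (df - 2.0))   # a raw t has variance df / (df - 2)

def count_extremes(sample, threshold=4.0):
    """Count observations more than `threshold` standard deviations from zero."""
    return sum(1 for x in sample if abs(x) > threshold)

rng = random.Random(42)
n = 100_000
calm = [rng.gauss(0.0, 1.0) for _ in range(n)]
turbulent = [standardized_student_t(rng, df=3) for _ in range(n)]

calm_extremes = count_extremes(calm)
turbulent_extremes = count_extremes(turbulent)
print("moves beyond 4 SD, calm:", calm_extremes)
print("moves beyond 4 SD, turbulent:", turbulent_extremes)
```

Both samples are built to have the same average and the same variance, so the first two moments alone would rate them as equally risky. The count of extreme moves is where the thick-tailed sample separates itself.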
4. Goodhart's law
Goodhart's law is an expression named after British economist Charles Goodhart, who advanced the idea in a 1975 article on monetary policy in the United Kingdom:
"Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."
You may ask, what does monetary theory have to do with business decision-making? In this case, very much. Another way to state Goodhart’s observation is,
“When a measure becomes a target, it ceases to be a good measure.”
Goodhart's law relates to common decision-making activities where a measure (like a customer service score) becomes a performance target (like using the score to manage the performance of customer service representatives). Goodhart's law plays out in many business and educational contexts. The following are two examples:
Example 1: In college, academic performance is chiefly mediated by the grade point average or GPA, thus a measure. Students know that good grades are the ticket to a good job. A good recruiting signal is created by a strong GPA, thus a target. (4) Actually, much has been written on college and employer signaling via the GPA. One particularly intriguing book is by Bryan Caplan called The Case Against Education. Goodhart's law is playing out when students target professors and classes for the "Easy A" instead of focusing on building content knowledge and intellectual rigor. A common practice today is to use services like Rate My Professor to help students cherry-pick Easy A professors. (5)
Example 2: In business, because we measure so much, there are many examples. This example relates to employee performance management. A common employer practice is to use employee productivity measures to drive raises, promotions, and bonus decisions. In some cases, this could drive bad behaviors consistent with Goodhart's law. In the professional services arena, potential bad behaviors related to compensation and promotion may include: a) billing and hour reporting exaggerations, b) staff billing at a high rate while sacrificing personal health or skills training investment, and c) decision-maker over-reliance on measures considered for promotion or compensation changes. Companies generally have policies to control these potential behaviors. The point is, the use of certain measures as targets may create perverse incentives contrary to long-term company health. It is up to company leadership to be vigilant and actively manage the risks associated with these sorts of perverse incentives. It is a slippery slope and hard to control. In the case of compensation and promotion decisions, I question whether the use of measures as hard targets is worth the perverse incentive risks. Especially in a large company. (6)
Goodhart's law often creates incentives for participants to game the measures. As suggested by Goodhart's law, gaming the measure leads to reducing the information value of the measure itself. Both of these examples remind me of the old saying "You get what you pay for." When using measures as targets, it requires thoughtful and consistent management to ensure you do not end up on the wrong side of unintended consequences.
5. Statistical and Simulation-based Decision Making
From my experience, averages are used to drive many business activities, including operations, personnel, marketing, credit risk, market risk, and others. Decision-makers tend to focus on the averages (1st moment), have a vague understanding of the variance (2nd moment), and are pretty oblivious to the 3rd and 4th moments. In the case of Goodhart’s law, the use of certain measurement averages may diminish their value. This is why there tends to be such a surprise when fast market changes occur, uncertainty increases, and goals are not met.
Here are a few rules of thumb for using statistics in your day-to-day decision-making:
Averages used for decision-making should include an associated variance. The higher the variance, the less the reliance (decision weight) should be afforded the average.
If a current distribution is known to be skewed or has thicker than desired tails (kurtosis), it should either not be used or corrected by controlling for those factors. Kurtosis is a big deal with uncertainty. For example, we seem to be surprised by the frequency and severity of economic crises. The reality may be that we are applying "normal" statistical thinking to build our pre-crisis beliefs. Instead, our beliefs should be adjusted by the fact that financial crises are more regular, suggesting a much higher kurtosis and a non-normal distribution.
Any distribution should be evaluated for reasonableness. Does it have enough observations? Does it represent our business reality? Is the measure both accurate and precise enough for reliance? (7) Do we understand the non-linear relationships? Perhaps there is a good reason why a distribution is abnormal.
Care should be taken when using a measure as a target. Consider periodic rotating of measures to protect the measure information value (like how crop rotation is used to protect the nutrient value of farmland) or using more professional judgment to consider several different measures and information sources.
Finally, judgment is key to thinking about future distribution stability. (8) It could be “normal” today, but what if a little heat is applied and uncertainty increases? For example, future “heat” could include 1) a significant shock (pandemic), 2) a general economic downturn, or 3) long-term socio-economic trend changes, like the environmental (aka, ESG) changes occurring in our society. How stable do you expect the distribution to be under stress? What are the interactions between stressed decision elements? This is the mother of a normal distribution evolving into an abnormal distribution.
To interpret the graph:
All 4 distributions have an identical average and an identical number of observations. So our ability to draw decision inferences is the same, correct? I know, this is a nonsensical question now that you see the distributions and apply your "Statistical Moments" knowledge. The answer is "No, of course not!"
The normal distribution with the lower variance (SD = 4) is better for making decision inferences than the other higher variance distributions.
The skewed distribution is both skewed and has very thick tails. This should be avoided, if possible.
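As a rough illustration of these points, the sketch below builds three samples that share the same average and then asks how often a single observation lands close to that average. Only the SD = 4 case comes from the text. The average of 50, the SD of 12, the shifted lognormal used for the skewed case, and the band of plus or minus 5 are all assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(7)
n = 20_000
target_mean = 50.0

narrow = [rng.gauss(target_mean, 4.0) for _ in range(n)]  # SD = 4, as in the text
wide = [rng.gauss(target_mean, 12.0) for _ in range(n)]   # SD = 12 is an assumed value

# Skewed case: a scaled lognormal, shifted so its theoretical mean is also 50
sigma = 1.0
lognormal_mean = math.exp(sigma ** 2 / 2.0)
shift = target_mean - 10.0 * lognormal_mean
skewed = [shift + 10.0 * rng.lognormvariate(0.0, sigma) for _ in range(n)]

def share_within(sample, center, band):
    """Fraction of observations within `band` of `center`."""
    return sum(1 for x in sample if abs(x - center) <= band) / len(sample)

for name, sample in (("narrow", narrow), ("wide", wide), ("skewed", skewed)):
    print(
        name,
        "mean:", round(statistics.fmean(sample), 1),
        "median:", round(statistics.median(sample), 1),
        "share within 5 of 50:", round(share_within(sample, target_mean, 5.0), 2),
    )
```

The narrow sample keeps most of its observations near the average, so the average is a dependable summary. In the skewed sample the median sits below the average, a sign that a handful of large values are pulling the average away from the typical outcome.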
The Sim is your friend! A helpful and informative tool for understanding the potential impact of uncertainty is the Monte Carlo simulation. Regular development and application of simulations may be especially helpful when using chaotic data and/or when the future has a high degree of uncertainty. The recent pandemic or evolving ESG impacts are certainly good examples of an uncertain environment. Given the speed of business change, our global business interdependence, and our tribal-like cultural inertia, our business environment is increasingly likely to exhibit turbulence and related uncertainty. Please see our article series, Simulation-based Credit Analytics, for a Monte Carlo simulation example. This example shows how a simulation may be used in predicting credit loss, especially when historical data is either unavailable or loses significant predictive power. This framework may be generally applied to many environments.
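For readers who have not built one, a Monte Carlo simulation can be very small. The sketch below is not the model from the Simulation-based Credit Analytics series. It is a toy illustration in which every parameter (pool size, default probability, the chance and severity of a stress year, loss given default, and loan balance) is a made-up assumption.

```python
import random
import statistics

def simulate_pool_loss(rng, n_loans=200, base_default_prob=0.03, stress_prob=0.10,
                       stress_multiplier=4.0, loss_given_default=0.4, balance=100_000):
    """One trial: total loss for a loan pool in a single hypothetical year."""
    # A shared "stress" state makes defaults move together instead of independently
    stressed = rng.random() < stress_prob
    p = base_default_prob * (stress_multiplier if stressed else 1.0)
    defaults = sum(1 for _ in range(n_loans) if rng.random() < p)
    return defaults * balance * loss_given_default

rng = random.Random(2024)
losses = sorted(simulate_pool_loss(rng) for _ in range(5_000))

mean_loss = statistics.fmean(losses)
p99_loss = losses[int(0.99 * len(losses)) - 1]  # loss exceeded in roughly 1% of trials
print(f"average loss: {mean_loss:,.0f}")
print(f"99th percentile loss: {p99_loss:,.0f}")
```

The value of the exercise is the full list of simulated losses rather than any single number. The decision-maker can compare the average loss with the tail loss and rerun the trials under different stress assumptions to see how quickly the picture changes.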
Finally, a shout out to The Economist publication - they regularly publish the 2nd moment in the form of a confidence band on many of their relevant line charts and graphs. This is a best practice.
Conclusion
This article considered business decision-making through a no-nonsense statistics, risk, and uncertainty lens. We provided examples and related content along the way. This includes how averages may obscure significant risk challenges, the nuanced but significant differences between risk and uncertainty, plus some current examples related to Goodhart's law. Finally, we provided several rules of thumb and a simulation solution to help you operationalize using statistics in your day-to-day decision-making.
Truly, the numbers don't lie. With some practice, you will get the hang of interpreting the numbers' story. The effort is its own reward to make your business statistics a high-impact reflection of your business reality.
Notes
(1) To be clear, this is not to say the risk assertion made about the mortgage bank's loan securitization is wrong. The point is, additional statistical information is needed to confirm it, especially when comparing to other loan pools with similar WA characteristics. Just like you would not consider only mileage when buying a used car. At a minimum, I suggest a standard deviation should be disclosed with the weighted average. It would not hurt to have measures of skewness and kurtosis provided as well. However, there is no substitute for loan-level due diligence.
(2) The difference by score band is non-linear. In fact, score providers typically use logarithmic scaling (ln) by design. Specifically, higher score bands generally have higher percent loss odds changes and lower absolute loss odds changes from one band to the next. The opposite is true for lower scores. That is, lower score bands generally have lower percent loss odds changes, and higher absolute loss odds changes from one band to the next. In our 2 pool example, the loss odds for pool1 are almost double those for pool2. This is based on the “bad” odds presented on page 16 in the FICO Score Validation Guide.
You may wonder why anyone would use a weighted average given the double whammy of potential variance differences and non-linear credit score scaling. I have 2 possible answers. 1) Weighted average is intuitive, plus easy to calculate and report. 2) a weighted average is just one of many measures a more sophisticated buyer will use to understand a loan pool. Just like one would not only consider mileage to measure the performance of a car.
(3) I am a big fan of Dr. Rajan. His was a contrary voice in the central banking, lemming-like drumbeat of the time. He persistently raised awareness of the problems leading to our financial system collapse. In 2005, Dr. Rajan said the financial system was at risk “of a catastrophic meltdown.”
(4) For a more in-depth discussion on college signaling, please see the "College Arbitrage" section of our article: The Stoic’s Arbitrage: A survival guide for modern consumer finance products
(5) The ability to game a GPA is more likely in larger schools with multiple sections of the same required class. It is less likely in smaller schools or more advanced programs with limited or single section classes. For example, James Madison University has a STEM program called Quantitative Finance. It graduates 10-15 students per year. It is highly rigorous with limited class section selection. (Full disclosure, I am on the advisory board of JMU's College of Business Finance program.)
In case you doubt the college signaling thesis, here is a college signaling observation I first read in Caplan's book. Today, a common practice is for some colleges to allow virtually anyone to audit college classes. That is, at some notably prestigious colleges, someone can sit in on the class for free or almost free.
“Despite most universities restricting online auditing, there are still free courses available for auditing through online platforms that have partnered with universities. One such platform is edX, a nonprofit co-founded by Harvard and MIT. Through edX, students can audit courses from UC Berkeley, the University of Texas, Cornell, Dartmouth, and CalTech, to name a few.”
However, when one audits the class, while they will receive the educational content, they will not receive degree eligible grades. The point being, if the primary economic good of the college is the transfer of knowledge from professor to student, why would the college give away the economic good? This is because, the actual primary economic good (especially at notably prestigious colleges) is the college signal, as represented by the grades, the GPA, and the diploma. The primary economic good is NOT the transfer of knowledge. While a college may give free access to classroom content, it will withhold a graduating student's diploma and transcripts until all fees have been paid.
(6) Recently, there has been a focus on the psychology of scarcity and how it impacts people. Sendhil Mullainathan and Eldar Shafir recently published the book Scarcity: Why Having Too Little Means So Much. In the context of professional services companies and potential bad behaviors, billing too much can have a negative impact from an employee poverty standpoint. When we think of poverty, we think of not having enough money. However, the general definition of poverty is not having enough of a needed resource (i.e., resource scarcity). Besides money, another very important resource is time. So if an employee is encouraged to bill at very high levels, in effect, the policy could create time impoverishment. Meaning, because the employee is so focused on current client delivery, they will not have enough time to train, think creatively, develop new solutions, work on higher education, etc. The lack of time will create hyper-focus on this scarcity and drive general unhappiness. Nobel laureate Daniel Kahneman's perspective is:
“Money does not buy you happiness, but lack of money certainly buys you misery.”
One can insert "time" for "money" in this quote. In effect, by encouraging high bill rates, the company would be creating poverty-induced unhappiness for the harried employee and reducing the investment value the employee can bring to the company. To be fair, in the short term, scarcity creates focus by capturing the mind. This may be OK as long as scarcity is only for the short term. However, over a longer period of time, the focus is known to devolve to stress and unhappiness. This may lead to lower employee productivity and attrition. From a solution standpoint, a couple of ideas come to mind:
Have a maximum billing utilization (maybe 85% of a 40-hour week). The remaining time is for other employee development. The time period for the measurement is important. One month of being utilization overdrawn may be short term. However, a full calendar quarter of being utilization overdrawn may be considered long term. To put teeth in it, if an employee goes over, the leadership of all projects the employee is working on will receive a financial penalty and a reprimand.
Capital One Bank has a policy where every other Friday is a no meeting day. Capital One is known for its prowess in testing and implementing customer or employee based solutions based on behavioral testing results. For this policy, the employees are encouraged to use this time for personal development. The employee is given time flexibility, thus decreasing scarcity from a lack of time.
(7) It is helpful to understand the subtle differences between accuracy and precision. If you follow the link, the bullseye metaphor is a great teacher! Also, accuracy and precision are systematically related to bias and noise. For example, if you hear someone say "so and so is directionally correct," they mean so and so is generally accurate/unbiased but may lack precision / is noisy. Also, if you hear someone claiming "so and so measure is biased," they mean so and so lacks accuracy. "Moving the goalposts" describes a common logical fallacy. This is where one changes the standard (goal) of a process while it is still in progress, in such a way that the new goal offers one side an advantage or disadvantage. In effect, the original goal was met with precision but was not accurate. Moving the goalpost may create the appearance of bias, though the bias occurs from changing the standard, not necessarily the outcome of a process.
(8) Using traditional statistical modeling techniques, modelers may encounter two significant and reasonably common problems:
The problem of Dependence - also known as multicollinearity. This is where the so-called independent variables (or features) are not as independent as they should be to rely on the modeling outcome.
The problem of Inertia - also known as serial correlation (autocorrelation), and often found alongside heteroskedasticity (error variance that changes over time). This is where the errors associated with modeled outcomes relate to each other.
By the way, I have no idea why cryptic words like multicollinearity or heteroskedasticity are used in the first place! Maybe someone was getting paid by the syllable …
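Cryptic names aside, both problems are easy to spot with simple checks. The sketch below uses made-up data: two features (labeled `income` and `home_value` purely for illustration) that mostly move together, and a series of model errors where each error carries over 80% of the previous one. The helper names and all of the numbers are assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def lag1_autocorrelation(series):
    """Correlation between each value and the value that came just before it."""
    return pearson(series[:-1], series[1:])

rng = random.Random(11)
n = 500

# Dependence: two "independent" features that mostly move together (hypothetical data)
income = [rng.gauss(0, 1) for _ in range(n)]
home_value = [0.9 * x + 0.3 * rng.gauss(0, 1) for x in income]

# Inertia: each model error carries over 80% of the previous error
errors = [rng.gauss(0, 1)]
for _ in range(n - 1):
    errors.append(0.8 * errors[-1] + rng.gauss(0, 1))

feature_corr = pearson(income, home_value)
error_autocorr = lag1_autocorrelation(errors)
print("correlation between features:", round(feature_corr, 2))
print("lag-1 autocorrelation of errors:", round(error_autocorr, 2))
```

Values near 1 on either check are a signal to stop and ask why before trusting the model. They do not say which solution tool to reach for.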
Statisticians/data scientists certainly have tools to solve these problems ("solution tools"), but this is where judgment comes in. The decision-maker needs to understand why these problems came up in the first place and the extent to which the solution tools are misaligned with the reality they wish to model.
Misapplying solution tools may create an even bigger and harder-to-detect problem: a model stability problem created by model overfit. This occurs when the current model may "fit" the current data set from a technical standpoint, but it is less stable when applied to future populations. Also, model fit problems relate to the difference between correlation and causation. A long-term stable model is more likely to represent causal outcomes. Whereas overfit may be symptomatic of correlation that lacks causation. In my opinion, one of the biggest problems in data science today is a lack of causal clarity.
Personally, I tend to be conservative. Meaning, I prefer a statistical model that may have less current explanatory power as traded off with better long-term explanatory stability.