{"id":1949,"date":"2026-04-29T09:38:34","date_gmt":"2026-04-29T09:38:34","guid":{"rendered":"https:\/\/inphronesys.com\/?p=1949"},"modified":"2026-04-29T09:45:03","modified_gmt":"2026-04-29T09:45:03","slug":"the-folly-of-forecasting","status":"publish","type":"post","link":"https:\/\/inphronesys.com\/?p=1949","title":{"rendered":"The Folly of Forecasting"},"content":{"rendered":"<h2>The Forecasting Urge Is Older Than Science<\/h2>\n<p>The wealthiest monarch in the ancient world consulted the most reliable oracle in the ancient world, received a perfectly accurate prophecy, and was destroyed by it within two years. The story is from Herodotus, but the failure mode is in every supply chain, every demand plan, and every quarterly earnings call published this week.<\/p>\n<p>Around 560 BC, Croesus of Lydia tested seven oracles to determine which was most trustworthy. Delphi passed. So he asked the question that had kept him awake at night: should he march his armies east against the Persian empire of Cyrus the Great? The Pythia answered, as she always did, in hexameter verse. Herodotus records it in <em>Histories<\/em> Book 1.53: if Croesus attacked Persia, a great empire would fall.<\/p>\n<p>He took this as a green light. He marched. The empire that fell was his own. When his envoys returned to reproach the oracle, the priests at Delphi had their answer ready \u2014 recorded in Book 1.91 of the same history. The god had been perfectly accurate. Croesus simply failed to ask <em>which<\/em> empire.<\/p>\n<p>The story is three things at once: a geopolitical disaster, a cautionary tale about motivated reasoning, and an early case study in forecast design. The Pythia&#8217;s oracle was not a prediction. It was a conditional probability statement with plausible deniability built in. Croesus collapsed that probability distribution into a single confident action \u2014 and was destroyed by the tail he had ignored.<\/p>\n<p>The urge Croesus was acting on is not ancient history. It runs through every enterprise planning cycle, every demand forecast, every political pundit on television. Humans do not tolerate uncertainty well. They pay for its removal \u2014 which is why oracles, astrologers, economists, and management consultants have always found customers.<\/p>\n<p>But the Egyptian priests at the Nilometers were doing something categorically different from the Pythia. The nilometer at Elephantine in Aswan and the one at Roda Island \u2014 rebuilt in its current form in AD 861 \u2014 were not prophecy. They were measurement instruments. Priests recorded flood levels annually, calibrated those measurements against harvest yields and grain prices, and built empirical tables linking Nile height to expected productivity. They were practicing data-driven forecasting twelve centuries before the term existed.<\/p>\n<p>The contrast matters. Delphi was prediction-as-theater: unfalsifiable, structured to survive being wrong. The nilometers were forecasting-as-science: measurement, feedback, calibration, revision. Both traditions have their modern descendants. One of them is still practiced in most supply chain departments worldwide. You can probably guess which one.<\/p>\n<p>The question the April 2026 forecasting sprint on this blog has been building toward is not <em>can we forecast<\/em> \u2014 humans always have, and always will. 
The question is <em>which forecasts deserve our trust<\/em>, on what evidence, and what we should do differently when the answer is uncomfortable.<\/p>\n<hr \/>\n<h2>The Science: When Forecasting Actually Got Better<\/h2>\n<p>In 1963, an MIT meteorologist named Edward Lorenz published a paper with one of the most quietly devastating titles in scientific history: &#8220;Deterministic Nonperiodic Flow&#8221; (<em>Journal of the Atmospheric Sciences<\/em>, 20, 130\u2013141). Working with a simplified set of equations representing atmospheric convection, Lorenz discovered that his numerical model produced completely different long-range outcomes from nearly identical starting conditions. A rounding difference of 0.000127 in one variable \u2014 the kind of error that seemed numerically trivial \u2014 compounded until the predicted weather bore no resemblance to the control run.<\/p>\n<p>This is the ceiling. Not a practical limitation to be engineered away, not a problem of insufficient computing power. A mathematical ceiling. In chaotic systems, small errors in initial conditions grow exponentially. Beyond a certain horizon \u2014 roughly ten days for the atmosphere \u2014 deterministic prediction becomes impossible in principle, not just in practice. No amount of intelligence, human or artificial, breaks Lorenz&#8217;s result.<\/p>\n<p>The implications are specific, not general. Lorenz&#8217;s finding applies to systems with sensitive dependence on initial conditions. It does not mean all forecasting is hopeless. It means the theoretical horizon is bounded and domain-specific. To understand how bounded, look at what actually happened to weather forecasting in the decades after 1963.<\/p>\n<p><strong>The quiet revolution.<\/strong> In 2015, Peter Bauer, Alan Thorpe, and Gilbert Brunet published a landmark review in <em>Nature<\/em> documenting what they called &#8220;the quiet revolution of numerical weather prediction&#8221; (Bauer, Thorpe &amp; Brunet, 2015, <em>Nature<\/em> 525, 47\u201355, doi:10.1038\/nature14956). The headline finding: operational weather forecasting had gained roughly <strong>one extra day of skillful forecast lead time per decade since the 1980s<\/strong>. A 6-day forecast in 2014 was approximately as accurate as a 4-day forecast two decades earlier. The skill improvement was steady, reproducible, and grounded in published verification statistics.<\/p>\n<p>That is not a metaphor. That is a measured, peer-reviewed, decades-long improvement in a capability that matters enormously: knowing whether to evacuate a coastline, route shipping around a storm, or ground aircraft four days in advance instead of two.<\/p>\n<p>How did this happen \u2014 in the presence of a theoretical ceiling?<\/p>\n<p>Four conditions, all of which happened to hold simultaneously in numerical weather prediction, explain why this domain got tractable while others didn&#8217;t:<\/p>\n<p><strong>1. Dense observations.<\/strong> The atmosphere is now observed by a global network of weather stations, radiosondes, ocean buoys, commercial aircraft transponders, and \u2014 critically \u2014 a fleet of polar-orbiting and geostationary satellites providing continuous global coverage. The ERA5 reanalysis dataset alone covers the entire globe at hourly resolution from 1940 to the present. Sparse input data is the death of any forecast; the atmosphere is now among the most densely observed systems humans have ever measured.<\/p>
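\n<p>To put that density in numbers, a back-of-envelope calculation in R. The grid spacing and byte count are assumptions (ERA5&#8217;s distributed atmospheric grid is roughly 0.25 degrees; storage depends on format and compression), so treat the output as an order of magnitude rather than a specification.<\/p>\n<pre><code class=\"language-r\"># Back-of-envelope: the ERA5 hourly record for a single surface variable.\n# Assumptions: 0.25-degree regular grid, hourly steps from 1940 to 2025,\n# 4 bytes per value, no compression. Order of magnitude only.\nlat_points &lt;- 721      # 90S to 90N at 0.25-degree spacing\nlon_points &lt;- 1440     # 0 to 359.75 at 0.25-degree spacing\nhours      &lt;- as.numeric(difftime(as.POSIXct(\"2025-01-01\", tz = \"UTC\"),\n                                  as.POSIXct(\"1940-01-01\", tz = \"UTC\"),\n                                  units = \"hours\"))\n\nvalues &lt;- lat_points * lon_points * hours\ncat(sprintf(\"~%.0f billion values (~%.1f TB at 4 bytes each) for one variable\\n\",\n            values \/ 1e9, values * 4 \/ 1e12))\n<\/code><\/pre>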
\n<p><strong>2. Strong physical priors.<\/strong> The atmosphere obeys the Navier-Stokes equations, the laws of thermodynamics, and the equations of radiative transfer. These are not empirical regularities that might change next quarter \u2014 they are the same physics that governed the atmosphere before humans existed. This means that a model trained on 1985 observations can be expected to generalize to 2025 conditions, because the underlying dynamics haven&#8217;t changed. Most social and economic systems have no equivalent.<\/p>\n<p><strong>3. Fast, clean feedback loops.<\/strong> A weather forecast is scored against reality within hours or days. If yesterday&#8217;s 5-day forecast was wrong, you know by tomorrow. This feedback compresses the error-correction cycle in a way that is impossible in, say, long-range macroeconomic forecasting, where you might wait years to learn whether a prediction was right \u2014 and even then dispute the counterfactual.<\/p>\n<p><strong>4. No reflexivity.<\/strong> The atmosphere does not read weather forecasts. Publishing a prediction for Friday&#8217;s storm does not cause people to move to higher ground, which changes where the storm makes landfall, which invalidates the original prediction. Supply chain demand forecasts, election forecasts, and financial market forecasts are all contaminated by reflexivity to varying degrees. Weather is not.<\/p>\n<p>When these four conditions hold, forecasting is genuinely a science, capable of steady, measurable, decade-over-decade progress. The Bauer 2015 result is what scientific forecasting looks like.<\/p>\n<p>When they don&#8217;t hold \u2014 and we will spend the next section establishing how many domains don&#8217;t \u2014 you get something different.<\/p>\n<p><strong>The probabilistic turn.<\/strong> It is worth pausing on one further element of the weather forecasting success story that has direct implications for practitioners in other domains: the shift from deterministic point forecasts to calibrated ensemble forecasts. Lorenz&#8217;s result didn&#8217;t just impose a ceiling \u2014 it provided a diagnostic. If small perturbations in initial conditions cause large divergence in outcomes, then running the model multiple times with slightly varied starting states reveals the distribution of plausible futures rather than hiding uncertainty behind a single track. ECMWF&#8217;s operational ensemble system (ENS) runs 51 members twice daily. When the ensemble spread is tight, the atmosphere is in a predictable state and confidence is warranted. When the spread is wide, the atmosphere is genuinely chaotic at the relevant horizon and even the best model can tell you little about which specific outcome will materialize \u2014 but the wide spread itself is useful information. A logistics manager who sees a wide ensemble spread on Friday&#8217;s temperature forecast knows to maintain flexible staffing rather than pre-committing to a single scenario.<\/p>\n<p>This is the calibration ideal that the best human forecasters later reproduced. Tetlock&#8217;s superforecasters, the Good Judgment Project, and Kahneman&#8217;s noise framework all converge on the same point: explicit uncertainty quantification is not a sign of forecasting weakness. It is the signature of forecasting competence. A point forecast with no confidence interval is almost always less honest \u2014 and less useful \u2014 than a wide probability distribution delivered with clear acknowledgment of its limitations.<\/p>
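\n<p>A minimal sketch of that logic in R: simulate a toy 51-member ensemble for one variable (random walks standing in for real ensemble members; the member count, horizon, and spread threshold are illustrative, not ECMWF output) and turn the spread at each lead time into a commit-or-stay-flexible signal.<\/p>\n<pre><code class=\"language-r\"># Toy ensemble: 51 simulated members for a single forecast variable.\n# Purely illustrative -- no relation to operational ENS data.\nset.seed(42)\nn_members &lt;- 51\nhorizon   &lt;- 10   # lead time in days\n\n# Each member drifts from the same analysis with its own accumulated error\nensemble &lt;- sapply(seq_len(n_members), function(i) {\n  cumsum(rnorm(horizon, mean = 0, sd = 1.5)) + 20   # e.g. temperature in C\n})\n\n# Spread (10th to 90th percentile width) at each lead time\nspread &lt;- apply(ensemble, 1, function(x) diff(quantile(x, probs = c(0.1, 0.9))))\n\n# Decision rule: commit when the spread is narrow, keep options open when it is wide\ndata.frame(\n  lead_day = seq_len(horizon),\n  p10      = round(apply(ensemble, 1, quantile, probs = 0.1), 1),\n  p90      = round(apply(ensemble, 1, quantile, probs = 0.9), 1),\n  spread   = round(spread, 1),\n  decision = ifelse(spread &lt; 3, \"commit to a single plan\", \"keep options open\")\n)\n<\/code><\/pre>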
\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/folly_ecmwf_skill_evolution-3.png\" alt=\"ECMWF skill score evolution showing roughly 1 day of skill gained per decade \u2014 illustrative, based on Bauer, Thorpe &amp; Brunet (2015) and the ECMWF skill score archive\" \/><\/p>\n<hr \/>\n<h2>The Persistent Failures: When Forecasting Stays Bad<\/h2>\n<p>In the mid-1980s, Philip Tetlock began the study that would become the most important empirical investigation of expert forecasting ever conducted. Over twenty years, he tracked the predictions of 284 experts \u2014 political scientists, economists, foreign-policy analysts \u2014 accumulating more than <strong>80,000 forecasts<\/strong> across domains ranging from elections to economic crises to geopolitical conflicts. He published the results in 2005 as <em>Expert Political Judgment: How Good Is It? How Can We Know?<\/em> (Princeton University Press).<\/p>\n<p>The finding is now famous, although the famous version overstates it. The popular soundbite says experts performed worse than dart-throwing chimpanzees; Tetlock has spent the years since repeatedly correcting that. The accurate version is harder to dismiss and more useful: on long-range political and economic questions, the average expert was roughly as accurate as chance, not worse than the proverbial chimpanzee but not reliably better either. The credentials did not help. The television appearances did not help. If anything, they hurt: the most famous experts, those most in demand by broadcasters precisely because they expressed confident, memorable views, performed worst of all.<\/p>\n<p>But two details from Tetlock&#8217;s data are often missed. First, the performance gap between good and bad forecasters was not random: it correlated reliably with epistemic style. Hedgehogs \u2014 those who organized their worldview around one big idea and applied it confidently \u2014 lost systematically to foxes, who drew on many small frameworks, quantified their uncertainty, and updated willingly when evidence contradicted them. Confidence was not a predictor of accuracy; calibration was.<\/p>\n<p>Second, and critically for practitioners: the failure was not evenly distributed across domains. Political forecasting, long-range economic forecasting, and social trend prediction all sat at the chimp-level end of the spectrum. These are exactly the domains that violate the four conditions outlined earlier. For a deeper look at the forecasters who do manage to beat the baseline \u2014 and why \u2014 see the post in this series on <a href=\"https:\/\/inphronesys.com\/?p=1896\">The 20 Most Influential People in Forecasting<\/a>, which profiles both Tetlock&#8217;s superforecasters and the scholars who explain why most experts fail. For the specific failure mode of over-relying on algorithmic outputs in supply chain contexts, see <a href=\"https:\/\/inphronesys.com\/?p=1886\">When the Algorithm Is Wrong and the Expert Is Right<\/a>.<\/p>\n<p><strong>The M5 benchmark.<\/strong> The M-competitions, organized by Spyros Makridakis across four decades, are the most rigorous empirical test of forecasting methods on real-world time series data.<\/p>
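\n<p>One of the benchmarks used in these competitions is worth seeing in code before the results, because it sets the bar the fancier methods have to clear. A seasonal naive forecast repeats the observation from one seasonal cycle earlier and fits nothing at all. A minimal sketch with an invented monthly series follows; the snaive() function in the forecast package does the same thing in one call.<\/p>\n<pre><code class=\"language-r\"># Seasonal naive: the forecast for month t+k is the observation from month t+k-12.\n# Toy monthly demand series, purely illustrative.\ndemand &lt;- ts(c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,\n               115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140),\n             frequency = 12, start = c(2024, 1))\n\nseason_length &lt;- frequency(demand)   # 12 for monthly data\nh &lt;- 6                               # forecast horizon in months\n\n# Repeat the most recent full seasonal cycle\nlast_cycle      &lt;- tail(as.numeric(demand), season_length)\nsnaive_forecast &lt;- last_cycle[seq_len(h)]\nsnaive_forecast\n\n# Equivalent one-liner with the forecast package:\n# forecast::snaive(demand, h = 6)\n<\/code><\/pre>\n<p>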
The M5 competition, run on 42,840 hierarchical Walmart time series in 2020, produced a conclusion that supply chain practitioners often find uncomfortable: only <strong>35.8% of the 5,507 participating teams beat Seasonal Naive<\/strong> \u2014 a baseline that requires no statistical training, no model fitting, and no hyperparameter tuning. The full breakdown, including the distribution of methods and why simple beats fancy in expectation, is covered in <a href=\"https:\/\/inphronesys.com\/?p=1874\">The M5 Lesson: Why Simple Still Beats Fancy in Supply Chain Forecasting<\/a>.<\/p>\n<p>The M5 result is not a fluke. Makridakis has now documented this pattern across four competitions and multiple decades. The average professional demand forecast, using methods that were carefully selected and tuned by practitioners, loses to a naive baseline that a first-year student could implement in twenty minutes.<\/p>\n<p><strong>The AI baseline test.<\/strong> This April, we ran a direct comparison that produced an even sharper illustration. A classical Vector Autoregression model \u2014 VAR, a statistical framework that dates to the early 1980s \u2014 was tested against TimesFM, Google&#8217;s large foundation model for time series forecasting, on a realistic supply chain demand dataset. The result: <strong>VAR achieved RMSE 8.76 versus TimesFM&#8217;s RMSE 14.54 at the T+1 horizon<\/strong> \u2014 TimesFM performing 66% worse. The full methodological breakdown, including the code, is in <a href=\"https:\/\/inphronesys.com\/?p=1916\">Why VAR Beat Google&#8217;s TimesFM \u2014 and How to Build One in R<\/a>.<\/p>\n<p>The reasons these domains fail where weather succeeds are not incidental. They are structural, and they map directly onto the four conditions:<\/p>\n<ul>\n<li><strong>Observations are sparse and biased.<\/strong> Retail point-of-sale data is contaminated by promotions, stockouts, and substitution. Political polls undersample populations by definition. Financial market microstructure data is noisy with short-term liquidity dynamics that have nothing to do with long-run value.<\/li>\n<li><strong>Physical priors are weak or absent.<\/strong> Human purchasing behavior, unlike atmospheric convection, has no governing equation derived from first principles. Structural models exist, but they are approximations of approximations, and the underlying mechanisms change over time.<\/li>\n<li><strong>Feedback is slow and contaminated.<\/strong> A quarterly sales forecast might be evaluated against actuals six months later \u2014 by which point the market, the competitive landscape, and the promotional environment have all changed. Attributing forecast error to model failure vs. structural change is genuinely difficult.<\/li>\n<li><strong>Reflexivity is pervasive.<\/strong> The most insidious failure mode. A demand forecast shared with a supplier triggers a production run, which creates availability, which stimulates demand, which validates the forecast not because the model was accurate but because the forecast became self-fulfilling \u2014 or, if the supplier overproduces and promotes heavily to clear inventory, self-defeating. There is no equivalent in meteorology.<\/li>\n<\/ul>\n<p><strong>A brief note on wisdom of crowds.<\/strong> Francis Galton, attending the 1906 West of England Fat Stock and Poultry Exhibition in Plymouth, observed a competition to guess the weight of a dressed ox. 
He collected 787 estimate cards after the contest and found the crowd&#8217;s median guess was 1,207 lb against an actual weight of 1,198 lb \u2014 a gap of 0.8% (Galton, 1907, <em>Nature<\/em> 75, 450\u2013451). Collective intelligence is real and can be impressive \u2014 provided estimates are <em>independent<\/em>. Herding, groupthink, and information cascades destroy crowd wisdom just as surely as bias destroys expert judgment. In most organizational forecasting processes, independence is exactly what gets sacrificed in the name of alignment. The meeting that ends with consensus has often produced a single estimate repeated by everyone who was present.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/folly_domain_skill_schematic-3.png\" alt=\"Schematic \u2014 forecast skill bands by domain, illustrative\" \/><\/p>\n<hr \/>\n<h2>The AI Moment: Real Breakthroughs, Real Limits<\/h2>\n<p>The last three years have produced genuine, peer-reviewed, operationally deployed breakthroughs in AI-driven forecasting. They deserve clear-eyed acknowledgment \u2014 not the breathless coverage that makes every model sound like a revolution, and not the dismissive cynicism that refuses to credit demonstrated progress. The breakthroughs are real. So are the limits. Both matter.<\/p>\n<p><strong>GraphCast.<\/strong> In December 2023, DeepMind published results for GraphCast (Lam et al., 2023, <em>Science<\/em> 382, 1416\u20131421, doi:10.1126\/science.adi2336). The headline numbers: GraphCast outperforms ECMWF&#8217;s HRES deterministic forecast \u2014 the global benchmark system operated by the European Centre for Medium-Range Weather Forecasts \u2014 on <strong>90% of 1,380 verification targets<\/strong>. It generates a 10-day global forecast in <strong>under one minute on a single TPU<\/strong>. The same computation on traditional numerical weather prediction infrastructure would have taken thousands of CPU-hours and specialized supercomputer time. GraphCast is a graph neural network trained on four decades of ERA5 reanalysis data. It is not a physics model \u2014 it learned the dynamics of the atmosphere from observation rather than from governing equations \u2014 and it works.<\/p>\n<p><strong>Pangu-Weather.<\/strong> Simultaneously, Huawei&#8217;s research team published Pangu-Weather (Bi et al., 2023, <em>Nature<\/em> 619, 533\u2013538, doi:10.1038\/s41586-023-06185-3), which achieved something the GraphCast paper was careful to frame more precisely: Pangu-Weather was the <strong>first AI system to outperform traditional NWP across all variables and lead times from 1 hour to 7 days<\/strong>. This was the first demonstration of end-to-end superiority over physics-based forecasting across the full operational spectrum, not just selected metrics.<\/p>\n<p><strong>Aurora.<\/strong> Microsoft Research extended the same architecture to a broader class of Earth system problems with Aurora (Bodnar et al., 2024, arXiv:2405.13063) \u2014 a <strong>1.3 billion parameter<\/strong> foundation model trained on over a million hours of diverse atmospheric and environmental data. Aurora predicts not only weather but air pollution, ocean wave heights, and other Earth system variables within a single model. Its reported efficiency advantage is striking: the paper documents <strong>roughly a \u00d75,000 speedup over IFS<\/strong> \u2014 the Integrated Forecast System operated by ECMWF \u2014 for comparable prediction tasks. 
A single model, trained once, doing in minutes what the world&#8217;s most sophisticated physics-based forecast system does in hours.<\/p>\n<p><strong>ECMWF AIFS.<\/strong> The institutional response from the world&#8217;s most respected operational forecasting centre is perhaps the clearest signal that this transition is real rather than theoretical. ECMWF&#8217;s Artificial Intelligence Forecasting System Single (AIFS Single) 1.0 became <strong>operational on 25 February 2025<\/strong>, as announced in ECMWF Newsletter 183 (ECMWF, 2025, &#8220;EarthCARE and the AIFS Single&#8221;). ECMWF \u2014 the organization that has operated the gold-standard physics-based medium-range forecast for fifty years \u2014 now runs an AI model operationally alongside it. That is not a press release. That is deployment.<\/p>\n<p>These are not all the same development. GraphCast and Pangu-Weather demonstrated that AI could match and beat operational NWP on standard verification metrics. Aurora demonstrated that a single foundation model could generalize across Earth system domains with extraordinary computational efficiency. ECMWF&#8217;s operational deployment demonstrated that the institution with the most demanding forecasting standards in the world considered the technology mature enough to stake part of its operational output on it. Taken together, as of late 2025, they constitute a genuine paradigm shift in the field.<\/p>\n<p><strong>Why these breakthroughs happened here.<\/strong> It is not coincidental that AI weather forecasting produced results that AI demand forecasting, AI political forecasting, and AI financial forecasting have not. The same four conditions that enabled the quiet revolution in numerical weather prediction are what enabled the AI revolution in weather forecasting. Dense, high-quality observations provided the training data. Strong physical priors were embedded in the reanalysis datasets the models trained on. Fast feedback loops enabled rigorous validation. No reflexivity meant the models could be tested against ground truth without worrying that the forecast had contaminated the outcome.<\/p>\n<p>AI is compressing the available gains in physics-rich, well-observed, non-reflexive domains. That is genuinely remarkable. It is not the same thing as breaking the underlying mathematical constraints that make other domains hard.<\/p>\n<p><strong>The transplant problem.<\/strong> The same techniques have been applied to general time series \u2014 TimeGPT from Nixtla, TimesFM from Google, Chronos from Amazon \u2014 with considerably more modest results. As we showed in the VAR comparison cited above, a 40-year-old classical statistical method outperformed a large foundation model on a standard supply chain demand task by a factor of more than 1.5 in RMSE. This is not a criticism of the model developers; it reflects a structural reality. A large language model of the atmosphere works because the atmosphere has grammar \u2014 governing equations, physical conservation laws, stable dynamics. A comparable model of retail demand is working in a language with less stable grammar, more reflexivity, and far sparser training signal per prediction.<\/p>\n<p>Lorenz&#8217;s 1963 result has a corollary that is less often quoted: the theoretical ceiling on predictability is domain-specific. For the atmosphere, the ceiling is around ten days. For supply chain demand in a novel product category during a promotional period, the ceiling may be measured in hours. No architecture changes that.<\/p>
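\n<p>The sensitivity behind that ceiling is easy to reproduce. The sketch below integrates Lorenz&#8217;s 1963 three-variable system twice in base R, starting from initial conditions that differ by the same 0.000127 mentioned earlier, and reports when the two runs stop resembling each other. The integration scheme, step size, and divergence threshold are illustrative choices, not values from the paper.<\/p>\n<pre><code class=\"language-r\"># Lorenz (1963) system, integrated twice from almost identical starting points.\n# Classic parameters: sigma = 10, rho = 28, beta = 8\/3. Simple Euler steps.\nlorenz_step &lt;- function(state, dt = 0.005, sigma = 10, rho = 28, beta = 8 \/ 3) {\n  x &lt;- state[1]; y &lt;- state[2]; z &lt;- state[3]\n  c(x + dt * sigma * (y - x),\n    y + dt * (x * (rho - z) - y),\n    z + dt * (x * y - beta * z))\n}\n\nrun &lt;- function(x0, n_steps = 4000) {\n  out &lt;- matrix(NA_real_, nrow = n_steps, ncol = 3)\n  state &lt;- x0\n  for (i in seq_len(n_steps)) {\n    state &lt;- lorenz_step(state)\n    out[i, ] &lt;- state\n  }\n  out\n}\n\na &lt;- run(c(1.0, 1.0, 1.0))\nb &lt;- run(c(1.000127, 1.0, 1.0))   # a rounding-sized difference in one variable\n\ndivergence &lt;- sqrt(rowSums((a - b)^2))\nwhich(divergence &gt; 1)[1]   # first step at which the two runs differ by more than one unit\n<\/code><\/pre>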
\n<p><strong>The honest claim.<\/strong> AI will continue to improve forecasting in domains where the four conditions hold. Weather, ocean state, air quality, wildfire spread, flood inundation \u2014 these are all tractable with the right data and the right physical priors, and AI is accelerating progress on all of them. But AI will not deliver 100% certainty about the future in any domain. That is a mathematical impossibility, not a research roadmap. The practitioners who understand this distinction \u2014 who use AI weather models to get a better storm track but don&#8217;t treat a six-day ensemble forecast as a guarantee \u2014 are applying the technology correctly. Those who expect the same performance from an AI demand forecaster or an AI political analyst are repeating Croesus&#8217;s mistake with better branding.<\/p>\n<p>The question, then, is not how to find a better oracle. It is how to build a decision system that survives without one.<\/p>\n<hr \/>\n<h2>How to Survive the Forecasting Game<\/h2>\n<p>The argument so far has a defeatist reading that I want to explicitly reject. The fact that forecasting has hard limits in most domains does not mean you should stop forecasting. It means you should stop expecting forecasting to do what it cannot do \u2014 and redirect that energy toward three moves that actually change outcomes.<\/p>\n<p><strong>Move 1: Match the method to the domain.<\/strong><\/p>\n<p>The worst forecasting errors are not model failures. They are category errors: applying a tool designed for one type of problem to a fundamentally different type of problem. Using a statistical model calibrated on a stable mature product to forecast a new product launch. Applying a weather-forecasting confidence interval to a four-year macroeconomic projection. Trusting a point forecast for a fat-tailed distribution.<\/p>\n<p>The practical framework for avoiding this \u2014 choosing the right metric, the right method, and the right validation approach for the structure of the problem you&#8217;re actually facing \u2014 is covered in detail in <a href=\"https:\/\/inphronesys.com\/?p=1842\">Is Your Forecast Any Good? The Forecaster&#8217;s Toolbox<\/a>. The key insight is that &#8220;which model is most accurate&#8221; is almost always the wrong question. The right question is &#8220;what are the structural properties of this forecasting task, and what class of method is appropriate for those properties?&#8221; Classical methods like exponential smoothing \u2014 <a href=\"https:\/\/inphronesys.com\/?p=1793\">three equations from the US Navy in 1957<\/a>, still outperforming many of their successors \u2014 remain appropriate for the vast majority of supply chain forecasting problems precisely because they make modest assumptions that match the modest information content in most demand signals. For most stable mature SKUs without heavy promotional activity or structural breaks, exponential smoothing will match or beat a foundation model at a fraction of the complexity cost. The M5 result and the VAR-vs-TimesFM comparison cited earlier both say so.<\/p>\n<p>Calibration training matters here. Tetlock&#8217;s superforecasters \u2014 the top performers in the Good Judgment Project&#8217;s IARPA-funded tournaments \u2014 did not win by being smarter or better-informed. They won by expressing calibrated uncertainty: saying &#8220;70% probability&#8221; when they meant 70%, not 90%, and tracking their accuracy to close the feedback loop. Decision hygiene, not model sophistication, was the distinguishing factor.<\/p>
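\n<p>Closing that feedback loop is itself only a few lines of code. The sketch below takes a hypothetical log of probability forecasts and resolved outcomes (the column names and data are invented for illustration), computes a Brier score, and buckets the forecasts by stated probability so that stated confidence can be compared with observed frequency.<\/p>\n<pre><code class=\"language-r\"># Calibration check for a log of probability forecasts.\n# forecast_log is a hypothetical data frame: one row per resolved forecast,\n# prob = stated probability of the event, outcome = 1 if it happened, else 0.\nset.seed(7)\nforecast_log &lt;- data.frame(\n  prob    = round(runif(200, 0.05, 0.95), 2),\n  outcome = rbinom(200, 1, 0.5)   # toy outcomes; replace with real resolutions\n)\n\n# Brier score: mean squared gap between stated probability and outcome (lower is better)\nbrier &lt;- mean((forecast_log$prob - forecast_log$outcome)^2)\n\n# Calibration table: within each 10-point bucket, how often did the event occur?\nforecast_log$bucket &lt;- cut(forecast_log$prob, breaks = seq(0, 1, 0.1))\ncalibration &lt;- aggregate(outcome ~ bucket, data = forecast_log,\n                         FUN = function(x) round(mean(x), 2))\ncalibration$n &lt;- as.vector(table(forecast_log$bucket)[as.character(calibration$bucket)])\n\nbrier\ncalibration   # a well-calibrated forecaster sees observed frequency track the bucket midpoint\n<\/code><\/pre>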
\n<p><strong>Move 2: Reduce noise before chasing accuracy.<\/strong><\/p>\n<p>In 2021, Daniel Kahneman, Olivier Sibony, and Cass Sunstein published <em>Noise: A Flaw in Human Judgment<\/em> (Little, Brown Spark). The book&#8217;s central claim is empirically documented and organizationally important: in most decision systems, the variance between different judges evaluating the same case \u2014 noise \u2014 exceeds the systematic bias of any individual judge. Insurance underwriters, bail judges, radiologists, performance reviewers: when you give the same case to two different people, you get surprisingly different answers, and neither is reliably closer to ground truth.<\/p>\n<p>This matters for forecasting because most practitioners&#8217; attention goes to bias correction and model selection, while noise goes unaddressed. Two demand planners at the same company, given the same historical data, will produce materially different forecasts \u2014 not because one is biased and the other isn&#8217;t, but because forecasting involves judgment calls at every stage (which outliers to include, how to handle the promotional lift, how to weight the sales team&#8217;s input), and those judgment calls have high variance.<\/p>\n<p>The Kahneman-Sibony-Sunstein prescription is <em>decision hygiene<\/em> rather than judgment suppression: structured protocols, independent estimates before group discussion, calibration tracking, explicit uncertainty quantification. These are not exotic technologies. They are the forecasting equivalent of handwashing \u2014 cheap, unfashionable, and demonstrably effective.<\/p>\n<p>The implication for supply chain teams is direct: before you spend three months selecting and implementing a new forecasting model, audit the variance in your current forecasts. If two planners working the same SKU produce forecasts that differ by more than your forecast error target, you have a noise problem that no model will solve. Structured process changes will outperform model upgrades at a fraction of the cost. A minimal sketch of such an audit follows at the end of this section.<\/p>\n<p>The most effective noise-reduction interventions documented in the literature are also the most organizationally uncomfortable: independent forecasts prepared before any group discussion, explicit written justifications for adjustments to statistical baselines, structured override logs that make algorithmic-versus-human adjustments visible and auditable. These are uncomfortable because they make individual judgment legible and accountable in ways that informal forecasting processes do not. But that visibility is exactly the mechanism. When planners know that their overrides are tracked against outcomes, the noise in those overrides falls \u2014 not because people become less biased but because they become more deliberate. The goal is not to suppress judgment; it is to make judgment auditable enough that you can learn from it. That learning loop, applied consistently, is more durable than any model selection decision.<\/p>
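\n<p>What such an audit can look like in R: have several planners forecast the same SKUs independently from the same history, then compare the spread between planners with the error you are prepared to tolerate. The data frame, planner count, and tolerance below are invented for illustration.<\/p>\n<pre><code class=\"language-r\"># Minimal noise audit: several planners forecast the same SKUs independently.\n# All numbers are invented for illustration.\naudit &lt;- data.frame(\n  sku      = rep(c(\"SKU-001\", \"SKU-002\", \"SKU-003\"), each = 3),\n  planner  = rep(c(\"A\", \"B\", \"C\"), times = 3),\n  forecast = c(980, 1120, 1050,    # SKU-001\n               410,  365,  455,    # SKU-002\n               220,  305,  260)    # SKU-003\n)\n\nerror_target &lt;- 0.15   # tolerated forecast error, as a share of the mean forecast\n\nnoise_by_sku &lt;- aggregate(forecast ~ sku, data = audit, FUN = function(x) {\n  c(mean = mean(x), spread = (max(x) - min(x)) \/ mean(x))\n})\nnoise_by_sku &lt;- do.call(data.frame, noise_by_sku)\nnames(noise_by_sku) &lt;- c(\"sku\", \"mean_forecast\", \"relative_spread\")\n\n# SKUs where planner disagreement already exceeds the error target:\n# for these, fixing the process matters more than changing the model.\nnoise_by_sku$noise_problem &lt;- noise_by_sku$relative_spread &gt; error_target\nnoise_by_sku\n<\/code><\/pre>\n<p><strong>Move 3: Build for being wrong.<\/strong><\/p>\n<p>Nassim Nicholas Taleb&#8217;s 2012 book <em>Antifragile: Things That Gain from Disorder<\/em> (Random House) offers what is perhaps the most practically useful reframe for practitioners who have internalized the limits described in this post. 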
Taleb&#8217;s argument is not that forecasting is useless \u2014 it is that the <em>payoff function<\/em> of a decision system matters more than the accuracy of the forecast that feeds it.<\/p>\n<p>A fragile system is one whose downside from forecast error is unbounded while the upside is bounded. A robust system caps the downside regardless of forecast accuracy. An antifragile system actually benefits from volatility and disorder \u2014 its payoff is convex to uncertainty. The goal of supply chain design, inventory strategy, and organizational planning should not be to eliminate forecast error but to design systems that are not destroyed by it.<\/p>\n<p>In concrete terms: single-source dependencies are fragile because a supply disruption that was not forecast triggers a cascading failure. Modular sourcing with multiple suppliers is robust because the magnitude of any single forecast error is capped. Options-based capacity contracting \u2014 paying a premium for the right but not the obligation to expand \u2014 is antifragile because it benefits from demand volatility that would destroy a fixed-capacity competitor.<\/p>\n<p>The practical test for any supply chain decision is the payoff asymmetry question: if my demand forecast is wrong by 30% on the high side, what is my loss? If it is wrong by 30% on the low side, what is my loss? If the downside of under-forecasting is catastrophically larger than the downside of over-forecasting \u2014 or vice versa \u2014 the system is fragile to forecast error regardless of how accurate the forecast is on average. The right response is not to try to make the forecast more accurate; it is to restructure the decision so that the error cost is more symmetric. Holding more safety stock, shortening replenishment cycles, building in postponement capability \u2014 these are all moves that trade some efficiency for convexity in the payoff function. They are insurance that pays off specifically when the forecast fails, which is exactly when you most need it.<\/p>\n<p>This is not a counsel of pessimism. It is a design principle that, once internalized, frees the forecasting function from the impossible obligation of predicting the future accurately and redirects it toward the achievable obligation of producing useful probability statements that decision-makers use without collapsing them into false certainties.<\/p>\n<p>Croesus did not fail because the oracle was wrong. He failed because he confused a probability statement with a promise. 
Every forecasting failure since has been a variation on the same mistake.<\/p>\n<hr \/>\n<h2>Interactive Dashboard<\/h2>\n<p>The Forecast Trust Calculator below lets you select a domain, a forecast horizon, and a method, then returns the trust band supported by the evidence in this post, a decision recommendation, and the failure modes most likely to break the forecast.<\/p>\n<div class=\"dashboard-link\" style=\"margin: 2em 0; padding: 1.5em; background: #f8f9fa; border-left: 4px solid #0073aa; border-radius: 4px;\">\n<p style=\"margin: 0 0 0.5em 0; font-size: 1.1em;\"><strong>Interactive Dashboard<\/strong><\/p>\n<p style=\"margin: 0 0 1em 0;\">Explore the data yourself \u2014 adjust parameters and see the results update in real time.<\/p>\n<p><a style=\"display: inline-block; padding: 0.6em 1.2em; background: #0073aa; color: #fff; text-decoration: none; border-radius: 4px; font-weight: bold;\" href=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/2026-04-29_Folly_of_Forecasting_dashboard-3.html\" target=\"_blank\" rel=\"noopener\">Open Interactive Dashboard \u2192<\/a><\/p>\n<\/div>\n<hr \/>\n<details>\n<summary><strong>Show R Code<\/strong><\/summary>\n<pre><code class=\"language-r\"># =============================================================================\n# generate_folly_of_forecasting_images.R\n# April 2026 capstone: \"The Folly of Forecasting\"\n#\n# Run from the 08_Blog_Site\/ project root:\n#   Rscript Scripts\/generate_folly_of_forecasting_images.R\n#\n# Produces:\n#   https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/folly_ecmwf_skill_evolution-3.png   (800 x 500 px, white bg)\n#   https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/folly_domain_skill_schematic-3.png  (800 x 500 px, white bg)\n# =============================================================================\n\nsource(\"Scripts\/theme_inphronesys.R\")\n\nlibrary(ggplot2)\nlibrary(dplyr)\n\n# -----------------------------------------------------------------------------\n# CHART 1: ECMWF Northern Hemisphere 500 hPa ACC -- Illustrative Evolution\n# Source: Bauer, Thorpe &amp; Brunet (2015), Nature 525, 47-55\n# These are stylised trajectories capturing the documented ~1-day-per-decade\n# skill gain. 
They are NOT read from primary data tables.\n# Caption labels this explicitly as \"Illustrative\".\n# -----------------------------------------------------------------------------\n\necmwf_data &lt;- data.frame(\n  year = rep(c(1981, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020, 2023), 4),\n  horizon = factor(\n    rep(c(\"3-day\", \"5-day\", \"7-day\", \"10-day\"), each = 10),\n    levels = c(\"3-day\", \"5-day\", \"7-day\", \"10-day\")\n  ),\n  acc = c(\n    # 3-day: starts high, saturates by ~2005\n    0.84, 0.87, 0.90, 0.92, 0.94, 0.95, 0.96, 0.97, 0.98, 0.98,\n    # 5-day: from ~0.63 to ~0.93\n    0.63, 0.68, 0.73, 0.78, 0.82, 0.85, 0.87, 0.90, 0.92, 0.93,\n    # 7-day: crosses ~0.60 threshold around 2000\n    0.40, 0.45, 0.51, 0.56, 0.61, 0.66, 0.70, 0.75, 0.79, 0.82,\n    # 10-day: approaches useful skill by the 2020s\n    0.20, 0.25, 0.30, 0.36, 0.42, 0.47, 0.52, 0.57, 0.63, 0.67\n  )\n)\n\nhorizon_colors &lt;- c(\n  \"3-day\"  = iph_colors$blue,\n  \"5-day\"  = iph_colors$teal,\n  \"7-day\"  = iph_colors$orange,\n  \"10-day\" = iph_colors$red\n)\n\np1 &lt;- ggplot(ecmwf_data, aes(x = year, y = acc, color = horizon, group = horizon)) +\n  geom_hline(yintercept = 0.60, linetype = \"dashed\",\n             color = iph_colors$grey, linewidth = 0.5, alpha = 0.8) +\n  annotate(\"text\", x = 1982, y = 0.625,\n           label = \"Useful skill threshold (ACC ~ 0.60)\",\n           hjust = 0, vjust = 0, size = 3.0,\n           color = iph_colors$grey, family = \"Inter\") +\n  geom_line(linewidth = 1.1, alpha = 0.9) +\n  geom_point(size = 2.2, alpha = 0.9) +\n  scale_color_manual(values = horizon_colors, name = \"Forecast horizon\") +\n  scale_x_continuous(breaks = seq(1985, 2020, 5), expand = expansion(add = 1)) +\n  scale_y_continuous(limits = c(0.15, 1.02), breaks = seq(0.2, 1.0, 0.2),\n                     labels = function(x) sprintf(\"%.1f\", x),\n                     expand = expansion(mult = 0)) +\n  labs(\n    title    = \"ECMWF Northern Hemisphere Forecast Skill, 1981-2023\",\n    subtitle = \"500 hPa anomaly correlation coefficient (ACC) by forecast horizon\",\n    x        = NULL,\n    y        = \"Anomaly Correlation Coefficient (ACC)\",\n    caption  = \"Illustrative -- based on published ECMWF skill score trends\\nSource: Bauer, Thorpe &amp; Brunet (2015), Nature 525, 47-55, doi:10.1038\/nature14956\"\n  ) +\n  theme_inphronesys(base_size = 13, grid = \"y\") +\n  theme(legend.position = \"right\")\n\nggsave(\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/folly_ecmwf_skill_evolution-3.png\",\n       plot = p1, width = 8, height = 5, dpi = 100, bg = \"white\")\n\n# -----------------------------------------------------------------------------\n# CHART 2: Forecast Skill by Domain -- Schematic (tile \/ cell grid)\n# Each domain occupies one or two qualitative skill-band cells.\n# NO pseudo-quantitative axis values. Title and caption say \"Schematic\".\n# Sources: Bauer 2015 \/ Makridakis M5 \/ Tetlock 2005 \/ Taleb 2007\n# -----------------------------------------------------------------------------\n\ndomains_ordered &lt;- c(\n  \"Financial Markets\\n(Taleb 2007)\",\n  \"Political Events\\n(Tetlock 2005)\",\n  \"Demand \/ Supply Chain\\n(Makridakis M5)\",\n  \"Weather (3-7 day)\\n(Bauer et al. 
2015)\"\n)\nbands_ordered &lt;- c(\"Near-random\", \"Low\", \"Moderate\", \"High\")\n\nactive_cells &lt;- data.frame(\n  domain = factor(c(\n    \"Financial Markets\\n(Taleb 2007)\",     \"Financial Markets\\n(Taleb 2007)\",\n    \"Political Events\\n(Tetlock 2005)\",    \"Political Events\\n(Tetlock 2005)\",\n    \"Demand \/ Supply Chain\\n(Makridakis M5)\", \"Demand \/ Supply Chain\\n(Makridakis M5)\",\n    \"Weather (3-7 day)\\n(Bauer et al. 2015)\", \"Weather (3-7 day)\\n(Bauer et al. 2015)\"\n  ), levels = domains_ordered),\n  band = factor(c(\n    \"Near-random\", \"Low\",       # Markets\n    \"Near-random\", \"Low\",       # Politics\n    \"Low\",         \"Moderate\",  # Demand\n    \"Moderate\",    \"High\"       # Weather\n  ), levels = bands_ordered)\n)\n\nfull_grid &lt;- expand.grid(\n  domain = factor(domains_ordered, levels = domains_ordered),\n  band   = factor(bands_ordered,   levels = bands_ordered),\n  stringsAsFactors = FALSE\n)\nactive_cells$active &lt;- TRUE\nfull_grid &lt;- merge(full_grid, active_cells, by = c(\"domain\", \"band\"), all.x = TRUE)\nfull_grid$active[is.na(full_grid$active)] &lt;- FALSE\n\ndomain_colors &lt;- c(\n  \"Financial Markets\\n(Taleb 2007)\"              = iph_colors$grey,\n  \"Political Events\\n(Tetlock 2005)\"             = iph_colors$red,\n  \"Demand \/ Supply Chain\\n(Makridakis M5)\"       = iph_colors$orange,\n  \"Weather (3-7 day)\\n(Bauer et al. 2015)\"       = iph_colors$blue\n)\nfull_grid$fill &lt;- ifelse(full_grid$active,\n                          domain_colors[as.character(full_grid$domain)],\n                          iph_colors$lightgrey)\nfull_grid$alpha_val &lt;- ifelse(full_grid$active, 0.85, 0.5)\n\nband_bg &lt;- data.frame(\n  band   = factor(bands_ordered, levels = bands_ordered),\n  bg_fill = c(\"#fef2f2\", \"#fff7ed\", \"#f0fdf4\", \"#eff6ff\")\n)\nfull_grid &lt;- merge(full_grid, band_bg, by = \"band\")\n\np2 &lt;- ggplot(full_grid, aes(x = band, y = domain)) +\n  geom_tile(aes(fill = bg_fill), width = 1,\n            height = length(domains_ordered) + 0.5, alpha = 0.6) +\n  geom_tile(aes(fill = fill, alpha = alpha_val),\n            width = 0.88, height = 0.72, color = \"white\", linewidth = 1.2) +\n  scale_fill_identity() + scale_alpha_identity() +\n  scale_x_discrete(position = \"top\") +\n  labs(\n    title    = \"Forecast Skill by Domain -- Schematic\",\n    subtitle = \"Schematic -- illustrative based on cited sources.\\nSkill bands are qualitative, not quantitative.\",\n    x        = \"Typical Forecast Skill (Qualitative)\",\n    y        = NULL,\n    caption  = \"Schematic -- illustrative based on cited sources\\nWeather: Bauer et al. 2015 | Demand: Makridakis M5 | Political: Tetlock 2005 | Markets: Taleb 2007\"\n  ) +\n  theme_inphronesys(base_size = 13, grid = \"none\") +\n  theme(\n    axis.text.y      = element_text(size = 11, color = iph_colors$dark, lineheight = 1.1),\n    axis.text.x      = element_text(size = 11, face = \"bold\", color = iph_colors$dark),\n    axis.title.x     = element_blank(),\n    panel.background = element_blank()\n  )\n\nggsave(\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/folly_domain_skill_schematic-3.png\",\n       plot = p2, width = 8, height = 5, dpi = 100, bg = \"white\")\n<\/code><\/pre>\n<\/details>\n<hr \/>\n<h2>References<\/h2>\n<ol>\n<li>Bauer, P., Thorpe, A., &amp; Brunet, G. (2015). The quiet revolution of numerical weather prediction. <em>Nature<\/em>, 525, 47\u201355. 
<a href=\"https:\/\/www.nature.com\/articles\/nature14956\">https:\/\/doi.org\/10.1038\/nature14956<\/a><\/li>\n<li>Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., &amp; Tian, Q. (2023). Accurate medium-range global weather forecasting with 3D neural networks. <em>Nature<\/em>, 619, 533\u2013538. <a href=\"https:\/\/www.nature.com\/articles\/s41586-023-06185-3\">https:\/\/doi.org\/10.1038\/s41586-023-06185-3<\/a><\/li>\n<li>Bodnar, C., et al. (2024). A Foundation Model for the Earth System. <em>arXiv:2405.13063<\/em>. <a href=\"https:\/\/arxiv.org\/abs\/2405.13063\">https:\/\/arxiv.org\/abs\/2405.13063<\/a><\/li>\n<li>ECMWF. (2025). EarthCARE and the AIFS Single. <em>ECMWF Newsletter<\/em> 183. <a href=\"https:\/\/www.ecmwf.int\/en\/newsletter\/183\/editorial\/earthcare-and-aifs-single\">https:\/\/www.ecmwf.int\/en\/newsletter\/183\/editorial\/earthcare-and-aifs-single<\/a><\/li>\n<li>Galton, F. (1907). Vox Populi. <em>Nature<\/em>, 75, 450\u2013451. <a href=\"https:\/\/www.nature.com\/articles\/075450a0\">https:\/\/www.nature.com\/articles\/075450a0<\/a><\/li>\n<li>Herodotus. <em>Histories<\/em>, Book 1 (ca. 430 BC). Croesus and the oracle at Delphi: Books 1.53 and 1.91. [Public domain; multiple translations available.]<\/li>\n<li>Kahneman, D., Sibony, O., &amp; Sunstein, C. R. (2021). <em>Noise: A Flaw in Human Judgment<\/em>. Little, Brown Spark. <a href=\"https:\/\/www.hachettebookgroup.com\/titles\/daniel-kahneman\/noise\/9780316451406\/\">https:\/\/www.hachettebookgroup.com\/titles\/daniel-kahneman\/noise\/9780316451406\/<\/a><\/li>\n<li>Lam, R., et al. (2023). Learning skillful medium-range global weather forecasting. <em>Science<\/em>, 382, 1416\u20131421. <a href=\"https:\/\/www.science.org\/doi\/10.1126\/science.adi2336\">https:\/\/doi.org\/10.1126\/science.adi2336<\/a><\/li>\n<li>Lorenz, E. N. (1963). Deterministic nonperiodic flow. <em>Journal of the Atmospheric Sciences<\/em>, 20, 130\u2013141. <a href=\"https:\/\/journals.ametsoc.org\/view\/journals\/atsc\/20\/2\/1520-0469_1963_020_0130_dnf_2_0_co_2.xml\">https:\/\/journals.ametsoc.org\/view\/journals\/atsc\/20\/2\/1520-0469_1963_020_0130_dnf_2_0_co_2.xml<\/a><\/li>\n<li>Taleb, N. N. (2012). <em>Antifragile: Things That Gain from Disorder<\/em>. Random House. <a href=\"https:\/\/www.penguinrandomhouse.com\/books\/176227\/antifragile-by-nassim-nicholas-taleb\/\">https:\/\/www.penguinrandomhouse.com\/books\/176227\/antifragile-by-nassim-nicholas-taleb\/<\/a><\/li>\n<li>Tetlock, P. E. (2005). <em>Expert Political Judgment: How Good Is It? How Can We Know?<\/em> Princeton University Press. <a href=\"https:\/\/press.princeton.edu\/books\/hardcover\/9780691178288\/expert-political-judgment\">https:\/\/press.princeton.edu\/books\/hardcover\/9780691178288\/expert-political-judgment<\/a><\/li>\n<li>Tetlock, P. E., &amp; Gardner, D. (2015). <em>Superforecasting: The Art and Science of Prediction<\/em>. Crown.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>From the Pythia at Delphi to Google&#8217;s GraphCast \u2014 humans have always demanded the future. 
Why some forecasts have gotten dramatically better, why others haven&#8217;t, and why AI will not deliver the silver bullet.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[206,115],"tags":[304,314,8,313,311,312,283,310,309,308],"class_list":["post-1949","post","type-post","status-publish","format-standard","hentry","category-forecasting","category-supply-chain-management","tag-ai","tag-ecmwf","tag-forecasting","tag-graphcast","tag-kahneman","tag-lorenz","tag-m5","tag-taleb","tag-tetlock","tag-weather-forecasting"],"_links":{"self":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1949","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1949"}],"version-history":[{"count":2,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1949\/revisions"}],"predecessor-version":[{"id":1951,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1949\/revisions\/1951"}],"wp:attachment":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1949"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1949"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1949"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}