{"id":1842,"date":"2026-04-14T20:02:18","date_gmt":"2026-04-14T20:02:18","guid":{"rendered":"https:\/\/inphronesys.com\/?p=1842"},"modified":"2026-04-14T20:02:18","modified_gmt":"2026-04-14T20:02:18","slug":"is-your-forecast-any-good-the-forecasters-toolbox","status":"publish","type":"post","link":"https:\/\/inphronesys.com\/?p=1842","title":{"rendered":"Is Your Forecast Any Good? The Forecaster&#8217;s Toolbox"},"content":{"rendered":"<p>A vendor once told me their forecast was 92% accurate.<\/p>\n<p>I asked which metric. They said <em>accuracy<\/em>. That was the moment the meeting ended.<\/p>\n<p>&quot;Accuracy&quot; is not a forecast metric. It is a word people use when they want you to stop asking questions. It carries the self-evident authority of a round number \u2014 92% sounds rigorous, sounds measured, sounds like someone ran a calculation and got a good result. But it could mean anything. It could mean the percentage of periods where the forecast was within 10% of actual. It could mean the percentage of SKUs where the direction was correct. It could mean the person subtracted their MAPE from 100 and called that &quot;accuracy.&quot;<\/p>\n<p>Every one of those definitions will give you a different number. Every one of them can be gamed. And every one of them will look fantastic on a vendor slide until the day your warehouse is stuffed with stock nobody ordered and empty of everything customers actually want.<\/p>\n<p>Good forecasting requires a scorecard. Not a single number, not a marketing figure, but a set of diagnostic tools that each answer a specific question about forecast quality. 
The problem is that most supply chain professionals were never taught which tools exist, what they measure, or when each one lies to you.<\/p>\n<p>That is what this post fixes.<\/p>\n<p>We will cover:<\/p>\n<ul>\n<li>The benchmark every forecast must beat before any metric matters<\/li>\n<li>Four metrics \u2014 MAE, RMSE, MAPE, MASE \u2014 and the specific situations where each one misleads you<\/li>\n<li>A decision matrix for picking the right metric for your situation<\/li>\n<li>Residual diagnostics: the three charts that reveal whether your model is hiding something<\/li>\n<li>The train\/test split: why in-sample accuracy is almost always a lie<\/li>\n<\/ul>\n<p>By the end, you will be able to look at any forecast \u2014 your own, your team&#8217;s, your vendor&#8217;s \u2014 and ask the questions that actually matter. Let&#8217;s build the toolbox.<\/p>\n<h2>The Benchmark That Catches Liars: Naive and Seasonal Naive<\/h2>\n<p>Before we talk about metrics, we need to talk about yardsticks.<\/p>\n<p>A metric measures how far off your forecast was. A benchmark tells you whether that distance is impressive or embarrassing. You need both. Without a benchmark, any metric is just a number with no frame of reference.<\/p>\n<p>The most important benchmark in supply chain forecasting is the <strong>seasonal naive forecast<\/strong>. It is exactly what it sounds like: take last year&#8217;s value for the same period, and use it as the forecast. October&#8217;s forecast is last October&#8217;s actual. The forecast for week 23 of this year is the actual from week 23 of last year.<\/p>\n<p>That&#8217;s it. No model. No parameters. No training. 
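<\/p>
<p>Simple enough to sketch in a few lines. Here is a minimal illustration in plain Python (the post&#8217;s own pipeline is R\/fpp3; this sketch exists only to show how little is going on):<\/p>

```python
def seasonal_naive(history, m, h):
    """Forecast h periods ahead by repeating the most recent full season.

    history: past observations, oldest first
    m:       seasonal period (12 for monthly data, 52 for weekly)
    h:       forecast horizon
    """
    last_season = history[-m:]
    return [last_season[i % m] for i in range(h)]

# two years of monthly demand; the "forecast" for next year is simply last year
two_years = [100, 90, 120, 150, 160, 170, 180, 175, 140, 130, 110, 200] * 2
print(seasonal_naive(two_years, m=12, h=3))  # [100, 90, 120]
```

<p>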
An intern could build it in Excel in twenty minutes.<\/p>\n<p>And here is the embarrassing truth: on real supply chain data, it beats sophisticated models more often than anyone likes to admit.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_snaive_beats_ets-2.png\" alt=\"A sophisticated ETS model loses to Seasonal Naive on the hold-out period\" \/><\/p>\n<p>In this chart, both models were trained on the same historical data. The ETS model is doing something genuinely clever \u2014 it&#8217;s estimating trend, level, and seasonal components simultaneously and updating them as errors arrive. The Seasonal Naive model is doing nothing except copying last year. On the training data, ETS wins comfortably. On the hold-out test period \u2014 the only accuracy that matters \u2014 Seasonal Naive edges ahead.<\/p>\n<p>This is not unusual. It is, in fact, the Hyndman &amp; Athanasopoulos curriculum&#8217;s way of making a blunt point: a complex model that overfits the training data is not better than a simple model that generalises. The benchmark is the tribunal that makes this visible.<\/p>\n<p><strong>The rule:<\/strong> before you report any accuracy metric, always compute the same metric against a naive or seasonal naive baseline. If your model doesn&#8217;t beat the baseline, you don&#8217;t have a model \u2014 you have a more expensive version of doing nothing. The <code>fpp3<\/code> package makes this effortless: <code>SNAIVE()<\/code> and <code>NAIVE()<\/code> are first-class model types in <code>fable<\/code>, fit and evaluated exactly like ETS or ARIMA.<\/p>\n<p>The naive benchmark is not the ceiling you aspire to. It is the floor you must clear. Once you&#8217;ve cleared it, the metrics below tell you by how much.<\/p>\n<h2>Four Metrics, Four Personalities<\/h2>\n<p>Forecast accuracy metrics are not interchangeable. 
Each one measures something different, emphasises different kinds of errors, and breaks down under different conditions. Here are the four metrics that matter most in supply chain \u2014 what each one does, what it hides, and when it will lie to you.<\/p>\n<h3>MAE \u2014 The Honest Average Miss<\/h3>\n<p>MAE stands for Mean Absolute Error. In plain English: take every forecast error (actual minus forecast), ignore whether it&#8217;s positive or negative, and average them. The result is in the same units as your demand \u2014 pieces, pallets, euros, kilograms.<\/p>\n<p><strong>What it tells you:<\/strong> &quot;On average, how many units did I miss by?&quot; If your MAE is 45 units and you produce in batches of 200, you are probably fine. If your MAE is 45 units and your batch size is 50, you have a serious problem. The units-level interpretation is MAE&#8217;s greatest strength \u2014 it is immediately operationally meaningful.<\/p>\n<p><strong>What it hides:<\/strong> The averaging is symmetric and linear. A forecast that misses by 100 units in January and nails February and March perfectly has the same MAE as one that misses by 33 units every single month. These are very different demand patterns, but MAE treats them identically. It also hides the direction of bias \u2014 you can&#8217;t tell from MAE alone whether you are systematically over-forecasting, under-forecasting, or just noisy. For that, you would look at ME (Mean Error) separately, or examine the residual plot (more on that below).<\/p>\n<p><strong>When to use it:<\/strong> MAE is the right default for high-volume, stable SKUs where demand is well above zero and you care primarily about average operational impact. It is the most intuitive metric for operations teams and S&amp;OP conversations \u2014 straightforward to explain, hard to misinterpret, impossible to game through rescaling. 
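<\/p>
<p>One point worth seeing concretely: MAE and bias are separate questions. A small sketch in plain Python (illustrative only, not the post&#8217;s R pipeline):<\/p>

```python
def mae(errors):
    """Mean Absolute Error: the average size of the miss, sign ignored."""
    return sum(abs(e) for e in errors) / len(errors)

def me(errors):
    """Mean Error: the average signed miss, which reveals bias direction."""
    return sum(errors) / len(errors)

# errors = actual - forecast, matching the definition used in this post
noisy  = [30, -30, 30, -30]   # misses in both directions
biased = [30, 30, 30, 30]     # always under-forecasting by 30

print(mae(noisy), mae(biased))  # 30.0 30.0 -- identical MAE
print(me(noisy), me(biased))    # 0.0 30.0  -- only ME exposes the bias
```

<p>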
It is the reference metric that MASE is defined against, which is a strong vote of confidence from the people who invented the alternatives.<\/p>\n<h3>RMSE \u2014 The Metric That Hates Outliers<\/h3>\n<p>RMSE stands for Root Mean Squared Error. The &quot;squared&quot; part is where the action is: instead of taking the absolute value of each error, you square it before averaging, then take the square root at the end. This has one important consequence: large errors get punished much more heavily than small ones.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_mae_vs_rmse-2.png\" alt=\"Two forecasts with identical MAE but very different RMSE \u2014 one has a single large miss\" \/><\/p>\n<p>In this chart, both Forecast A and Forecast B have exactly the same MAE. Their average miss in absolute terms is identical. But look at the error distribution: Forecast A has consistently moderate errors spread across the period. Forecast B is accurate most of the time \u2014 and then catastrophically wrong in one period, with a miss several times larger than anything Forecast A produces.<\/p>\n<p>If you evaluate both forecasts using MAE, they look equally good. If you evaluate them using RMSE, Forecast B looks significantly worse, because that one enormous miss gets squared before averaging. RMSE &quot;sees&quot; the outlier. MAE does not.<\/p>\n<p><strong>When does this matter?<\/strong> Think about the cost structure of your operation. For <strong>perishables<\/strong>, a single large over-forecast means spoilage \u2014 not proportionally worse than a moderate miss, but catastrophically worse. For <strong>critical spare parts<\/strong> in capital equipment, a single large under-forecast means a production line going down. In both cases, the penalty for a big miss is non-linear \u2014 one huge error costs far more than many small ones of the same total size. RMSE models that cost structure. 
MAE does not.<\/p>\n<p><strong>What RMSE hides:<\/strong> Because outlier periods dominate the RMSE calculation, the metric can be misleading if your historical data contains a few anomalous periods (promotions, pandemic disruptions, one-time customer orders). Those periods inflate RMSE in ways that don&#8217;t reflect typical operating conditions. If you compare models on RMSE and the dominant signal is one or two extraordinary months, you may end up selecting for robustness to those specific months rather than general forecasting skill.<\/p>\n<p>Use RMSE when big misses are genuinely more costly than their size suggests, and when your data is clean of one-off outliers that you don&#8217;t need your model to handle.<\/p>\n<h3>MAPE \u2014 The Percentage Everyone Quotes, the Metric That Lies on Slow Movers<\/h3>\n<p>MAPE stands for Mean Absolute Percentage Error. It takes the absolute error for each period and expresses it as a percentage of the actual demand for that period, then averages those percentages. This is why supply chain managers love it: &quot;Our MAPE is 12%&quot; communicates something intuitive, scale-free, and easy to put in a board presentation.<\/p>\n<p>It is also the most dangerous metric to use on slow-moving and intermittent demand.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_mape_explodes-2.png\" alt=\"MAPE percentage errors blow up near zero demand\" \/><\/p>\n<p>Here&#8217;s what goes wrong. The MAPE calculation divides the error by the actual demand. When actual demand is 100 units and you missed by 10, the percentage error is 10% \u2014 sensible. When actual demand is 2 units and you missed by 2, the percentage error is 100% \u2014 painful but interpretable. When actual demand is 1 unit and you missed by 2 (a very common scenario for slow movers), the percentage error is 200%. 
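<\/p>
<p>The arithmetic is worth checking by hand. A quick sketch in plain Python (illustrative; the numbers match the cases just described):<\/p>

```python
def abs_pct_error(actual, forecast):
    """Absolute percentage error for one period: |actual - forecast| / actual.
    The denominator is the ACTUAL, which is what makes it explode near zero."""
    return abs(actual - forecast) / actual * 100

print(abs_pct_error(100, 90))  # 10.0   -- sensible
print(abs_pct_error(2, 4))     # 100.0  -- painful but interpretable
print(abs_pct_error(1, 3))     # 200.0  -- a 2-unit miss, reported as 200%
```

<p>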
When actual demand is 0 and any forecast at all is non-zero, you are dividing by zero \u2014 and MAPE becomes undefined or, in some implementations, infinity.<\/p>\n<p>Dividing by something close to zero makes the percentage explode. A single period of near-zero demand can dominate the entire MAPE calculation, making an otherwise reasonable forecast look catastrophic. Worse, because you&#8217;re dividing by the actual (not the forecast), over-forecasting and under-forecasting are penalised asymmetrically: a forecast of 100 when actual is 1 gives 9,900% error, while a forecast of 0 when actual is 100 gives 100% error. The same absolute miss creates wildly different MAPE values depending on direction.<\/p>\n<p><strong>What MAPE is genuinely good for:<\/strong> High-volume, well-above-zero demand with no intermittent periods. Items like fast-moving consumer goods, core catalogue items with stable high demand, or categories where zero-demand periods simply don&#8217;t occur. In those settings, MAPE is intuitive, scale-free, and easy to communicate. The number means what it sounds like.<\/p>\n<p><strong>What MAPE hides on everything else:<\/strong> Slow movers, spare parts, seasonal items with deep troughs, anything with intermittent demand \u2014 MAPE will misrepresent every one of them. If your portfolio contains a mix of fast and slow movers, averaging MAPE across the portfolio is mathematically meaningless, because the slow-mover percentages will swamp the fast-mover ones regardless of which model performs better overall.<\/p>\n<p>You will see MAPE in almost every vendor dashboard and S&amp;OP report. That does not make it right. It makes it familiar.<\/p>\n<h3>MASE \u2014 The Scale-Free One Rob Hyndman Wishes You&#8217;d Use<\/h3>\n<p>MASE stands for Mean Absolute Scaled Error. 
It was introduced by Rob Hyndman and Anne Koehler in 2006 precisely because MAPE was \u2014 their words \u2014 &quot;widely used, but can be infinite or undefined, and it is not meaningful if actual values are close to zero.&quot;<\/p>\n<p>MASE solves this by changing the denominator. Instead of dividing each error by the actual demand (which causes the explosion on slow movers), MASE divides by the mean absolute error that a seasonal naive forecast would have made on the training data. The result is a ratio: above 1 means your model made larger errors than seasonal naive, below 1 means your model beat it.<\/p>\n<p>In plain English: <strong>MASE below 1 means you beat naive. MASE above 1 means you should fire the model and use naive instead.<\/strong><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_mase_scalefree-2.png\" alt=\"Same model evaluated across three SKU scales \u2014 MAPE varies wildly, MASE stays comparable\" \/><\/p>\n<p>This chart shows the same forecast model applied to three SKUs with very different demand volumes: a slow mover averaging 10 units\/month, a medium-volume item at 1,000 units\/month, and a fast mover at 100,000 units\/month. MAPE produces wildly different numbers for each \u2014 not because the model performs differently, but because MAPE is sensitive to the volume scale. MASE produces similar values across all three, because it is always measuring relative to seasonal naive on that specific SKU&#8217;s own scale. You can meaningfully average MASE across a mixed-volume portfolio. You cannot do that with MAPE.<\/p>\n<p><strong>What makes MASE special:<\/strong> It handles zero-demand periods gracefully (no division by zero problem), it is scale-free (meaningful across SKUs of any volume), it has a built-in interpretation threshold (1.0), and it automatically adjusts for seasonality. 
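<\/p>
<p>The calculation itself is small. A plain-Python sketch of the idea (illustrative; in practice the fpp3 <code>accuracy()<\/code> function computes MASE for you):<\/p>

```python
def mase(actual, forecast, train, m):
    """Mean Absolute Scaled Error.

    Numerator:   MAE of the forecast on the test period.
    Denominator: in-sample MAE of a seasonal naive forecast (lag m) on the
                 training data -- the scale that makes the ratio unit-free.
    """
    mae_fc = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    scale = sum(abs(train[i] - train[i - m])
                for i in range(m, len(train))) / (len(train) - m)
    return mae_fc / scale

# the same demand pattern at two very different volumes
train_small = [10, 20, 12, 22, 11, 21]          # "seasonal" with period m = 2
train_big   = [x * 1000 for x in train_small]

print(mase([13, 23], [12, 25], train_small, m=2))            # 1.0
print(mase([13000, 23000], [12000, 25000], train_big, m=2))  # 1.0 -- scale-free
```

<p>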
The denominator is calculated using the seasonal period of your data, so a monthly model uses a 12-period lag, a weekly model uses a 52-period lag \u2014 no configuration required.<\/p>\n<p><strong>The honest caveat:<\/strong> MASE is harder to explain to stakeholders who&#8217;ve only ever seen MAPE. &quot;Our MASE is 0.78&quot; doesn&#8217;t communicate intuitively to a category manager the way &quot;12% MAPE&quot; does. This is a communication problem, not a statistical one. Use MASE for model selection and portfolio benchmarking. Keep MAPE for high-volume SKU dashboards where it&#8217;s honest. Explain the distinction.<\/p>\n<h2>Which Metric When<\/h2>\n<p>The four metrics each shine in different situations. Here&#8217;s the framework:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_metric_decision_matrix-2.png\" alt=\"Visual decision matrix showing metric suitability across use cases \u2014 six common supply chain situations \u00d7 MAE, RMSE, MAPE, MASE\" \/><\/p>\n<p>A few patterns worth noting:<\/p>\n<p><strong>The mixed-portfolio trap.<\/strong> If you have 500 SKUs ranging from slow movers to fast movers and you want to know which forecast method performs best across the portfolio, MASE is the only valid aggregation. Every other metric will be dominated by volume-outlier SKUs or, in MAPE&#8217;s case, by slow movers with near-zero demand.<\/p>\n<p><strong>The stakeholder communication problem.<\/strong> MASE is hard to explain. For a C-suite dashboard, MAPE or MAE are more intuitive \u2014 but only compute them on the SKUs where they&#8217;re valid. Never show MAPE averaged across a mixed portfolio. 
Consider showing MAE in units alongside MASE as the &quot;honest&quot; metric, so stakeholders get both interpretability and rigour.<\/p>\n<p><strong>The perishables exception.<\/strong> RMSE is the right instinct when your cost function is non-linear \u2014 when one large miss costs much more than many small ones. But RMSE inflates when historical data contains event-driven outliers. If you use RMSE for perishables model selection, clean the anomalous periods from your training data first.<\/p>\n<p><strong>The default.<\/strong> When in doubt, compute all four. The fpp3 <code>accuracy()<\/code> function returns MAE, RMSE, MAPE, and MASE simultaneously \u2014 there is no reason to compute only one metric when you can compute all four in a single line.<\/p>\n<h2>Residual Diagnostics \u2014 Reading the Model&#8217;s Conscience<\/h2>\n<p>Accuracy metrics tell you how far off the forecast was. Residual diagnostics tell you <em>why<\/em> \u2014 or more precisely, whether the model has anything left to learn from the data.<\/p>\n<p><strong>Residuals<\/strong> are the difference between what the model predicted and what actually happened, calculated on the training data where the model had access to each observation. If a model has learned everything there is to learn from the historical pattern, the residuals should look like random noise: no structure, no drift, nothing your eye can grab onto. 
If there is visible structure in the residuals, that structure is signal the model missed \u2014 and it probably means the forecasts will be systematically wrong in predictable ways.<\/p>\n<p>The three-panel residual diagnostic plot is your standard tool for this check.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_residual_diagnostics-2.png\" alt=\"Three-panel residual diagnostics: residuals over time, ACF, and histogram \u2014 good vs bad model\" \/><\/p>\n<p><strong>Panel 1 \u2014 Residuals over time.<\/strong> This is simply the residuals plotted as a time series. What you&#8217;re looking for is nothing \u2014 a flat cloud of points with no visible trend, cycle, or seasonal pattern. If you can see a pattern (drifting upward, oscillating, or with one distinct shift), the model has missed some structure. Common violations in supply chain data: residuals that drift after a structural break (a product launch, a logistics change, a major new customer), or residuals that show seasonal structure (which tells you the model&#8217;s seasonal component is wrong).<\/p>\n<p><strong>Panel 2 \u2014 ACF (Autocorrelation Function).<\/strong> The ACF plot shows whether current residuals are correlated with past residuals at various time lags. In a well-specified model, the autocorrelations at lag 1, lag 2, lag 3&#8230; should all be approximately zero \u2014 knowing what happened last month shouldn&#8217;t help you predict this month&#8217;s residual. <strong>White noise<\/strong> is the technical term for this ideal state. In the ACF plot, white noise looks like bars all staying inside the blue confidence bands. If bars stick out beyond the bands at any lag, that lag is significant \u2014 there is autocorrelation in the residuals, which means there is exploitable signal the model hasn&#8217;t captured. 
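<\/p>
<p>What the ACF panel computes is just the sample autocorrelation at each lag. A bare-bones sketch in plain Python (illustrative; the R <code>feasts<\/code> package does this properly, with confidence bands):<\/p>

```python
import random

def acf(x, lag):
    """Sample autocorrelation of series x at a given lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    ck = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return ck / c0

# structured "residuals": strict alternation -- lag-1 autocorrelation is strongly negative
structured = [1, -1] * 50
print(round(acf(structured, 1), 2))   # -0.99

# white-noise residuals: autocorrelation stays near zero at every lag
random.seed(1)
noise = [random.gauss(0, 1) for _ in range(500)]
print(abs(acf(noise, 1)) < 0.2)       # True
```

<p>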
This is the forecasting equivalent of leaving money on the table.<\/p>\n<p><strong>Panel 3 \u2014 Histogram.<\/strong> The histogram shows the distribution of residuals. You&#8217;re looking for something roughly bell-shaped and centred near zero. A distribution centred near zero means the model is <strong>unbiased<\/strong> \u2014 it&#8217;s not systematically over- or under-forecasting. Because the residuals here are actual minus forecast, a distribution shifted to the right of zero means the model regularly under-forecasts; shifted left means systematic over-forecasting. A well-shaped distribution suggests the residuals are plausibly random; a heavily skewed one tells you the forecast will be wrong in a predictable direction.<\/p>\n<p><strong>The Ljung-Box test<\/strong> is the statistical judge that formalises what your eye sees in the ACF. It tests whether the residuals are <em>legally<\/em> allowed to be called noise \u2014 that is, whether the observed autocorrelation pattern could plausibly have arisen by chance. A high p-value (&gt; 0.05) means you cannot reject the &quot;this is noise&quot; hypothesis. A low p-value means there is statistically significant autocorrelation \u2014 the residuals are not noise, the model has missed something. In fpp3, <code>gg_tsresiduals()<\/code> produces all three panels in one call, and you can run the formal test with <code>Box.test(type = &quot;Ljung-Box&quot;)<\/code> on the residuals vector (the default is Box-Pierce, so set the type explicitly) or with <code>features(.innov, ljung_box, lag = 12)<\/code> on the augmented tibble.<\/p>\n<p>Residual diagnostics are not pass\/fail in practice. Almost no real-world model produces perfectly white noise residuals on messy supply chain data. The question is: how much structure is left? Is it economically meaningful? A slight first-lag autocorrelation on monthly demand might be statistically significant but operationally negligible. A strong seasonal pattern in the residuals of what was supposed to be a seasonal model is never negligible.<\/p>\n<p>Look at the residuals. 
Always.<\/p>\n<h2>The Train\/Test Split \u2014 The Only Honest Accuracy Test<\/h2>\n<p>Here is a fact that should make you uncomfortable: every forecasting model ever fitted to historical data will look better on that same historical data than on any future data it has never seen.<\/p>\n<p>This is not a flaw of bad models. It is a mathematical certainty for all models. When a model is fitted to a dataset, it adjusts its parameters to explain the observed history as well as possible \u2014 and it will always do this at least a little too well, capturing some of the random noise in that specific history as if it were real signal. Apply the model to new data, and those &quot;signals&quot; turn out to be noise \u2014 the model is wrong in ways the training accuracy never predicted.<\/p>\n<p><strong>In-sample accuracy is not accuracy. It is a description of the past.<\/strong><\/p>\n<p>The only honest test of forecast skill is holding out data the model has never seen \u2014 fitting the model on an earlier period, then measuring performance on a later period that was deliberately excluded from fitting.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_train_test_split-2.png\" alt=\"Time series with training and test hold-out period shaded, showing where accuracy is measured\" \/><\/p>\n<p>The mechanics are simple. Take your full time series and split it at a cutoff date. Everything before the cutoff becomes the <strong>training set<\/strong> \u2014 the model sees this data and learns from it. Everything after the cutoff is the <strong>test set<\/strong> (sometimes called the hold-out set) \u2014 the model is evaluated here, but these observations were never available during fitting. The accuracy metrics you compute on the test set are what you report. The accuracy on the training set is, at best, a sanity check.<\/p>\n<p>How much data to hold out? 
The standard guidance is to use a test set at least as long as the forecast horizon you care about. If you want 12-month forecasts, hold out at least 12 months. Longer is generally better \u2014 a hold-out of 6 periods gives you a noisier accuracy estimate than a hold-out of 24 periods. But your training set also needs enough data to fit the model reliably, particularly for seasonal models that need to observe several full seasonal cycles.<\/p>\n<p>In fpp3, <code>filter_index()<\/code> makes this split clean and readable in a single pipe \u2014 see the collapsible R code at the end of the post for the full pipeline.<\/p>\n<p><strong>A preview of what&#8217;s coming.<\/strong> The single train\/test split is the honest test. But it has a weakness: the result depends on where you draw the line. A model that happened to perform well during 2023 might have performed very differently if you&#8217;d held out 2021 instead. <strong>Time series cross-validation<\/strong> \u2014 running the honest test many times, at many cutoff points, and averaging the results \u2014 is the rigorous solution to this problem. That is the technique we&#8217;ll use in the next post to pit six models against each other and pick a winner.<\/p>\n<h2>Interactive Dashboard<\/h2>\n<p>The four metrics only become intuitive once you&#8217;ve <em>felt<\/em> what they do on different demand profiles. 
You can read about MAPE exploding on slow movers \u2014 or you can turn a slider and watch it happen.<\/p>\n<div class=\"dashboard-link\" style=\"margin:2em 0; padding:1.5em; background:#f8f9fa; border-left:4px solid #0073aa; border-radius:4px;\">\n<p style=\"margin:0 0 0.5em 0; font-size:1.1em;\"><strong>Interactive Dashboard<\/strong><\/p>\n<p style=\"margin:0 0 1em 0;\">Explore the data yourself \u2014 adjust parameters and see the results update in real time.<\/p>\n<p><a href=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/2026-04-11_Forecasters_Toolbox_Accuracy_FPP3_dashboard-2.html\" target=\"_blank\" style=\"display:inline-block; padding:0.6em 1.2em; background:#0073aa; color:#fff; text-decoration:none; border-radius:4px; font-weight:bold;\">Open Interactive Dashboard &rarr;<\/a><\/div>\n<h2>What&#8217;s Next<\/h2>\n<p>This post handed you the toolbox.<\/p>\n<p>You can now look at any forecast and ask the right questions: What benchmark does this beat? Which metric is being quoted, and is it the honest one for this SKU type? What do the residuals look like? Was this accuracy measured in-sample or on a real hold-out?<\/p>\n<p>The next post swings the hammer.<\/p>\n<p><strong>&quot;I Ran 6 Models on Real Demand Data \u2014 Here&#8217;s How I Picked the Winner&quot;<\/strong> takes a single real demand series and runs six models against it \u2014 Naive, Seasonal Naive, MEAN, ETS, STL+ETS, and ARIMA \u2014 using a proper train\/test split and time series cross-validation to crown a winner. The metric I use to pick it? MASE. Everything you learned today is the reason that choice makes sense.<\/p>\n<hr \/>\n<details>\n<summary><strong>Show R Code<\/strong><\/summary>\n<pre><code class=\"language-r\"># ==============================================================================\n# Is Your Forecast Any Good? 
The Forecaster's Toolbox\n# Generates 7 images for the blog post\n# Run from project root: Rscript Scripts\/generate_toolbox_accuracy_images.R\n# ==============================================================================\n\nlibrary(fpp3)\nlibrary(patchwork)\nlibrary(scales)\n\nsource(&quot;Scripts\/theme_inphronesys.R&quot;)\n\nset.seed(42)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# 1. Synthetic monthly demand series (12 years, trend + seasonality + noise)\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\nn_months &lt;- 144  # 12 years\n\ndemand_sim &lt;- tibble(\n  month  = yearmonth(seq(as.Date(&quot;2012-01-01&quot;), by = &quot;month&quot;, length.out = n_months)),\n  demand = 1200 +                                    # base level\n    seq(0, 400, length.out = n_months) +             # upward trend\n    200 * sin(2 * pi * (1:n_months) \/ 12) +          # annual seasonality\n    rnorm(n_months, 0, 80)                           # noise\n) |&gt;\n  mutate(demand = pmax(demand, 50)) |&gt;               # floor at 50\n  as_tsibble(index = month)\n\n# Train\/test split: last 12 months held out\ntrain &lt;- demand_sim |&gt; 
filter_index(. ~ &quot;2022 Dec&quot;)\ntest  &lt;- demand_sim |&gt; filter_index(&quot;2023 Jan&quot; ~ .)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# IMAGE 1: toolbox_snaive_beats_ets.png\n# ETS vs SNAIVE on hold-out \u2014 SNAIVE wins\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\nfit_bench &lt;- train |&gt;\n  model(\n    ETS    = ETS(demand),\n    SNAIVE = SNAIVE(demand)\n  )\n\nfc_bench &lt;- fit_bench |&gt; forecast(h = 12)\n\n# Compute accuracy on test set\nacc_bench &lt;- accuracy(fc_bench, demand_sim)\n\n# Extract forecasts for plotting\nfc_wide &lt;- fc_bench |&gt;\n  as_tibble() |&gt;\n  select(month, .model, .mean) |&gt;\n  pivot_wider(names_from = .model, values_from = .mean)\n\n# Build the plot\np_bench &lt;- demand_sim |&gt;\n  as_tibble() |&gt;\n  mutate(period = if_else(month &gt;= yearmonth(&quot;2023 Jan&quot;), &quot;Test&quot;, &quot;Train&quot;)) |&gt;\n  ggplot(aes(x = month, y = demand)) +\n  # shaded test region background\n  annotate(&quot;rect&quot;,\n    xmin = yearmonth(&quot;2023 Jan&quot;), xmax = max(demand_sim$month),\n    ymin = -Inf, ymax = Inf,\n    fill = iph_colors$lightgrey, alpha = 0.4\n  ) 
+\n  geom_line(color = iph_colors$navy, linewidth = 0.6) +\n  # SNAIVE forecast\n  geom_line(\n    data = fc_wide,\n    aes(x = month, y = SNAIVE),\n    color = iph_colors$blue, linewidth = 1.0, linetype = &quot;dashed&quot;\n  ) +\n  # ETS forecast\n  geom_line(\n    data = fc_wide,\n    aes(x = month, y = ETS),\n    color = iph_colors$red, linewidth = 1.0, linetype = &quot;dashed&quot;\n  ) +\n  annotate(&quot;text&quot;, x = yearmonth(&quot;2023 Apr&quot;), y = 2050,\n    label = &quot;Seasonal Naive&quot;, color = iph_colors$blue, size = 3.5, fontface = &quot;bold&quot;,\n    hjust = 0) +\n  annotate(&quot;text&quot;, x = yearmonth(&quot;2023 Apr&quot;), y = 1900,\n    label = &quot;ETS&quot;, color = iph_colors$red, size = 3.5, fontface = &quot;bold&quot;,\n    hjust = 0) +\n  annotate(&quot;text&quot;, x = yearmonth(&quot;2023 Mar&quot;), y = max(demand_sim$demand) * 0.97,\n    label = &quot;\u2190 Test period&quot;, color = &quot;grey50&quot;, size = 3.2, hjust = 0.5) +\n  labs(\n    title    = &quot;Seasonal Naive Beats ETS on the Hold-Out&quot;,\n    subtitle = &quot;A simpler model wins on unseen data&quot;,\n    x        = NULL,\n    y        = &quot;Monthly demand (units)&quot;\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;)\n\n# output path is relative to the working directory\nggsave(&quot;toolbox_snaive_beats_ets.png&quot;, p_bench,\n       width = 8, height = 5, dpi = 100, bg = &quot;white&quot;)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# IMAGE 2: toolbox_mae_vs_rmse.png\n# Two forecasts: identical MAE, different RMSE 
(one has a huge outlier miss)\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\n# 12 periods; craft errors so MAE is equal\nerrors_A &lt;- c(20, -18, 15, -22, 19, -17, 21, -20, 18, -16, 22, -14)  # avg abs ~ 18.5\nerrors_B &lt;- c(2, -3, 1, -2, 3, -1, 2, -3, 140, -2, 1, -3)\n\n# Force both to identical MAE: keep Forecast B's small errors small, then\n# size the single outlier so the total absolute error matches Forecast A\ntarget_mae &lt;- 16\nerrors_A &lt;- errors_A * (target_mae \/ mean(abs(errors_A)))\nerrors_B_scaled &lt;- errors_B\nerrors_B_scaled[-9] &lt;- errors_B_scaled[-9] * 4 \/ mean(abs(errors_B_scaled[-9]))  # small errors: mean |e| = 4\nerrors_B_scaled[9]  &lt;- target_mae * 12 - sum(abs(errors_B_scaled[-9]))  # outlier absorbs the rest (148)\n\nperiod &lt;- 1:12\ndf_errors &lt;- tibble(\n  period   = rep(period, 2),\n  error    = c(errors_A, errors_B_scaled),\n  forecast = rep(c(&quot;Forecast A\\n(uniform errors)&quot;, &quot;Forecast B\\n(one huge miss)&quot;), each = 12)\n)\n\n# MAE and RMSE labels\nmae_A  &lt;- mean(abs(errors_A))\nmae_B  &lt;- mean(abs(errors_B_scaled))\nrmse_A &lt;- sqrt(mean(errors_A^2))\nrmse_B &lt;- sqrt(mean(errors_B_scaled^2))\n\nlabel_df &lt;- tibble(\n  forecast = c(&quot;Forecast A\\n(uniform errors)&quot;, &quot;Forecast B\\n(one huge miss)&quot;),\n  label    = c(\n    sprintf(&quot;MAE = %.0f  |  RMSE = %.0f&quot;, mae_A, rmse_A),\n    sprintf(&quot;MAE = %.0f  |  RMSE = %.0f&quot;, mae_B, rmse_B)\n  )\n)\n\np_mae_rmse &lt;- df_errors |&gt;\n  ggplot(aes(x = period, y = error,\n             fill = if_else(abs(error) &gt; 80, &quot;highlight&quot;, 
&quot;normal&quot;))) +\n  geom_col(width = 0.7, show.legend = FALSE) +\n  geom_hline(yintercept = 0, color = &quot;grey50&quot;, linewidth = 0.4) +\n  scale_fill_manual(values = c(&quot;highlight&quot; = iph_colors$red, &quot;normal&quot; = iph_colors$blue)) +\n  geom_text(\n    data = label_df,\n    aes(x = 6.5, y = Inf, label = label),\n    inherit.aes = FALSE,\n    vjust = 2, size = 3.3, fontface = &quot;bold&quot;, color = iph_colors$navy\n  ) +\n  facet_wrap(~forecast, nrow = 1) +\n  scale_x_continuous(breaks = 1:12, labels = paste0(&quot;M&quot;, 1:12)) +\n  labs(\n    title    = &quot;Same MAE, Very Different RMSE&quot;,\n    subtitle = &quot;RMSE punishes the outlier miss that MAE misses&quot;,\n    x        = &quot;Period&quot;,\n    y        = &quot;Forecast error (units)&quot;\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;)\n\nggsave(&quot;https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_mae_vs_rmse-2.png&quot;, p_mae_rmse,\n       width = 8, height = 5, dpi = 100, bg = &quot;white&quot;)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# IMAGE 3: toolbox_mape_explodes.png\n# MAPE on low-volume SKU \u2014 percentage errors blow up near zero demand\n# 
\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\nset.seed(99)\nslow_mover &lt;- tibble(\n  period   = 1:24,\n  actual   = c(rpois(24, lambda = 1.8)),\n  forecast = c(rpois(24, lambda = 2.2))\n) |&gt;\n  mutate(\n    abs_error = abs(actual - forecast),\n    pct_error = if_else(actual &gt; 0, abs_error \/ actual * 100, NA_real_)\n  )\n\n# dual-axis style: bar for actual demand, line for pct error\np_mape &lt;- ggplot(slow_mover) +\n  geom_col(aes(x = period, y = actual),\n           fill = iph_colors$lightgrey, width = 0.7) +\n  geom_line(aes(x = period, y = pct_error \/ 4),\n            color = iph_colors$red, linewidth = 1.1, na.rm = TRUE) +\n  geom_point(aes(x = period, y = pct_error \/ 4),\n             color = iph_colors$red, size = 2.5, na.rm = TRUE) +\n  # label a few exploding points\n  geom_text(\n    data = slow_mover |&gt; filter(pct_error &gt; 150, !is.na(pct_error)),\n    aes(x = period, y = pct_error \/ 4, label = paste0(round(pct_error, 0), &quot;%&quot;)),\n    vjust = -0.8, size = 3, color = iph_colors$red, fontface = &quot;bold&quot;\n  ) +\n  scale_y_continuous(\n    name     = &quot;Actual demand (units)&quot;,\n    sec.axis = sec_axis(~ . 
* 4, name = &quot;MAPE (%)&quot;)\n  ) +\n  labs(\n    title    = &quot;MAPE Explodes on Slow Movers&quot;,\n    subtitle = &quot;Near-zero demand turns moderate misses into extreme percentages&quot;,\n    x        = &quot;Period&quot;\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;)\n\nggsave(&quot;https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_mape_explodes-2.png&quot;, p_mape,\n       width = 8, height = 5, dpi = 100, bg = &quot;white&quot;)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# IMAGE 4: toolbox_mase_scalefree.png\n# Same model on 3 SKU scales \u2014 MAPE varies wildly, MASE stays comparable\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\nmake_sku &lt;- function(base_level, seed_val) {\n  set.seed(seed_val)\n  n &lt;- 36\n  tibble(\n    month  = yearmonth(seq(as.Date(&quot;2020-01-01&quot;), by = &quot;month&quot;, length.out = n)),\n    demand = base_level +\n      base_level * 0.15 * sin(2 * pi * (1:n) \/ 12) +\n      rnorm(n, 0, base_level * 0.08)\n  ) |&gt;\n    mutate(demand = pmax(demand, 1)) |&gt;\n    as_tsibble(index = month)\n}\n\nsku_list &lt;- list(\n  &quot;SKU A\\n(~10 units\/month)&quot;    
   = make_sku(10,     1),\n  &quot;SKU B\\n(~1,000 units\/month)&quot;    = make_sku(1000,   2),\n  &quot;SKU C\\n(~100,000 units\/month)&quot;  = make_sku(100000, 3)\n)\n\nget_metrics &lt;- function(ts_data, sku_label) {\n  tr &lt;- ts_data |&gt; filter_index(. ~ &quot;2022 Jun&quot;)\n  te &lt;- ts_data |&gt; filter_index(&quot;2022 Jul&quot; ~ .)\n  \n  fit &lt;- tr |&gt; model(ETS = ETS(demand))\n  fc  &lt;- fit |&gt; forecast(h = 6)\n  acc &lt;- accuracy(fc, ts_data)\n  \n  tibble(\n    sku  = sku_label,\n    MAPE = round(acc$MAPE, 1),\n    MASE = round(acc$MASE, 2)\n  )\n}\n\nmetrics_df &lt;- purrr::imap_dfr(sku_list, get_metrics)\n\n# Side-by-side bar comparison\np_mase_data &lt;- metrics_df |&gt;\n  pivot_longer(c(MAPE, MASE), names_to = &quot;metric&quot;, values_to = &quot;value&quot;)\n\n# Normalise MAPE to share axis with MASE for display (just show them in facets)\np_mase &lt;- p_mase_data |&gt;\n  ggplot(aes(x = sku, y = value, fill = metric)) +\n  geom_col(width = 0.55, show.legend = FALSE) +\n  geom_text(aes(label = value), vjust = -0.4, size = 3.2, fontface = &quot;bold&quot;,\n            color = iph_colors$navy) +\n  geom_hline(data = tibble(metric = &quot;MASE&quot;, yintercept = 1),\n             aes(yintercept = yintercept),\n             linetype = &quot;dashed&quot;, color = iph_colors$red, linewidth = 0.7) +\n  scale_fill_manual(values = c(&quot;MAPE&quot; = iph_colors$lightgrey, &quot;MASE&quot; = iph_colors$blue)) +\n  facet_wrap(~metric, scales = &quot;free_y&quot;) +\n  labs(\n    title    = &quot;MASE Stays Stable Across Volume Scales \u2014 MAPE Doesn't&quot;,\n    subtitle = &quot;MASE = 1 (dashed) is the naive threshold: below = beats naive, above = worse than naive&quot;,\n    x        = NULL,\n    y        = &quot;Metric value&quot;\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;)\n\nggsave(&quot;https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_mase_scalefree-2.png&quot;, p_mase,\n       width = 8, height = 5, 
dpi = 100, bg = &quot;white&quot;)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# IMAGE 5: toolbox_metric_decision_matrix.png\n# Visual decision matrix (metric \u00d7 use case) as styled grid\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\nuse_cases &lt;- c(\n  &quot;High-volume stable SKU&quot;,\n  &quot;Slow mover \/ intermittent&quot;,\n  &quot;Mixed portfolio comparison&quot;,\n  &quot;C-suite dashboard&quot;,\n  &quot;Academic \/ model selection&quot;,\n  &quot;Penalise large misses&quot;\n)\n\ndecision_long &lt;- expand_grid(\n  use_case = factor(use_cases, levels = rev(use_cases)),\n  metric   = c(&quot;MAE&quot;, &quot;RMSE&quot;, &quot;MAPE&quot;, &quot;MASE&quot;)\n) |&gt;\n  mutate(\n    rating = case_when(\n      use_case == &quot;High-volume stable SKU&quot;       &amp; metric %in% c(&quot;MAE&quot;,&quot;RMSE&quot;,&quot;MAPE&quot;,&quot;MASE&quot;) ~ &quot;good&quot;,\n      use_case == &quot;Slow mover \/ intermittent&quot;    &amp; metric == &quot;MAPE&quot;                          ~ &quot;bad&quot;,\n      use_case == &quot;Slow mover \/ intermittent&quot;    &amp; metric %in% 
c(&quot;MAE&quot;,&quot;RMSE&quot;,&quot;MASE&quot;)        ~ &quot;good&quot;,\n      use_case == &quot;Mixed portfolio comparison&quot;   &amp; metric == &quot;MAPE&quot;                          ~ &quot;bad&quot;,\n      use_case == &quot;Mixed portfolio comparison&quot;   &amp; metric %in% c(&quot;MAE&quot;,&quot;RMSE&quot;)               ~ &quot;caution&quot;,\n      use_case == &quot;Mixed portfolio comparison&quot;   &amp; metric == &quot;MASE&quot;                          ~ &quot;good&quot;,\n      use_case == &quot;C-suite dashboard&quot;            &amp; metric %in% c(&quot;MAE&quot;,&quot;MAPE&quot;)               ~ &quot;good&quot;,\n      use_case == &quot;C-suite dashboard&quot;            &amp; metric %in% c(&quot;RMSE&quot;,&quot;MASE&quot;)              ~ &quot;caution&quot;,\n      use_case == &quot;Academic \/ model selection&quot;   &amp; metric == &quot;MAPE&quot;                          ~ &quot;bad&quot;,\n      use_case == &quot;Academic \/ model selection&quot;   &amp; metric %in% c(&quot;MAE&quot;,&quot;RMSE&quot;)                ~ &quot;caution&quot;,\n      use_case == &quot;Academic \/ model selection&quot;   &amp; metric == &quot;MASE&quot;                          ~ &quot;good&quot;,\n      use_case == &quot;Penalise large misses&quot;        &amp; metric == &quot;RMSE&quot;                          ~ &quot;good&quot;,\n      use_case == &quot;Penalise large misses&quot;        &amp; metric == &quot;MAPE&quot;                          ~ &quot;bad&quot;,\n      use_case == &quot;Penalise large misses&quot;        &amp; metric %in% c(&quot;MAE&quot;,&quot;MASE&quot;)               ~ &quot;caution&quot;,\n      TRUE ~ &quot;caution&quot;\n    ),\n    label = case_when(\n      rating == &quot;good&quot;    ~ &quot;\u2713&quot;,\n      rating == &quot;caution&quot; ~ &quot;&#x26a0;&quot;,\n      rating == &quot;bad&quot;     ~ &quot;\u2717&quot;\n    )\n  )\n\np_matrix &lt;- decision_long |&gt;\n  ggplot(aes(x = metric, y = use_case, fill 
= rating)) +\n  geom_tile(color = &quot;white&quot;, linewidth = 1.5) +\n  geom_text(aes(label = label), size = 6, color = &quot;white&quot;, fontface = &quot;bold&quot;) +\n  scale_fill_manual(\n    values = c(\n      &quot;good&quot;    = iph_colors$blue,\n      &quot;caution&quot; = iph_colors$orange,\n      &quot;bad&quot;     = iph_colors$red\n    ),\n    labels = c(&quot;good&quot; = &quot;Recommended \u2713&quot;, &quot;caution&quot; = &quot;Use with caution &#x26a0;&quot;, &quot;bad&quot; = &quot;Avoid \u2717&quot;),\n    name = NULL\n  ) +\n  labs(\n    title    = &quot;Which Accuracy Metric for Which Situation?&quot;,\n    subtitle = &quot;No single metric is best everywhere&quot;,\n    x        = NULL,\n    y        = NULL\n  ) +\n  theme_inphronesys(grid = &quot;none&quot;) +\n  theme(\n    legend.position = &quot;bottom&quot;,\n    axis.text.x     = element_text(face = &quot;bold&quot;, size = 13)\n  )\n\nggsave(&quot;https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_metric_decision_matrix-2.png&quot;, p_matrix,\n       width = 8, height = 5, dpi = 100, bg = &quot;white&quot;)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# IMAGE 6: toolbox_residual_diagnostics.png\n# 3-panel residual diagnostics: good model (left) vs bad model (right)\n# 800\u00d7700px (multi-panel)\n# 
\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\n# Good model: well-specified ETS on the synthetic demand\nfit_good &lt;- train |&gt; model(ETS = ETS(demand))\nresid_good &lt;- augment(fit_good) |&gt; pull(.resid)\n\n# Bad model: MEAN() on seasonal data \u2014 will have strong autocorrelation\nfit_bad &lt;- train |&gt; model(MEAN = MEAN(demand))\nresid_bad &lt;- augment(fit_bad) |&gt; pull(.resid)\n\n# Helper: 3-panel residual plot from a vector of residuals\nmake_resid_panels &lt;- function(resids, title_prefix) {\n  n &lt;- length(resids)\n  df &lt;- tibble(t = 1:n, r = resids)\n  \n  # Panel 1: residuals over time\n  p1 &lt;- df |&gt;\n    ggplot(aes(x = t, y = r)) +\n    geom_hline(yintercept = 0, color = &quot;grey60&quot;, linewidth = 0.5) +\n    geom_line(color = iph_colors$blue, linewidth = 0.6) +\n    labs(title = paste0(title_prefix, &quot;: Residuals over time&quot;), x = NULL, y = &quot;Residual&quot;) +\n    theme_inphronesys(grid = &quot;y&quot;)\n  \n  # Panel 2: ACF\n  acf_vals &lt;- acf(resids, plot = FALSE, lag.max = 20)\n  acf_df   &lt;- tibble(lag = acf_vals$lag[-1], acf = acf_vals$acf[-1])\n  ci        &lt;- qnorm(0.975) \/ sqrt(n)\n  \n  p2 &lt;- acf_df |&gt;\n    ggplot(aes(x = lag, y = acf,\n               fill = abs(acf) &gt; ci)) +\n    geom_col(width = 0.7, show.legend = FALSE) +\n    geom_hline(yintercept = c(-ci, ci), linetype = &quot;dashed&quot;,\n               color = iph_colors$navy, linewidth = 0.5) +\n    scale_fill_manual(values = c(&quot;FALSE&quot; = iph_colors$lightgrey, &quot;TRUE&quot; = iph_colors$red)) +\n    
labs(title = paste0(title_prefix, &quot;: ACF&quot;), x = &quot;Lag&quot;, y = &quot;ACF&quot;) +\n    theme_inphronesys(grid = &quot;y&quot;)\n  \n  # Panel 3: histogram\n  p3 &lt;- df |&gt;\n    ggplot(aes(x = r)) +\n    geom_histogram(bins = 20, fill = iph_colors$blue, color = &quot;white&quot;, alpha = 0.85) +\n    geom_vline(xintercept = 0, color = iph_colors$red, linetype = &quot;dashed&quot;, linewidth = 0.8) +\n    labs(title = paste0(title_prefix, &quot;: Distribution&quot;), x = &quot;Residual&quot;, y = &quot;Count&quot;) +\n    theme_inphronesys(grid = &quot;y&quot;)\n  \n  list(p1, p2, p3)\n}\n\npanels_good &lt;- make_resid_panels(resid_good, &quot;Good model (ETS)&quot;)\npanels_bad  &lt;- make_resid_panels(resid_bad,  &quot;Bad model (MEAN)&quot;)\n\np_diag &lt;- wrap_plots(\n  panels_good[[1]], panels_bad[[1]],\n  panels_good[[2]], panels_bad[[2]],\n  panels_good[[3]], panels_bad[[3]],\n  ncol = 2\n)\n\nggsave(&quot;https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_residual_diagnostics-2.png&quot;, p_diag,\n       width = 8, height = 7, dpi = 100, bg = &quot;white&quot;)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# IMAGE 7: toolbox_train_test_split.png\n# Time series with training and test hold-out period shaded\n# 
\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\ncutoff &lt;- yearmonth(&quot;2023 Jan&quot;)\n\np_split &lt;- demand_sim |&gt;\n  as_tibble() |&gt;\n  ggplot(aes(x = month, y = demand)) +\n  annotate(&quot;rect&quot;,\n    xmin = cutoff, xmax = max(demand_sim$month),\n    ymin = -Inf, ymax = Inf,\n    fill = iph_colors$blue, alpha = 0.12\n  ) +\n  geom_line(color = iph_colors$navy, linewidth = 0.75) +\n  annotate(&quot;segment&quot;,\n    x = cutoff, xend = cutoff,\n    y = min(demand_sim$demand) * 0.95, yend = max(demand_sim$demand) * 1.05,\n    color = iph_colors$blue, linewidth = 1.0, linetype = &quot;dashed&quot;\n  ) +\n  annotate(&quot;text&quot;,\n    x = yearmonth(&quot;2019 Jan&quot;), y = max(demand_sim$demand) * 1.03,\n    label = &quot;Training set (model learns here)&quot;,\n    hjust = 0.5, size = 3.4, color = iph_colors$navy, fontface = &quot;bold&quot;\n  ) +\n  annotate(&quot;text&quot;,\n    x = yearmonth(&quot;2023 Jul&quot;), y = max(demand_sim$demand) * 1.03,\n    label = &quot;Test set\\n(accuracy measured here)&quot;,\n    hjust = 0.5, size = 3.4, color = iph_colors$blue, fontface = &quot;bold&quot;\n  ) +\n  labs(\n    title    = &quot;The Only Honest Accuracy Test&quot;,\n    subtitle = &quot;Fit on training data. 
Evaluate on unseen test data.&quot;,\n    x        = NULL,\n    y        = &quot;Monthly demand (units)&quot;\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;)\n\nggsave(&quot;https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/toolbox_train_test_split-2.png&quot;, p_split,\n       width = 8, height = 4, dpi = 100, bg = &quot;white&quot;)\n\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# Full accuracy comparison (for reference \/ bonus sanity check)\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\nfit_all &lt;- train |&gt;\n  model(\n    ETS      = ETS(demand),\n    SNAIVE   = SNAIVE(demand),\n    NAIVE    = NAIVE(demand),\n    MEAN     = MEAN(demand)\n  )\n\nfc_all &lt;- fit_all |&gt; forecast(h = 12)\naccuracy(fc_all, demand_sim) |&gt;\n  select(.model, MAE, RMSE, MAPE, MASE) |&gt;\n  arrange(MASE) |&gt;\n  print()\n\n# 
\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n# MAPE on low-volume demonstration (raw numbers, for context)\n# \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n\n# Percentage error: |actual - forecast| \/ actual * 100 (MAPE is the mean of these)\n# When actual = 0: undefined \/ Inf\n# When actual = 1, miss of 2: 200%\n# When actual = 100, miss of 2: 2%\n\ndemo_mape &lt;- tibble(\n  actual   = c(100, 10, 2, 1, 0),\n  forecast = c(102, 12, 4, 3, 2),\n  miss     = abs(actual - forecast)\n) |&gt;\n  mutate(\n    APE = if_else(actual &gt; 0, miss \/ actual * 100, Inf)  # per-row absolute percentage error\n  )\n\ncat(&quot;\\nAbsolute percentage error at different volume levels:\\n&quot;)\nprint(demo_mape)\n\n# Ljung-Box test on ETS residuals\n# (strictly, fitdf should equal the number of estimated ETS parameters;\n# left at the default 0 here for simplicity)\nresid_tsibble &lt;- augment(fit_good) |&gt;\n  select(month, .resid)\n\nlb_result &lt;- Box.test(resid_tsibble$.resid, lag = 12, type = &quot;Ljung-Box&quot;)\ncat(sprintf(\n  &quot;\\nLjung-Box test on ETS residuals (lag 12): X\u00b2 = %.2f, p-value = %.4f\\n&quot;,\n  lb_result$statistic, lb_result$p.value\n))\ncat(&quot;Interpretation: p &gt; 0.05 means the residuals are indistinguishable from white noise.\\n&quot;)\n\ncat(&quot;\\nAll 7 images saved to 
Images\/toolbox_*.png\\n&quot;)\n<\/code><\/pre>\n<\/details>\n<h2>References<\/h2>\n<ul>\n<li>Hyndman, R.J. &amp; Athanasopoulos, G. (2021). <em>Forecasting: Principles and Practice<\/em> (3rd ed.). Chapter 5. https:\/\/otexts.com\/fpp3\/accuracy.html<\/li>\n<li>Hyndman, R.J. &amp; Koehler, A.B. (2006). &quot;Another look at measures of forecast accuracy.&quot; <em>International Journal of Forecasting<\/em>, 22(4), 679\u2013688.<\/li>\n<li>Hyndman, R.J. blog post: &quot;Errors on percentage errors.&quot; https:\/\/robjhyndman.com\/hyndsight\/smape\/<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Four acronyms decide whether you trust a forecast: MAE, MAPE, RMSE, MASE. Here is when each one lies to you \u2014 and the one benchmark that catches them all.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13,206,115],"tags":[8,127,278,203,276,280,15,279,277,26],"class_list":["post-1842","post","type-post","status-publish","format-standard","hentry","category-data-science","category-forecasting","category-supply-chain-management","tag-forecasting","tag-fpp3","tag-mae","tag-mape","tag-mase","tag-naive-benchmark","tag-r","tag-residual-diagnostics","tag-rmse","tag-supply-chain-analytics"],"_links":{"self":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1842","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1842"}],"version-history":[{"count":1,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1
842\/revisions"}],"predecessor-version":[{"id":1843,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1842\/revisions\/1843"}],"wp:attachment":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1842"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1842"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1842"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}