{"id":1874,"date":"2026-04-16T22:10:28","date_gmt":"2026-04-16T22:10:28","guid":{"rendered":"https:\/\/inphronesys.com\/?p=1874"},"modified":"2026-04-16T22:10:28","modified_gmt":"2026-04-16T22:10:28","slug":"the-m5-lesson-why-simple-still-beats-fancy-in-supply-chain-forecasting","status":"publish","type":"post","link":"https:\/\/inphronesys.com\/?p=1874","title":{"rendered":"The M5 Lesson: Why Simple Still Beats Fancy in Supply Chain Forecasting"},"content":{"rendered":"<p>In 2020, <strong>7,092 people<\/strong> spent months forecasting Walmart sales for a share of a $100,000 prize pool. The winner didn&#8217;t use a transformer. They didn&#8217;t use an LSTM. They used <strong>gradient-boosted trees<\/strong> \u2014 the same 2017-era algorithm your data science intern probably knows. And plain <strong>Seasonal Naive<\/strong> \u2014 one of the simplest forecasts ever invented, with no parameters to tune, literally &quot;what happened last week will happen next week&quot; \u2014 was shockingly hard to beat.<\/p>\n<p>This is the story of the M5 competition. And six years later, as your LinkedIn feed fills up with &quot;Time Series Foundation Models will transform forecasting,&quot; it is the single most inconvenient receipt in the field.<\/p>\n<h2>What the M5 competition actually was<\/h2>\n<p>The M-competitions, organised since 1982 by Prof. Spyros Makridakis, are the closest thing forecasting has to the Olympics. 
M5, run on Kaggle in 2020, was the first edition that used <strong>real, messy, SKU-level retail data<\/strong> \u2014 the kind of data supply chain people actually stare at on Monday mornings.<\/p>\n<p>The ingredients:<\/p>\n<ul>\n<li><strong>42,840 hierarchical time series<\/strong> from <strong>Walmart<\/strong> \u2014 3,049 products sold in 10 stores across 3 US states, with about five years of daily history<\/li>\n<li>At the most granular level, <strong>30,490 SKU-store series<\/strong> (3,049 products \u00d7 10 stores) \u2014 the level your ERP actually plans against<\/li>\n<li>A <strong>28-day-ahead<\/strong> forecast horizon, evaluated via the Weighted Root Mean Squared Scaled Error (<strong>WRMSSE<\/strong>)<\/li>\n<li><strong>5,507 teams<\/strong> (<strong>7,092 participants<\/strong> from 101 countries)<\/li>\n<li>Real-world noise: <strong>intermittent demand<\/strong>, promotions, SNAP food-stamp effects, holidays, the works<\/li>\n<\/ul>\n<p>This was not a toy dataset. It was the first large, public benchmark where the messy stuff we live with every day was baked in. Whatever came out of M5 is the single best public evidence we have about what forecasting techniques actually work on retail demand.<\/p>\n<h2>The embarrassing result for deep learning<\/h2>\n<p>Here is the headline: <strong>48.4% of M5 teams beat a simple Naive forecast. Only 35.8% beat Seasonal Naive. And only 7.5% of all M5 teams beat the best pure statistical benchmark<\/strong> \u2014 a bottom-up exponential smoothing model the organisers pre-computed and handed everyone at the starting line.<\/p>\n<p>Think about that for a second. Roughly two-thirds of teams \u2014 armed with GPUs, deep nets, ensembles, feature engineering \u2014 could not do better than &quot;sales next Tuesday equals sales last Tuesday.&quot;<\/p>\n<p>The benchmarks were especially unkind to deep learning. A shallow multi-layer-perceptron neural net scored a WRMSSE of <strong>0.977<\/strong>. Seasonal Naive scored <strong>0.847<\/strong>. 
Read that again:<\/p>\n<blockquote>\n<p>The shallow neural net couldn&#8217;t beat a one-line Seasonal-Naive baseline.<\/p>\n<\/blockquote>\n<p>The winner, team <strong>YJ_STU<\/strong> (Yeonjun In, a solo competitor), delivered a WRMSSE of <strong>0.520<\/strong> \u2014 <strong>22.4% better than the best statistical benchmark (ES_bu = 0.671)<\/strong>. Impressive. But not with a transformer. Not with an RNN. With a <strong>LightGBM<\/strong> ensemble: <strong>220 gradient-boosted tree models in total, 6 per series<\/strong>. Four of the top five teams used LightGBM as their base model. The third-place team used an ensemble of 43 LSTM-based deep networks \u2014 still not a transformer, still not a foundation model. The runner-up used an N-BEATS neural net \u2014 but only as a multiplicative adjustment on top of a LightGBM base forecast, not as the base forecast itself.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_leaderboard_gap-1.png\" alt=\"Leaderboard gap \u2014 Seasonal Naive, Croston, MLP, ES_bu, and the LightGBM winner compared by WRMSSE\" \/><\/p>\n<p>Summarised:<\/p>\n<table style=\"border-collapse:collapse; width:100%; margin:1.5em 0; font-size:0.95em; line-height:1.5;\">\n<thead>\n<tr>\n<th style=\"border:1px solid #ddd; padding:10px 14px; background:#0073aa; color:#fff; font-weight:600; text-align:left;\">Method<\/th>\n<th style=\"border:1px solid #ddd; padding:10px 14px; background:#0073aa; color:#fff; font-weight:600; text-align:left;\">WRMSSE (overall)<\/th>\n<th style=\"border:1px solid #ddd; padding:10px 14px; background:#0073aa; color:#fff; font-weight:600; text-align:left;\">Verdict<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"background:#f8f9fa;\">\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">Naive<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">1.752<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; 
text-align:left;\">Non-seasonal baseline<\/td>\n<\/tr>\n<tr style=\"background:#ffffff;\">\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">Seasonal Naive<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">0.847<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">Beat by only 35.8% of teams<\/td>\n<\/tr>\n<tr style=\"background:#f8f9fa;\">\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">Croston (intermittent)<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">0.957<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">Classic intermittent method, underwhelming overall<\/td>\n<\/tr>\n<tr style=\"background:#ffffff;\">\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">MLP (shallow NN)<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">0.977<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\"><strong>Worse than Seasonal Naive<\/strong><\/td>\n<\/tr>\n<tr style=\"background:#f8f9fa;\">\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">ES_bu (best statistical)<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">0.671<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">Only 7.5% of teams beat this<\/td>\n<\/tr>\n<tr style=\"background:#ffffff;\">\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">LightGBM (YJ_STU, winner)<\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\"><strong>0.520<\/strong><\/td>\n<td style=\"border:1px solid #ddd; padding:9px 14px; text-align:left;\">22.4% better than ES_bu<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Cheaper, simpler, older methods were everywhere near the top of the leaderboard. 
That is not a minor footnote \u2014 that is the whole story.<\/p>\n<h2>Where models actually differ: the hierarchy effect<\/h2>\n<p>There&#8217;s a second layer of nuance that most hype-cycle LinkedIn posts miss entirely: <strong>the winner&#8217;s margin depends massively on how aggregated your forecast is.<\/strong><\/p>\n<p>M5 evaluates forecasts at 12 hierarchical levels, from Level 1 (total Walmart sales) all the way down to Level 12 (a single SKU in a single store). The WRMSSE at each level tells very different stories.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_error_by_hierarchy-1.png\" alt=\"Error by aggregation level \u2014 the winner crushes at the total level but barely edges out at SKU-store\" \/><\/p>\n<p>At <strong>Level 1 (total sales)<\/strong>, the winner scored <strong>0.199<\/strong> versus ES_bu&#8217;s <strong>0.426<\/strong> \u2014 a <strong>53.3% improvement<\/strong>. Crushing. Write a press release.<\/p>\n<p>At <strong>Level 12 (SKU-store, the level your MRP actually plans against)<\/strong>, the winner scored <strong>0.884<\/strong> versus ES_bu&#8217;s <strong>0.915<\/strong> \u2014 a <strong>3.39% improvement<\/strong>. Barely a rounding error. And at that level, Seasonal Naive comes in at 1.176 \u2014 not great, but not a disaster either.<\/p>\n<p>Why the huge gap? <strong>Aggregation smooths noise.<\/strong> When you add up thousands of intermittent SKU series into a total, the zeroes average out, the weekly rhythm is clean, and sophisticated models have lots of signal to work with. When you drop down to one product in one store, half the days are zero, promos hit unpredictably, and the model has almost nothing to learn from. At that point, &quot;what did this SKU do last Thursday?&quot; is roughly as good as anything else you can throw at it.<\/p>\n<p>If you only ever forecast at the total-company level, yes, invest in fancy ML. 
If you forecast at the SKU-store level \u2014 which is where <strong>every single ERP and MRP system operates<\/strong> \u2014 the gap between the best statistical benchmark and the leaderboard-winning model is a few percent.<\/p>\n<h2>Why simple models win on supply chain data<\/h2>\n<p>Once you see the hierarchy effect, the rest of M5 falls into place. Supply chain data has four features that punish complex models:<\/p>\n<ol>\n<li><strong>Short series.<\/strong> Most SKUs have a couple of years of history at best. New products have weeks. Deep learning wants tens of thousands of observations per series \u2014 you have a few hundred.<\/li>\n<li><strong>Intermittent demand.<\/strong> Most SKU-store series are zero-heavy. A neural net trained on sparse Poisson-like counts happily converges to &quot;predict the mean&quot; \u2014 which a moving average or simple exponential smoothing already does, with less code.<\/li>\n<li><strong>Structural breaks.<\/strong> Promos, assortment changes, new store openings, SNAP timing, COVID. The last 90 days often look nothing like the previous two years. Complex models over-fit the old regime.<\/li>\n<li><strong>Limited covariates.<\/strong> You have price, day-of-week, a holiday flag, maybe a promo indicator. You don&#8217;t have the 400-feature pipeline a transformer was trained on in a tech-company benchmark paper.<\/li>\n<\/ol>\n<p><img decoding=\"async\" src=\"https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_naive_vs_models_series-1.png\" alt=\"Three forecasts on one noisy SKU \u2014 Seasonal Naive, ETS, and a LightGBM-family model visibly converge on the same shape\" \/><\/p>\n<p>Look at the three forecasts above on a single simulated-but-realistic Walmart-like SKU. Seasonal Naive, ETS, and a LightGBM-family gradient-boosting model all produce <strong>fundamentally similar shapes<\/strong>: a weekly rhythm, weekends higher, base rate somewhere in the right ballpark. On a thin, noisy series, there is only so much information to extract. 
Everybody arrives at roughly the same answer. The winner on the leaderboard is the one who tuned the bias\u2013variance trade-off best \u2014 not the one with the biggest model.<\/p>\n<h2>The pragmatic forecasting stack<\/h2>\n<p>So what should you actually do on Monday morning? The M5 evidence points to a simple 1\u20132\u20133 progression:<\/p>\n<ol>\n<li><strong>Seasonal Naive is your baseline.<\/strong> Not your final model \u2014 your baseline. Every forecast you ship should beat it on your test set. If it doesn&#8217;t, you don&#8217;t have a model, you have a random number generator with extra steps.<\/li>\n<li><strong>ETS or ARIMA is your workhorse.<\/strong> Exponential smoothing and ARIMA-family models are cheap to fit, fast to retrain, easy to diagnose. In M5, the bottom-up exponential smoothing benchmark (ES_bu) came within roughly 3\u20135% of the winner at the product and SKU-store levels (Levels 10\u201312). That is usually good enough to unblock your planning process.<\/li>\n<li><strong>Gradient-boosted trees (LightGBM, XGBoost) when the data justifies it.<\/strong> Specifically: when you have enough series to share information across them (hierarchical\/global models), enough covariates to engineer meaningful features (price, promo, weather, holiday), and enough retraining cadence to keep up with drift. This is where the M5 winner lived.<\/li>\n<\/ol>\n<p>Notice what&#8217;s <strong>not<\/strong> on that list: deep learning at the SKU-store level. LSTMs, transformers, and foundation models can earn their place in supply chain \u2014 but not by default, and not because a vendor slide deck told you they would.<\/p>\n<h2>What this means for foundation models (2026 edition)<\/h2>\n<p>And now we arrive at today. Your feed is full of <strong>Time Series Foundation Models<\/strong> \u2014 TimeGPT, Chronos, Moirai, TimesFM, a new one every month. 
The pitch is identical to the one deep learning made in 2018: <em>&quot;pre-trained on millions of series, zero-shot performance, no feature engineering, just plug in your data.&quot;<\/em><\/p>\n<p>The M5 lesson is almost too obvious: <strong>the supply chain data that broke deep learning hasn&#8217;t magically healed itself for foundation models<\/strong>. The series are still short. The zeros are still zeros. Your promos are still your promos, not the promos in the pretraining corpus.<\/p>\n<p>The early empirical evidence is consistent with this. Where foundation models shine is <strong>aggregated, clean, long, pattern-rich series<\/strong> \u2014 macro, finance, energy, traffic. Where they struggle is SKU-level retail demand with intermittency and structural breaks. Sound familiar?<\/p>\n<p>This doesn&#8217;t mean foundation models are useless. Zero-shot on a brand-new SKU with no history? Potentially valuable. A starting point for fine-tuning on your company&#8217;s data? Absolutely. The failure mode isn&#8217;t &quot;foundation models don&#8217;t work.&quot; The failure mode is <strong>believing the zero-shot demo replaces the boring work of baselining, fine-tuning, and measuring.<\/strong><\/p>\n<p>Same lesson, new coat of paint.<\/p>\n<h2>The takeaway<\/h2>\n<p>If you remember one thing from this post, make it this: <strong>reject any forecasting pitch, internal or external, that doesn&#8217;t report MASE or RMSSE against a Seasonal Naive baseline.<\/strong> No baseline, no conversation. That single discipline will save your team more money and more credibility than any model you&#8217;ll buy this year.<\/p>\n<p>The M5 competition handed supply chain forecasters a gift and a warning. The gift: we finally have a public, SKU-level benchmark that reflects what we actually do. 
The warning: at the SKU-store level, the best teams in the world, with unlimited compute and months to optimise, beat an off-the-shelf exponential smoothing benchmark by about 3%, a margin that fits comfortably inside your forecast error bars.<\/p>\n<p>So next time a vendor tells you their foundation model will transform your demand plan out of the box, smile politely and ask one question: <em>&quot;What&#8217;s the WRMSSE against Seasonal Naive on my data?&quot;<\/em> If they can&#8217;t answer, you have your answer.<\/p>\n<h2>References<\/h2>\n<ul>\n<li>Makridakis, S., Spiliotis, E., &amp; Assimakopoulos, V. (2022). The M5 competition: Background, organization, and implementation. <em>International Journal of Forecasting<\/em>, 38(4), 1325\u20131336. <a href=\"https:\/\/doi.org\/10.1016\/j.ijforecast.2021.07.007\">DOI: 10.1016\/j.ijforecast.2021.07.007<\/a><\/li>\n<li>Makridakis, S., Spiliotis, E., &amp; Assimakopoulos, V. (2022). M5 accuracy competition: Results, findings, and conclusions. <em>International Journal of Forecasting<\/em>, 38(4), 1346\u20131364. <a href=\"https:\/\/doi.org\/10.1016\/j.ijforecast.2021.11.013\">DOI: 10.1016\/j.ijforecast.2021.11.013<\/a><\/li>\n<li>Makridakis, S., Spiliotis, E., Assimakopoulos, V., Chen, Z., Gaba, A., Tsetlin, I., &amp; Winkler, R. L. (2022). The M5 uncertainty competition: Results, findings, and conclusions. <em>International Journal of Forecasting<\/em>, 38(4), 1365\u20131385.<\/li>\n<li>Hyndman, R. J., &amp; Athanasopoulos, G. (2021). <em>Forecasting: Principles and Practice<\/em> (3rd ed.). OTexts. 
<a href=\"https:\/\/otexts.com\/fpp3\/\">otexts.com\/fpp3<\/a><\/li>\n<\/ul>\n<details>\n<summary><strong>Show R Code<\/strong><\/summary>\n<pre><code class=\"language-r\"># =============================================================================\n# generate_m5_images.R\n# -----------------------------------------------------------------------------\n# Generates the 3 PNG images for the blog post:\n#   &quot;The M5 Lesson: Why Simple Models Still Beat Fancy Ones in SC Forecasting&quot;\n#\n# Outputs (all 800x500 @ dpi=100, bg=&quot;white&quot;):\n#   1. https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_leaderboard_gap-1.png\n#   2. https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_naive_vs_models_series-1.png\n#   3. https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_error_by_hierarchy-1.png\n#\n# Data provenance\n# ---------------\n# Charts 1 and 3 use the REAL published numbers from:\n#   Makridakis, Spiliotis, Assimakopoulos (2022).\n#   &quot;M5 accuracy competition: Results, findings, and conclusions.&quot;\n#   International Journal of Forecasting, 38(4), 1346-1364.\n#   DOI: https:\/\/doi.org\/10.1016\/j.ijforecast.2021.11.013\n#   - Table B (Appendix B): WRMSSE of the 24 benchmarks, overall and per level.\n#   - Table 3: WRMSSE of the top 50 teams, overall and per level.\n#   - Table 5: Winner (YJ_STU) improvement over Croston per level.\n#\n# Chart 2 is a SIMULATED reproduction (not real Walmart data \u2014 those files\n# are not on disk). 
The simulated SKU has the intermittent + weekly seasonal\n# + promo-spike character described in the M5 paper (Section 3) and is used\n# purely to show what the three forecast shapes (Seasonal Naive, ETS,\n# Gradient Boosted Trees \/ LightGBM) look like side-by-side.\n# =============================================================================\n\n# --- Packages ---------------------------------------------------------------\nreq &lt;- function(pkg) {\n  if (!requireNamespace(pkg, quietly = TRUE)) {\n    install.packages(pkg, repos = &quot;https:\/\/cloud.r-project.org\/&quot;)\n  }\n}\nfor (p in c(&quot;ggplot2&quot;, &quot;dplyr&quot;, &quot;tidyr&quot;, &quot;scales&quot;, &quot;patchwork&quot;,\n            &quot;tsibble&quot;, &quot;fable&quot;, &quot;fabletools&quot;, &quot;feasts&quot;, &quot;lightgbm&quot;)) {\n  req(p)\n}\n\nsuppressPackageStartupMessages({\n  library(ggplot2)\n  library(dplyr)\n  library(tidyr)\n  library(scales)\n  library(patchwork)\n  library(tsibble)\n  library(fable)\n  library(fabletools)\n  library(feasts)\n  library(lightgbm)\n})\n\nsource(&quot;Scripts\/theme_inphronesys.R&quot;)\n\nset.seed(42)\n\n# =============================================================================\n# CHART 1 \u2014 Leaderboard gap: simple methods vs winner\n# =============================================================================\n# Source: Makridakis et al. 
2022 Table B (overall column, &quot;Average&quot;)\n#         plus Table 3 row 1 (YJ_STU winner overall 0.520).\n# All values are WRMSSE (lower = better).\n\nleaderboard &lt;- tibble::tribble(\n  ~method,                         ~wrmsse,  ~role,\n  &quot;Naive&quot;,                         1.752,    &quot;baseline&quot;,\n  &quot;Seasonal Naive&quot;,                0.847,    &quot;baseline&quot;,\n  &quot;Croston (intermittent)&quot;,        0.957,    &quot;statistical&quot;,\n  &quot;MLP (shallow neural net)&quot;,      0.977,    &quot;ml_weak&quot;,\n  &quot;ES_bu (best statistical)&quot;,      0.671,    &quot;statistical&quot;,\n  &quot;LightGBM (M5 winner, YJ_STU)&quot;,  0.520,    &quot;winner&quot;\n)\nleaderboard$method &lt;- factor(leaderboard$method, levels = leaderboard$method)\n\nrole_colors &lt;- c(\n  baseline     = iph_colors$lightgrey,\n  statistical  = iph_colors$grey,\n  ml_weak      = iph_colors$orange,\n  winner       = iph_colors$blue\n)\n\np1 &lt;- ggplot(leaderboard, aes(x = method, y = wrmsse, fill = role)) +\n  geom_col(width = 0.7) +\n  geom_text(aes(label = sprintf(&quot;%.3f&quot;, wrmsse)),\n            vjust = -0.4, size = 3.6, color = iph_colors$dark,\n            family = &quot;Inter&quot;, fontface = &quot;bold&quot;) +\n  geom_hline(yintercept = 0.671, linetype = &quot;dashed&quot;,\n             color = iph_colors$grey, linewidth = 0.4) +\n  annotate(&quot;text&quot;, x = 1, y = 0.671, vjust = -0.4, hjust = 0,\n           label = &quot;ES_bu benchmark (0.671)&quot;,\n           color = iph_colors$grey, size = 3.2, family = &quot;Inter&quot;) +\n  annotate(&quot;text&quot;, x = 6, y = 0.52, hjust = 0.5, vjust = 1.8,\n           label = &quot;22.4% better\\nthan ES_bu&quot;,\n           color = &quot;white&quot;, size = 3.1, family = &quot;Inter&quot;, fontface = &quot;bold&quot;) +\n  scale_fill_manual(values = role_colors, guide = &quot;none&quot;) +\n  scale_y_continuous(limits = c(0, 2.05),\n                     breaks = seq(0, 
2, 0.5),\n                     expand = expansion(mult = c(0, 0.02))) +\n  labs(\n    title    = &quot;The M5 leaderboard gap is smaller than the hype&quot;,\n    subtitle = &quot;WRMSSE on 42,840 hierarchical Walmart series, 28-day horizon. Lower = better.&quot;,\n    x = NULL, y = &quot;WRMSSE (average across 12 levels)&quot;,\n    caption = paste(\n      &quot;Source: Makridakis, Spiliotis &amp; Assimakopoulos (2022),&quot;,\n      &quot;IJF 38(4), Table B (benchmarks) and Table 3 (winner). DOI: 10.1016\/j.ijforecast.2021.11.013&quot;\n    )\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;) +\n  theme(axis.text.x = element_text(angle = 20, hjust = 1, size = 10))\n\n# ggsave() writes to a local file path; it cannot write to an https URL.\nggsave(&quot;m5_leaderboard_gap-1.png&quot;, p1,\n       width = 8, height = 5, dpi = 100, bg = &quot;white&quot;)\n\n\n# =============================================================================\n# CHART 2 \u2014 One SKU series with 3 forecasts overlaid\n# =============================================================================\n# Simulated intermittent demand, 180 days of history, 28-day forecast.\n# Weekly seasonality (weekends higher), one promo spike around day 135.\n# Three forecasts: Seasonal Naive, ETS, LightGBM.\n\nn_hist    &lt;- 180\nn_fcst    &lt;- 28\nstart_dt  &lt;- as.Date(&quot;2024-10-01&quot;)\nall_dates &lt;- seq.Date(start_dt, by = &quot;day&quot;, length.out = n_hist + n_fcst)\ntrain_dt  &lt;- all_dates[1:n_hist]\ntest_dt   &lt;- all_dates[(n_hist + 1):(n_hist + n_fcst)]\n\ndow      &lt;- as.POSIXlt(train_dt)$wday      # 0=Sun..6=Sat\nweekend  &lt;- as.integer(dow %in% c(0, 6))\npromo    &lt;- as.integer(train_dt &gt;= as.Date(&quot;2025-02-10&quot;) &amp;\n                         train_dt &lt;= as.Date(&quot;2025-02-14&quot;))\n\n# Weekday base rates (Walmart-like: peak Sat\/Sun)\nbase_rate &lt;- c(1.8, 0.6, 0.5, 0.5, 0.8, 1.2, 2.4)[dow + 1]\nlambda    &lt;- base_rate * (1 + 5 * promo)\nactual_train &lt;- rpois(n_hist, 
lambda)\n\n# Held-out &quot;actuals&quot; for the 28-day test window (for visual reference only).\ndow_test     &lt;- as.POSIXlt(test_dt)$wday\nbase_test    &lt;- c(1.8, 0.6, 0.5, 0.5, 0.8, 1.2, 2.4)[dow_test + 1]\nactual_test  &lt;- rpois(n_fcst, base_test)\n\n# Build training tsibble\ntsb &lt;- tsibble::tsibble(\n  date = train_dt,\n  sales = actual_train,\n  index = date\n)\n\n# --- Seasonal Naive (weekly) ---\nfit_snaive &lt;- tsb |&gt; fabletools::model(snaive = fable::SNAIVE(sales ~ lag(&quot;week&quot;)))\nfc_snaive  &lt;- fabletools::forecast(fit_snaive, h = n_fcst) |&gt;\n  as_tibble() |&gt; mutate(model = &quot;Seasonal Naive&quot;,\n                        point = .mean) |&gt;\n  select(date, model, point)\n\n# --- ETS ---\nfit_ets &lt;- tsb |&gt; fabletools::model(ets = fable::ETS(sales))\nfc_ets  &lt;- fabletools::forecast(fit_ets, h = n_fcst) |&gt;\n  as_tibble() |&gt; mutate(model = &quot;ETS&quot;,\n                        point = .mean) |&gt;\n  select(date, model, point)\n\n# --- LightGBM with simple lag features ---\nmake_features &lt;- function(dates, series) {\n  tibble::tibble(\n    date    = dates,\n    sales   = series,\n    dow     = as.POSIXlt(dates)$wday,\n    weekend = as.integer(as.POSIXlt(dates)$wday %in% c(0, 6)),\n    lag7    = dplyr::lag(series, 7),\n    lag14   = dplyr::lag(series, 14),\n    lag28   = dplyr::lag(series, 28),\n    # Shift the trailing 7-day mean by one day so it only uses sales up to t-1.\n    # This matches the recursive forecast loop and keeps the target out of its own feature.\n    ma7     = dplyr::lag(zoo::rollmean(series, k = 7, fill = NA, align = &quot;right&quot;), 1)\n  )\n}\n# make_features() calls zoo::rollmean(); install zoo if it is missing\nif (!requireNamespace(&quot;zoo&quot;, quietly = TRUE)) install.packages(&quot;zoo&quot;)\nlibrary(zoo)\n\nfeat_train &lt;- make_features(train_dt, actual_train) |&gt;\n  tidyr::drop_na()\n\nX_train &lt;- as.matrix(feat_train[, c(&quot;dow&quot;, &quot;weekend&quot;, &quot;lag7&quot;, &quot;lag14&quot;, &quot;lag28&quot;, &quot;ma7&quot;)])\ny_train &lt;- feat_train$sales\n\ndtrain &lt;- lightgbm::lgb.Dataset(data = X_train, label = y_train)\nlgb_params 
&lt;- list(\n  objective       = &quot;regression&quot;,\n  metric          = &quot;rmse&quot;,\n  learning_rate   = 0.05,\n  num_leaves      = 15,\n  min_data_in_leaf = 5,\n  verbose         = -1\n)\nlgb_fit &lt;- lightgbm::lgb.train(\n  params = lgb_params,\n  data   = dtrain,\n  nrounds = 400,\n  verbose = -1\n)\n\n# Recursive multi-step forecast: each prediction is appended to the series\n# so that later steps can use it as a lag feature.\nfull_series &lt;- actual_train\nlgb_pred &lt;- numeric(n_fcst)\nfor (i in seq_len(n_fcst)) {\n  d        &lt;- test_dt[i]\n  dow_i    &lt;- as.POSIXlt(d)$wday\n  wknd_i   &lt;- as.integer(dow_i %in% c(0, 6))\n  cur_idx  &lt;- length(full_series)             # last known day; the target is cur_idx + 1\n  lag7_i   &lt;- full_series[cur_idx - 7 + 1]\n  lag14_i  &lt;- full_series[cur_idx - 14 + 1]\n  lag28_i  &lt;- full_series[cur_idx - 28 + 1]\n  ma7_i    &lt;- mean(full_series[(cur_idx - 6):cur_idx])\n  x_new    &lt;- matrix(c(dow_i, wknd_i, lag7_i, lag14_i, lag28_i, ma7_i), nrow = 1)\n  colnames(x_new) &lt;- c(&quot;dow&quot;, &quot;weekend&quot;, &quot;lag7&quot;, &quot;lag14&quot;, &quot;lag28&quot;, &quot;ma7&quot;)\n  yhat     &lt;- predict(lgb_fit, x_new)\n  yhat     &lt;- pmax(yhat, 0)\n  lgb_pred[i] &lt;- yhat\n  full_series &lt;- c(full_series, yhat)\n}\nfc_lgb &lt;- tibble::tibble(\n  date  = test_dt,\n  model = &quot;LightGBM (M5 winner family)&quot;,\n  point = lgb_pred\n)\n\n# --- Combine for plotting ---\nhistory_df &lt;- tibble::tibble(\n  date = c(train_dt, test_dt),\n  sales = c(actual_train, actual_test),\n  segment = c(rep(&quot;History&quot;, n_hist), rep(&quot;Held-out actuals&quot;, n_fcst))\n)\n\nforecasts_df &lt;- bind_rows(fc_snaive, fc_ets, fc_lgb)\nforecasts_df$model &lt;- factor(forecasts_df$model,\n  levels = c(&quot;Seasonal Naive&quot;, &quot;ETS&quot;, &quot;LightGBM (M5 winner family)&quot;)\n)\n\n# Show only the last six weeks of history to keep the forecast zone readable\nzoom_start &lt;- max(train_dt) - 42\n\np2 &lt;- ggplot() +\n  geom_line(data = filter(history_df, date &gt;= zoom_start, segment == &quot;History&quot;),\n            aes(x = 
date, y = sales),\n            color = iph_colors$grey, linewidth = 0.5) +\n  geom_point(data = filter(history_df, date &gt;= zoom_start, segment == &quot;History&quot;),\n             aes(x = date, y = sales),\n             color = iph_colors$grey, size = 1.2) +\n  geom_line(data = filter(history_df, segment == &quot;Held-out actuals&quot;),\n            aes(x = date, y = sales),\n            color = iph_colors$dark, linewidth = 0.4, linetype = &quot;dotted&quot;) +\n  geom_point(data = filter(history_df, segment == &quot;Held-out actuals&quot;),\n             aes(x = date, y = sales),\n             color = iph_colors$dark, size = 1.4, shape = 1) +\n  geom_line(data = forecasts_df,\n            aes(x = date, y = point, color = model),\n            linewidth = 0.9) +\n  geom_vline(xintercept = as.numeric(max(train_dt)),\n             linetype = &quot;dashed&quot;, color = iph_colors$grey, linewidth = 0.4) +\n  annotate(&quot;text&quot;, x = max(train_dt), y = Inf,\n           label = &quot; forecast horizon \\u2192&quot;, hjust = 0, vjust = 1.6,\n           family = &quot;Inter&quot;, color = iph_colors$grey, size = 3.2) +\n  scale_color_manual(values = c(\n    &quot;Seasonal Naive&quot;                 = iph_colors$orange,\n    &quot;ETS&quot;                            = iph_colors$green,\n    &quot;LightGBM (M5 winner family)&quot;    = iph_colors$blue\n  )) +\n  scale_x_date(date_breaks = &quot;2 weeks&quot;, date_labels = &quot;%b %d&quot;) +\n  scale_y_continuous(breaks = scales::pretty_breaks(5),\n                     expand = expansion(mult = c(0, 0.05))) +\n  labs(\n    title    = &quot;Three forecasts, one noisy SKU: which one wins?&quot;,\n    subtitle = &quot;Simulated Walmart-like intermittent daily demand \\u2014 28-day forecast after 180 days of history.&quot;,\n    x = NULL,\n    y = &quot;Daily units sold&quot;,\n    color = NULL,\n    caption = paste(\n      &quot;Simulated SKU (set.seed(42); weekly seasonality + promo spike).&quot;,\n      
&quot;Shapes match the M5 paper's Section 3 description.&quot;,\n      &quot;On real M5 data, the LightGBM winner's level-12 WRMSSE is ~3% below ES_bu (0.884 vs 0.915).&quot;\n    )\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;)\n\n# ggsave() writes to a local file path; it cannot write to an https URL.\nggsave(&quot;m5_naive_vs_models_series-1.png&quot;, p2,\n       width = 8, height = 5, dpi = 100, bg = &quot;white&quot;)\n\n\n# =============================================================================\n# CHART 3 \u2014 Error by hierarchy level (the real story of M5)\n# =============================================================================\n# Source:\n#   Naive, sNaive, ES_bu, MLP rows  -&gt; Table B (Appendix B).\n#   YJ_STU winner row               -&gt; Table 3, row 1.\n# Lower WRMSSE is better.\n\nlvl_data &lt;- tibble::tribble(\n  ~level, ~Naive, ~sNaive, ~ES_bu, ~MLP,   ~Winner,\n  1,      1.967,  0.560,   0.426,  0.892,  0.199,\n  2,      1.904,  0.673,   0.514,  0.942,  0.310,\n  3,      1.880,  0.718,   0.580,  0.974,  0.400,\n  4,      1.947,  0.623,   0.478,  0.910,  0.277,\n  5,      1.914,  0.708,   0.557,  0.972,  0.365,\n  6,      1.881,  0.760,   0.577,  0.965,  0.390,\n  7,      1.878,  0.829,   0.654,  1.016,  0.474,\n  8,      1.798,  0.801,   0.643,  0.984,  0.480,\n  9,      1.764,  0.888,   0.728,  1.026,  0.573,\n  10,     1.479,  1.223,   1.012,  1.084,  0.966,\n  11,     1.360,  1.205,   0.969,  1.014,  0.929,\n  12,     1.253,  1.176,   0.915,  0.943,  0.884\n)\n\nlevel_labels &lt;- c(\n  &quot;1\\n(Total)&quot;, &quot;2\\n(State)&quot;, &quot;3\\n(Store)&quot;,\n  &quot;4\\n(Category)&quot;, &quot;5\\n(Dept)&quot;, &quot;6\\n(State-Cat)&quot;,\n  &quot;7\\n(State-Dept)&quot;, &quot;8\\n(Store-Cat)&quot;, &quot;9\\n(Store-Dept)&quot;,\n  &quot;10\\n(Product)&quot;, &quot;11\\n(Prod-State)&quot;, &quot;12\\n(SKU-Store)&quot;\n)\n\nlvl_long &lt;- lvl_data |&gt;\n  tidyr::pivot_longer(-level, names_to = &quot;method&quot;, values_to = 
&quot;wrmsse&quot;) |&gt;\n  dplyr::filter(method != &quot;Naive&quot;) |&gt;    # Naive off-scale; omit to keep resolution\n  dplyr::mutate(method = factor(method,\n    levels = c(&quot;sNaive&quot;, &quot;MLP&quot;, &quot;ES_bu&quot;, &quot;Winner&quot;),\n    labels = c(&quot;Seasonal Naive&quot;, &quot;MLP (shallow NN)&quot;,\n               &quot;ES_bu (best statistical)&quot;, &quot;LightGBM (M5 winner)&quot;)\n  ))\n\np3 &lt;- ggplot(lvl_long, aes(x = level, y = wrmsse,\n                           color = method, linewidth = method)) +\n  geom_line(alpha = 0.95) +\n  geom_point(size = 2) +\n  annotate(&quot;rect&quot;, xmin = 9.5, xmax = 12.5, ymin = -Inf, ymax = Inf,\n           alpha = 0.08, fill = iph_colors$red) +\n  annotate(&quot;text&quot;, x = 11, y = 0.30,\n           label = &quot;SKU zone:\\nwinner's edge\\nnearly vanishes&quot;,\n           color = iph_colors$red, size = 3.3, family = &quot;Inter&quot;,\n           fontface = &quot;bold&quot;, lineheight = 0.9) +\n  annotate(&quot;text&quot;, x = 1, y = 0.08,\n           label = &quot;Aggregate zone:\\nwinner crushes&quot;,\n           color = iph_colors$blue, size = 3.3, family = &quot;Inter&quot;,\n           fontface = &quot;bold&quot;, lineheight = 0.9, hjust = 0) +\n  scale_color_manual(values = c(\n    &quot;Seasonal Naive&quot;          = iph_colors$grey,\n    &quot;MLP (shallow NN)&quot;        = iph_colors$orange,\n    &quot;ES_bu (best statistical)&quot; = iph_colors$navy,\n    &quot;LightGBM (M5 winner)&quot;    = iph_colors$blue\n  )) +\n  scale_linewidth_manual(values = c(\n    &quot;Seasonal Naive&quot;          = 0.6,\n    &quot;MLP (shallow NN)&quot;        = 0.6,\n    &quot;ES_bu (best statistical)&quot; = 0.8,\n    &quot;LightGBM (M5 winner)&quot;    = 1.2\n  ), guide = &quot;none&quot;) +\n  scale_x_continuous(breaks = 1:12, labels = level_labels) +\n  scale_y_continuous(breaks = seq(0, 1.4, 0.2),\n                     expand = expansion(mult = c(0, 0.05))) +\n  labs(\n    
title    = &quot;Where the M5 winner actually wins \\u2014 and where it doesn't&quot;,\n    subtitle = &quot;WRMSSE at each of the 12 M5 aggregation levels. Lower = better. Naive omitted (off-scale).&quot;,\n    x = &quot;Aggregation level (coarse \\u2192 fine)&quot;,\n    y = &quot;WRMSSE&quot;,\n    color = NULL,\n    caption = paste(\n      &quot;Source: Makridakis et al. (2022), IJF 38(4).&quot;,\n      &quot;Table B rows (sNaive, MLP, ES_bu) and Table 3 row 1 (YJ_STU LightGBM winner).&quot;,\n      &quot;DOI: 10.1016\/j.ijforecast.2021.07.007&quot;\n    )\n  ) +\n  theme_inphronesys(grid = &quot;y&quot;) +\n  theme(axis.text.x = element_text(size = 8, lineheight = 0.9))\n\nggsave(&quot;m5_error_by_hierarchy-1.png&quot;, p3,\n       width = 8, height = 5, dpi = 100, bg = &quot;white&quot;)\n\n\n# =============================================================================\n# Confirmation\n# =============================================================================\ncat(&quot;\\nGenerated:\\n&quot;,\n    &quot; https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_leaderboard_gap-1.png\\n&quot;,\n    &quot; https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_naive_vs_models_series-1.png\\n&quot;,\n    &quot; https:\/\/inphronesys.com\/wp-content\/uploads\/2026\/04\/m5_error_by_hierarchy-1.png\\n&quot;)\n<\/code><\/pre>\n<\/details>\n","protected":false},"excerpt":{"rendered":"<p>The 2020 M5 competition taught a lesson the forecasting world keeps forgetting: on real supply chain data, simple models win more often than you&#8217;d think. 
Here&#8217;s what the Walmart SKU benchmark actually showed \u2014 and why it matters for today&#8217;s Time Series Foundation Model hype.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[206,20],"tags":[125,8,287,286,283,284,285],"class_list":["post-1874","post","type-post","status-publish","format-standard","hentry","category-forecasting","category-supply-chain","tag-ets","tag-forecasting","tag-foundation-models","tag-lightgbm","tag-m5","tag-machine-learning-2","tag-naive-forecasting"],"_links":{"self":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1874","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1874"}],"version-history":[{"count":1,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1874\/revisions"}],"predecessor-version":[{"id":1875,"href":"https:\/\/inphronesys.com\/index.php?rest_route=\/wp\/v2\/posts\/1874\/revisions\/1875"}],"wp:attachment":[{"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1874"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1874"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/inphronesys.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1874"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}