Data is the new fool’s gold, especially in the supply chain.

Let’s dispel a few notions right away.

1) Data is the new gold.
2) Garbage in, garbage out.

These are both categorically false.

Let’s tackle these one by one.

Data is the new gold:

In some cases that’s right, but in reality, the amount of data we produce hourly outpaces the amount of information the entirety of humanity had produced before the year 2000. A simple look at supply and demand curves tells us that when supply explodes like this, each additional piece of data contributes only incremental value. But supply and demand only account for individual items, not an entire marketplace; it’s a one-dimensional measure. Let’s look a little deeper. The amount of data being generated now is so large that no single computer can download it fast enough, and certainly no computer can ingest and train on all of it in real time. This is why we have models.

But anyone who’s studied even the scantest amount of statistics will realize that almost all data (99.9999999999999+%) is literal noise. It’s not that there is no useful data; of course there is an enormous amount of useful data. But it occupies a smaller percentage of “data” than it ever has.

Let’s go through a small thought experiment…

If data were actually “worth its weight in gold,” who would benefit most from this, and how could it be leveraged?

The easiest answer is anyone who operates in a marketplace. I’m thinking stock traders, insurance brokers, and, yes, freight brokers. If all information were worthwhile and one could collect all of it, there would be no more market. Here’s the easy counterexample. Forecasts and predictions ought to be worth something, but as numerous books (“A Random Walk Down Wall Street”, “The Black Swan”, “Thinking Fast and Slow”, “Liar’s Poker”, “The Signal and the Noise”, “Thinking in Bets”, “The Misbehavior of Markets”) have pointed out, the vast majority of forecasts and predictions are rubbish (no offense meant to rubbish). Only the smallest iota of realized data is worth anything. Data mining therefore becomes mining for gold, but if we study the history of gold mining, we will find that most of what gets dug up is plain dirt and rocks.

Now let’s turn to garbage in, garbage out. As we have seen, almost all data and models are garbage (again, I apologize to garbage, as I don’t mean to drag it all the way down to the level of modern machine learning models or the data that feeds them). Should we therefore expect that nearly all results are garbage? Empirically, we can observe otherwise: there are times when garbage goes in and spectacularly good results come out. The correct phrase is “noise in, noise out.” With so much noise in the world, one can guarantee almost surely (in the rigorous academic definition of almost surely) that there will be very many spurious correlations. Correlations, as we all know, do not imply causation, but sometimes even spurious correlations land on spectacularly accurate models (by nearly any measure one can conceive).
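
To make the “noise in, noise out” point concrete, here is a minimal sketch in Python. It is purely illustrative: the sample and feature counts are arbitrary, and every series is random noise by construction.

```python
import numpy as np

# Pure noise: one random "target" series and many random "feature" series.
rng = np.random.default_rng(42)
n_samples, n_features = 100, 10_000

target = rng.standard_normal(n_samples)
features = rng.standard_normal((n_samples, n_features))

# Correlation of each noise feature with the noise target.
correlations = np.array([np.corrcoef(features[:, j], target)[0, 1]
                         for j in range(n_features)])

# With enough noise series, some will look impressively "predictive".
best = np.argmax(np.abs(correlations))
print(f"strongest |correlation| found: {abs(correlations[best]):.2f}")
print(f"features with |correlation| > 0.3: {(np.abs(correlations) > 0.3).sum()}")
```

On a typical run, dozens of these features clear a correlation of 0.3 with the equally random target, and the “best” one often approaches 0.4, despite there being zero signal anywhere. A naive pipeline would happily keep the strongest ones.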

So, how then should we deal with this in the supply chain? The amount of data and money floating around logistics is simply mind-boggling. But as we’ve just laid out, most of that data will neither be useful nor even capturable. If you’ve ever met a freight broker, a truck driver, a train conductor, or a ship captain, you know they have a deep level of expertise and can feel out situations better than computers can. Why is this? Computers without a doubt have more data, and even more useful data, so they should be able to capture the proper signals better, right?
Well, we can simply use our eyes, ears, and wallets to realize this doesn’t happen. What gives?

We as humans have very special computers in our heads which do an extraordinary job of filtering out noise. Experts are people who have learned to filter out the most noise. In particular, humans (and especially women) have shown themselves to be less fooled by spurious correlations than computers. Let’s look at something as simple as linear regression. If we have heavily correlated pieces of data, a linear regression cannot tell them apart (a small sketch of this appears after this paragraph). This is where humans excel. Generally speaking, experts in a field will know which signal to observe and when.
Is fuel price affecting margins today, or is it wildfires? What is the most likely event to cause a delay in shipments today?
Experts in the field will be able to guess the correct answers quickly, with only the tiniest fraction of the data available to a machine learning algorithm.
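
Here is a minimal sketch of the linear regression point above (Python, with made-up data): two nearly identical predictors, where only one actually drives the outcome, and ordinary least squares splits the credit between them essentially arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two heavily correlated predictors: x2 is x1 plus a tiny bit of noise.
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)
y = 3.0 * x1 + rng.standard_normal(n)          # only x1 actually drives y

X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares

# The split between the two coefficients is essentially arbitrary,
# but their sum recovers roughly 3.
print("coefficients:", coef)
print("sum of coefficients:", coef.sum())
```

The regression, given only the numbers, cannot say which of the two “signals” is the real one. A domain expert often can.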

To be clear, I’m not advocating forgoing data entirely. As we’ve already discussed, the amount of useful data is far greater than it ever has been. The trick is to figure out how little of it one can get away with. This will increase speed, reduce computing costs (tremendously), and improve the consistency of models by orders of magnitude, all without affecting accuracy, precision, F1 score, area under the ROC curve, or any other metric you may like. Use as much data as is useful, but not one point more.
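
As an illustrative sketch of “use as much data as is useful, but not one point more” (Python with scikit-learn; the dataset is synthetic and the sample, feature, and selection counts are placeholders), one can bury a handful of informative columns in noise columns, then compare a model trained on everything against a model trained on a small selected subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data: 10 informative features buried among 990 noise features.
X, y = make_classification(n_samples=2000, n_features=1000,
                           n_informative=10, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model trained on all 1000 columns.
full = LogisticRegression(max_iter=2000).fit(X_train, y_train)

# Model trained on the 10 columns a simple univariate filter keeps.
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)
small = LogisticRegression(max_iter=2000).fit(selector.transform(X_train), y_train)

print("accuracy, all features:", accuracy_score(y_test, full.predict(X_test)))
print("accuracy, 10 features: ",
      accuracy_score(y_test, small.predict(selector.transform(X_test))))
```

In runs like this, the stripped-down model typically lands within a whisker of the full one while training far faster and behaving far more consistently, which is exactly the trade the paragraph above is describing.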
