Machine Learning (ML) and Data Science have received a lot of press over the last 5-10 years. Algorithms with origins in the 1940s and 1950s have become very effective due to exponential increases in data availability and processing power. Some companies have leveraged these advances to achieve fantastic success, while many others have struggled to emulate that success and get machine learning to work for them. So what is holding up this much-hyped technology?
Most of the companies that have been able to successfully leverage ML are digital native companies; their core products are computer programs. This means that collecting and storing large amounts of accurate data and experimenting in rigidly defined environments (like a website) are second nature. More often than not, companies that are not digital natives are attempting to adopt ML before having an ecosystem in place that can support it.
Here are some common issues holding ML projects back:
If companies aren’t doing ML, they want to be doing it, right? It is sometimes presumed to be the secret sauce that can magically accomplish anything, and that hype can work against prospective projects. ML is immensely effective in the right scenarios, and establishing upfront with a data scientist whether your project is one of those is an important and surprisingly overlooked step. I’ve seen many ML projects launched before anyone qualified in ML was consulted, and data scientists hired to solve a problem with ML without anyone first checking whether ML was applicable to it. Although exploring a new technology is valuable and important, poorly considered ML FOMO, or blindly forging ahead under the banner of Agile, puts the cart before the horse.
Solution: A data scientist should be part of the scoping conversation as early as possible to identify whether there is an opportunity for ML to make an impact. If a project sounds like a good candidate for automation, a data scientist can tell you the best way to solve that problem. If the data scientist says it’s not feasible, it’s worth addressing why, or looking for an alternative solution.
As is often said: ‘garbage in, garbage out’. Having big data isn’t enough; it has to be good data for an ML algorithm to be effective. If large amounts of data are missing or incorrect, the performance of any ML algorithm will suffer. Collecting data and getting it into a usable state is a vastly underestimated process, both by data scientists and by project leaders. Between the initial point of contact for an ML project and the development of a production model, several months can go by before the Extract-Transform-Load (ETL) process is functioning well enough to support a production-level model. If the data turns out to be poor quality at that point, a lot of energy has already been wasted.
Solution: Have a data quality assessment done after a successful POC, but before going to development. This can indicate the performance limits imposed by the data quality, and highlight opportunities to improve how that data is collected and stored. At the very least, some comment should be made about the risks the data quality presents before development begins.
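A basic data quality assessment can start with something as simple as measuring how much of each required field is actually populated. The sketch below is a minimal illustration; the field names and toy records are invented for the example, and a real audit would also check value ranges, duplicates, and consistency across sources.

```python
# Minimal sketch of a pre-development data quality audit: report the
# fraction of records missing each required field. Field names and
# records are hypothetical examples.

def audit_quality(rows, required_fields):
    """Return {field: fraction of records where the field is missing}."""
    total = len(rows)
    report = {}
    for field in required_fields:
        missing = sum(1 for r in rows if r.get(field) in (None, ""))
        report[field] = missing / total
    return report

# Toy records standing in for an ETL extract.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": ""},
]

report = audit_quality(rows, ["age", "income"])
for field, rate in report.items():
    print(f"{field}: {rate:.0%} missing")
```

Running a report like this before development makes the conversation about performance limits concrete: a model cannot be expected to be reliable on a field that is, say, a third empty.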
There are a few reasons why ML triggers loss aversion for stakeholders.
Whatever process your project might be aiming to enhance or replace, its error will be on full display. This is intentional, as any good data scientist wants as accurate an appraisal of their model’s performance as possible; that error will not be zero, because machine learning models are probabilistic. An initial model might be wrong 15% or 20% of the time, and accepting that many mistakes is a hard sell, even if the model would still be usable.
However, just because you can’t see the mistakes in your current process so explicitly doesn’t mean they don’t exist. Machines are expected to be wholly reliable while humans are not, perhaps because dealing with an unforeseen failure is easier than accepting a guaranteed rate of failure. The result is slow adoption of less-than-perfect models.
Solution: Allow for your existing process to be quantified as part of your project, as a benchmark for machine learning to beat. Agreeing on metrics and KPIs gives everyone involved a clear vision of what the project is trying to achieve, and gives all stakeholders the most confidence that an algorithm is genuinely competitive. If it really isn’t possible to say up-front what success looks like, the only recourse is real-world experimentation to collect more data. This might be expensive, but it’s better than guesswork.
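Quantifying the existing process and the candidate model against the same agreed metric can be very simple in code. The sketch below uses accuracy as the shared metric; the decision records are invented for illustration, and in practice the metric would be whatever KPI the stakeholders agreed on.

```python
# Hedged sketch: score the current (e.g. manual) process and a candidate
# model against the same agreed metric, here plain accuracy. All the
# labels below are invented for illustration.

def accuracy(predictions, actuals):
    """Fraction of decisions that matched the true outcome."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

actuals          = [1, 0, 1, 1, 0, 1, 0, 0]  # true outcomes
manual_decisions = [1, 0, 0, 1, 0, 1, 1, 0]  # the current process
model_decisions  = [1, 0, 1, 1, 0, 1, 1, 0]  # the candidate model

baseline = accuracy(manual_decisions, actuals)
candidate = accuracy(model_decisions, actuals)
print(f"current process: {baseline:.0%}, candidate model: {candidate:.0%}")
```

Framing the comparison this way shifts the conversation from "the model is wrong 20% of the time" to "the model is wrong less often than what we do today", which is the question that actually matters for adoption.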
In summary, scoping projects carefully to see if they have the elements in place for success is the best approach for ML projects. Consult early, check the data, agree on the metrics, understand the end user. Even when all the necessary precautions are taken, the end goal may still not be achieved. But doing your due diligence can certainly help you dodge the pitfalls.