In thinking about how we can automate some of the work of machine learning, as well as how to make it more accessible to people with a wider variety of backgrounds, it’s first necessary to ask, what is it that machine learning practitioners do? Any solution to the shortage of machine learning expertise requires answering this question: whether it’s so we know what skills to teach, what tools to build, or what processes to automate.
Building Data Products is Complex Work
While many academic machine learning sources focus almost exclusively on predictive modeling, that is just one piece of what machine learning practitioners do in the wild. The processes of appropriately framing a business problem, collecting and cleaning the data, building the model, implementing the result, and then monitoring for changes are interconnected in many ways that often make it hard to silo off just a single piece (without at least being aware of what the other pieces entail). As Jeremy Howard et al. wrote in “Designing great data products”: “Great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing.”
A team from Google, D. Sculley et al., wrote the classic “Machine Learning: The High-Interest Credit Card of Technical Debt” about the code complexity and technical debt often created when using machine learning in practice. The authors identify a number of system-level interactions, risks, and anti-patterns, including:
- glue code: massive amounts of supporting code written to get data into and out of general-purpose packages
- pipeline jungles: the system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output
- re-using input signals in ways that create unintended tight coupling of otherwise disjoint systems
- the risk that changes in the external world may make models or input signals change behavior in unintended ways, which can be difficult to monitor
The authors write: “A remarkable portion of real-world ‘machine learning’ work is devoted to tackling issues of this form… It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated ‘research’ and ‘engineering’ roles… It may be surprising to the academic community to know that only a tiny fraction of the code in many machine learning systems is actually doing ‘machine learning’.” (emphasis mine)
When machine learning projects fail
In an earlier post, I identified some failure modes in which machine learning projects are not effective in the workplace:
- The data science team builds really cool stuff that never gets used. There’s no buy-in from the rest of the organization for what they’re working on, and some of the data scientists don’t have a good sense of what can realistically be put into production.
- There is a backlog with data scientists producing models much faster than there is engineering support to put them in production.
- The data infrastructure engineers are separate from the data scientists. The pipelines don’t have the data the data scientists are asking for now, and the data scientists are under-utilizing the data sources the infrastructure engineers have collected.
- The company has definitely decided on feature/product X. They need a data scientist to gather some data that supports this decision. The data scientist feels like the PM is ignoring data that contradicts the decision; the PM feels that the data scientist is ignoring other business logic.
- The data science team interviews a candidate with impressive math modeling and engineering skills. Once hired, the candidate is embedded in a vertical product team that needs simple business analytics. The data scientist is bored and not utilizing their skills.
I framed these as organizational failures in my original post, but they can also be described as various participants being overly focused on just one slice of the complex system that makes up a full data product. These are failures of communication and goal alignment between different parts of the data product pipeline.
So, what do machine learning practitioners do?
As suggested above, building a machine learning product is a multi-faceted and complex task. Here are some of the things that machine learning practitioners may need to do during the process:
Understanding the context:
- identify areas of the business that could benefit from machine learning
- communicate with other stakeholders about what machine learning is and is not capable of (there are often many misconceptions)
- develop understanding of business strategy, risks, and goals to make sure everyone is on the same page
- identify what kind of data the organization has
- appropriately frame and scope the task
- understand operational constraints (e.g. what data is actually available at inference time)
- proactively identify ethical risks, including how your work could be misused by harassers, trolls, authoritarian governments, or for propaganda/disinformation campaigns (and plan how to reduce these risks)
- identify potential biases and potential negative feedback loops
Data:
- make plans to collect more or different data (if needed and if possible)
- stitch together data from many different sources: often this data has been collected in different formats or with inconsistent conventions
- deal with missing or corrupted data
- visualize the data
- create appropriate training, validation, and test sets
Modeling:
- choose which model to use
- fit model resource needs into constraints (e.g. will the completed model need to run on an edge device, or in a low-memory or high-latency environment?)
- choose hyperparameters (e.g. in the case of deep learning, this includes choosing an architecture, loss function, and optimizer)
- train the model (and debug why it’s not training). This can involve:
  - adjusting hyperparameters (e.g. the learning rate)
  - outputting intermediate results to see how the loss, training error, and validation error are changing with time
  - inspecting the data the model is wrong on to look for patterns
  - identifying underlying errors or issues with the data
  - realizing you need to change how you clean and pre-process the data
  - realizing you need more or different data augmentation
  - realizing you need more or different data
  - trying out different models
  - identifying if you are under- or over-fitting
Productionize:
- create an API or web app with your model as an endpoint in order to productionize it
- export your model into the needed format
- plan for how often your model will need to be retrained with updated data (e.g. perhaps you will retrain nightly or weekly)
Monitor:
- track model performance over time
- monitor the input data, to identify if it changes with time in a way that would invalidate your model
- communicate your results to the rest of the organization
- have a plan in place for how you will monitor and respond to mistakes or unexpected consequences
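To make a few of the steps above more concrete, here is a minimal Python sketch of three of them: creating training, validation, and test sets; identifying if you are under- or over-fitting; and monitoring the input data for changes over time. The function names, the split fractions, and the thresholds are all assumptions chosen for illustration, not values taken from the sources cited in this post. Real projects would usually lean on an established library and on thresholds tuned to the problem at hand.

```python
import random
import statistics


def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows once, then cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def diagnose_fit(train_error, val_error, target_error=0.10, max_gap=0.05):
    """Rough heuristic only; both thresholds are illustrative assumptions."""
    if train_error > target_error:
        return "underfitting"  # the model does not even fit the training set well
    if val_error - train_error > max_gap:
        return "overfitting"   # fits the training set but does not generalize
    return "ok"


def mean_shift(reference, recent):
    """Distance between the recent and reference means of one input feature,
    measured in standard deviations of the reference data."""
    spread = statistics.stdev(reference)
    difference = abs(statistics.mean(recent) - statistics.mean(reference))
    if spread == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / spread
```

For example, `split_dataset(range(100))` returns three disjoint lists that together contain all 100 rows, `diagnose_fit(0.02, 0.20)` reports `"overfitting"` because the validation error is far above the training error, and a large `mean_shift` value for a feature is one simple signal that the incoming data no longer resembles what the model was trained on. A random shuffle is only appropriate when rows are independent; time series or grouped data typically need a different splitting strategy.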
Certainly, not every machine learning practitioner needs to do all of the above steps, but components of this process will be a part of many machine learning applications. Even if you are working on just a subset of these steps, a familiarity with the rest of the process will help ensure that you are not overlooking considerations that would keep your project from being successful!
Reposted with permission.