Invent before extraction: Training a proprietary machine learning model
If you want to train a machine learning model, you first need a data set to train it on.
If you’ve got even a passing familiarity with machine learning (ML), you’re probably aware of this on some level. If someone else is providing the data set (e.g. you purchased it or pulled it from another project), you can go straight to the next steps of extracting the data, transforming it for training purposes, and loading it into the model (a process called ETL). In these circumstances, the data set itself requires relatively little attention.
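When the data set does come from somewhere else, that ETL step can be quite small. Here is a minimal sketch in Python, assuming a labeled CSV of sensor readings; the file name, column names, and the choice of a scikit-learn random forest are illustrative placeholders, not a prescribed pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Extract: load a labeled data set someone else has already assembled.
# "readings.csv" and its column names are placeholders for illustration.
df = pd.read_csv("readings.csv")

# Transform: drop incomplete rows and separate features from the label column.
df = df.dropna()
X = df.drop(columns=["label"])
y = df["label"]

# Load: feed the prepared data into a model for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```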
Many useful ML models, though, must be trained on data you collect yourself. As machine learning spreads into a wider range of applications, we find ourselves using it to classify things that nobody else has tried to classify. Some of the most interesting and useful models, for example, are based on data extracted from sensors or images. Developing that data set into something you can train your model on is a task unto itself, though, often requiring just as much care and attention as more familiar aspects of ML, like model interpretation, data drift, or extraction.
It also takes a different set of skills, especially if you’re collecting sensor- or image-based data. Situationally specific object detection and image classification are a different animal from poking at MNIST in a Jupyter notebook. In some of the projects we’ve worked on here at Smart Design, this has meant building a device that could both generate a data set for training a model and then use that model to perform inference.
The handoff between these two tasks is fuzzy: creating a clean, well-labeled data set uses many of the same tools that training a model does, and as the data set gets bigger and better, different aspects of collection and classification can be automated. In our experience, it’s useful to think of data set generation and model training as a single, integrated process that involves both hardware and software design. It breaks down roughly into four phases:
Phase 1. Manual collection and classification
Phase 2. Automated collection, manual classification
Phase 3. Monitored, automated collection and classification
Phase 4. Fully automated collection, training, and inference
Phase 1. Manual collection and classification - Foraging for Data
“The primary goal of this phase is validation. Your team is attempting to start a data stream from scratch.”
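In practice, Phase 1 can be as low-tech as a script that captures one reading at a time and asks a person to label it on the spot. The sketch below is one way that might look; the read_sensor() function and its dummy values are hypothetical stand-ins for whatever camera or sensor capture code your device actually uses.

```python
import csv
from datetime import datetime

def read_sensor():
    """Placeholder for the real capture code (serial read, camera frame, etc.)."""
    return {"temperature": 23.4, "humidity": 0.51}  # dummy values for illustration

# Append each manually labeled observation to a growing CSV data set.
with open("phase1_dataset.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        reading = read_sensor()
        print(f"Captured: {reading}")
        label = input("Enter a label for this reading (or 'q' to quit): ").strip()
        if label.lower() == "q":
            break
        writer.writerow([datetime.now().isoformat(), *reading.values(), label])
        f.flush()  # don't lose entries if the session is interrupted
```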
Phase 2. Automated collection, manual classification - Opening up the Data Flow
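Once the data source itself is validated, collection can run unattended while classification stays manual. A rough sketch of that split is shown below, assuming a hypothetical capture_sample() function and a simple SQLite index of unlabeled samples that a reviewer can work through later; the storage format and capture cadence are examples, not requirements.

```python
import sqlite3
import time
from datetime import datetime
from pathlib import Path

def capture_sample() -> bytes:
    """Hypothetical capture: on a real device this might grab a camera frame
    or read a batch of sensor values."""
    return b"raw sample bytes"

DATA_DIR = Path("phase2_samples")
DATA_DIR.mkdir(exist_ok=True)

# A small index of unlabeled samples for a person to classify later.
db = sqlite3.connect("phase2_index.db")
db.execute("CREATE TABLE IF NOT EXISTS samples (path TEXT, captured_at TEXT, label TEXT)")

for _ in range(10):  # in practice this loop runs on the device indefinitely
    sample = capture_sample()
    path = DATA_DIR / f"{datetime.now().strftime('%Y%m%dT%H%M%S%f')}.bin"
    path.write_bytes(sample)
    db.execute("INSERT INTO samples VALUES (?, ?, NULL)",
               (str(path), datetime.now().isoformat()))
    db.commit()
    time.sleep(1)  # capture cadence is device- and project-specific
```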
Phase 3. Monitored, automated collection and classification - Automating and Codifying
“As the set grows in complexity and detail, a developer can use it to start training a simplified ML model to further speed up the process.”
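One way to put that simplified model to work is to let it pre-label incoming samples and route only its low-confidence predictions to a human reviewer. The sketch below uses scikit-learn, with randomly generated arrays standing in for the labeled set built up in the earlier phases; the 0.9 confidence threshold is an arbitrary example you would tune for your own data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for the labeled set accumulated in Phases 1 and 2.
X_labeled = np.random.rand(500, 4)
y_labeled = np.random.randint(0, 2, size=500)
X_new = np.random.rand(20, 4)  # freshly collected, unlabeled samples

# Train a simple interim model on what has been labeled so far.
model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Auto-accept confident predictions; route uncertain ones to a human reviewer.
CONFIDENCE_THRESHOLD = 0.9
for i, p in enumerate(model.predict_proba(X_new)):
    confidence = p.max()
    predicted = model.classes_[p.argmax()]
    if confidence >= CONFIDENCE_THRESHOLD:
        print(f"sample {i}: auto-labeled {predicted} ({confidence:.2f})")
    else:
        print(f"sample {i}: flagged for manual review ({confidence:.2f})")
```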
Phase 4. Fully automated collection, training, and inference - Moving to Production
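By this phase, collection, retraining, and inference can all run without a person in the loop. Here is a minimal sketch of a periodic retraining job under assumed names: load_current_dataset() is a placeholder for your real data store, and joblib is used here only as one convenient way to publish a model artifact that the device can load for inference.

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def load_current_dataset():
    """Placeholder: pull the full, continuously collected data set from storage."""
    X = np.random.rand(1000, 4)
    y = np.random.randint(0, 3, size=1000)
    return X, y

def retrain_and_publish(model_path="model.joblib"):
    """Retrain on the latest data and write the model where the device can load it."""
    X, y = load_current_dataset()
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    joblib.dump(model, model_path)
    return model

# In production this might run on a schedule (cron, Airflow, etc.);
# the device then loads the published model and runs inference locally.
model = retrain_and_publish()
print(model.predict(np.random.rand(1, 4)))
```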
Conclusion
The more diverse your data stream, the more effectively it will train your ML model. But greater diversity also brings greater complication: in collecting a wide enough range of entries, in validating them, and in automating their classification. This is why we’ve learned it’s best to validate your data source and collection process first (Phase 1), using as little infrastructure as possible, before focusing on analysis and automation.
If you’re starting from scratch, data collection is your tightest bottleneck; leaving analysis until your data pipeline is flowing smoothly is the best way to avoid days or weeks of wasted effort. This is critical, because any manual entry or analysis by developers is time they could be spending building and training a better model.
The technology team at Smart Design has gone through this process on a variety of projects, the Gatorade Gx Sweat Patch being one recent example. If there’s one critical principle we’ve learned, it’s that effort and organization early on can streamline the process later, getting you to a useful ML model faster. When we know a model is going to use real-world data, we budget time and expertise for the first few phases, knowing that the demand will taper off as the model improves. Developers who really understand what it takes to pull good data from a camera or sensor can make all the difference in this kind of project, provided they’re involved from the beginning.