Invent before extraction: Training a proprietary machine learning model

Technical Product Manager

If you want to train a machine learning model, you first need a data set to train it on.

If you’ve got even a passing familiarity with machine learning (ML), you’re probably aware of this on some level. If someone else is providing the data set (i.e. you purchased it or pulled it from another project), you can go straight to the next steps of extracting data, transforming it for training purposes, and loading it into the model (a process called ETL). In these circumstances, the data set itself requires relatively little attention.
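For readers who haven't seen the extract, transform, and load steps written out, here is a minimal sketch using only the Python standard library. The CSV content, column names, and function names are all hypothetical, chosen to illustrate the shape of the process rather than any particular project's pipeline.

```python
import csv
import io

# Hypothetical raw export; in practice this would be a purchased or borrowed data set.
RAW_CSV = """species,leaf_length_mm,leaf_width_mm
Mint,42,21
 nettle ,88,40
mint,39,
Nettle,91,44
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize labels, convert measurements to floats, drop incomplete rows."""
    features, labels = [], []
    for row in rows:
        try:
            vec = [float(row["leaf_length_mm"]), float(row["leaf_width_mm"])]
        except (TypeError, ValueError):
            continue  # skip rows with missing or malformed measurements
        features.append(vec)
        labels.append(row["species"].strip().lower())
    return features, labels

def load(features, labels):
    """Load: package the data in the form a training routine would consume."""
    return {"X": features, "y": labels}

dataset = load(*transform(extract(RAW_CSV)))
```

With a ready-made data set, most of the effort sits in the `transform` step. The rest of this article is about what happens when there is nothing to pass to `extract` yet.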
Many useful ML models, though, must be trained on data you collect yourself. As machine learning spreads into a wider range of applications, we find ourselves using it to classify things that nobody else has tried to classify. Some of the most interesting and useful models, for example, are based on data extracted from sensors or images. Developing that data into something you can train your model on is a task unto itself, though, often requiring just as much care and attention as more familiar aspects of ML, like model interpretation, data drift, or extraction.
It also takes a different set of skills, especially if you’re collecting sensor- or image-based data. Situationally specific object detection and image classification is a different animal compared to poking at MNIST in a Jupyter notebook. In some projects we’ve worked on here at Smart Design, this required building a device that could both generate a data set for training a model and then use the model to perform inference.
The handoff between these two tasks is fuzzy: creating a clean, well-labeled data set uses many of the same tools that training a model does, and as the data set gets bigger and better, different aspects of collection and classification can be automated. In our experience, it’s useful to think of data set generation and model training as a single, integrated process that involves both hardware and software design. It breaks down roughly into four phases:
Phase 1. Manual collection and classification

Phase 2. Automated collection, manual classification

Phase 3. Monitored, automated collection and classification

Phase 4. Fully automated collection, training, and inference

Phase 1. Manual collection and classification - Foraging for Data

To understand how these phases progress, and the effort each one requires, it’s helpful to imagine a hypothetical use case. Imagine you’re a company that makes herbal teas, and intend to expand into a range of teas that use foraged plants. To streamline the process, you want an ML model that can instantly identify your desired plants from a quick snapshot of foliage. There’s no pre-existing data set you can use to train your model, so your first task is to go out and take a bunch of photos of foliage, manually identify the useful plants in them, and enter the results into a database. 
This is Phase 1 of the process, and whether you’re collecting photos of plants, weather data from a sensor suite, or recordings of human speech, you’re looking at an extremely manual, hands-on process. A technician has to set up and configure the hardware that collects the data, someone has to operate it, a developer needs to create processes for turning the hardware’s output into something that can be easily reviewed and classified, and a subject matter expert has to do the classification. 
The primary goal of this phase is validation. Your team is attempting to start a data stream from scratch, which brings numerous unknowns: 

Is the sensor or image-processing technology you’re using able to capture the right data?

What pre-processing is necessary to output something that can be classified?

How reliable is classification, even if done manually by an expert?

How wide a range of categories will you need to define in order to have a useful data set?

Success in this phase requires answering these questions and adjusting the collection process until it consistently provides usable outputs that can be reliably classified.
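One way to put a number on the question of how reliable manual classification is: have two experts label the same samples independently and compare their answers. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects for agreement expected by chance. The labels and variable names are invented for illustration, and this is one possible check rather than a prescribed part of the process.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels from two botanists for the same six photos.
botanist_1 = ["mint", "mint", "nettle", "nettle", "other", "mint"]
botanist_2 = ["mint", "nettle", "nettle", "nettle", "other", "mint"]
kappa = cohens_kappa(botanist_1, botanist_2)
```

A kappa near 1 suggests the categories are well defined. A low value is a hint that the category definitions or the captured images need work before any automation is attempted.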
In our herbal tea example, you can think of a technician, a botanist, and a camera operator walking through the forest, looking for plants they want identified. The botanist says what kind of plant it is, the technician sets up the camera, and the operator takes the photo. Then a second technician processes and stores the photo, along with its identifying information.


Phase 2. Automated collection, manual classification - Opening up the Data Flow

To transition to the second phase of the process, the team needs to streamline the information-gathering step. For any data stream that depends on sensors or cameras, this falls to the developer or technician who sets them up and runs them.
After several rounds of manually gathering and identifying images, readings, or recordings, the developer should be able to write a few lines of Python (for example) that allow a non-developer to simply point the camera and shoot, or perform an equivalent action, and produce a useful output that can be stored and sorted. A simple application takes the place of one of the developers, who can move on to something else (with a better understanding of how to approach data set development).
Note that this remains a labor-intensive process, since identification is still done manually (in our example, a botanist still says what each plant is), but collecting the data is no longer a technically involved task. This speeds up gathering and sorting, helping the team build a data set large enough that it can be leveraged to further automate the process in the next phase.
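As a rough illustration of what those few lines of Python might look like, the sketch below saves each capture under a content-derived filename and appends a record to a manifest that the expert can label later. The function name `store_capture`, the manifest layout, and the stand-in image bytes are all assumptions; reading from an actual camera is left out because it depends entirely on the hardware.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def store_capture(image_bytes, out_dir, label=None):
    """Save one capture and append a manifest record for later labeling."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after a hash of its content so repeat captures don't collide.
    digest = hashlib.sha256(image_bytes).hexdigest()[:16]
    image_path = out_dir / f"{digest}.jpg"
    image_path.write_bytes(image_bytes)
    record = {
        "file": image_path.name,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "label": label,  # None until the expert classifies it
    }
    with open(out_dir / "manifest.jsonl", "a", encoding="utf-8") as manifest:
        manifest.write(json.dumps(record) + "\n")
    return record

# Stand-in bytes; a real device would pass the camera's output here.
demo_dir = tempfile.mkdtemp()
first = store_capture(b"fake-image-bytes", demo_dir)
second = store_capture(b"another-fake-image", demo_dir, label="mint")
```

Keeping the label as an optional field means the same record format serves both the operator, who leaves it empty, and the botanist, who fills it in.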





Phase 3. Monitored, automated collection and classification - Automating and Codifying

Integrating data collection with ML training allows them to piggyback off each other. Even a relatively small data set of a few hundred entries can be used to generate simple rules-based classification systems, which accelerates the next round of data gathering. As the set grows in complexity and detail, a developer can use it to start training a simplified ML model to further speed up the process, which will ultimately mature into the full-fledged model everyone’s after.
In Phase 3, there’s a gradual hand-off from manual to automated classification. The developer gradually expands the range of inputs that can be automatically interpreted, first using rules-based criteria, then simple ML models. This continues until this developer, too, can move on. Data capture, classification, and storage are all automated, with manual tasks limited to selecting a target for capture, and checking that classification has occurred successfully. The ML model is still far from complete, but a robust data pipeline is now in place that feeds it the information it needs to improve, with little outside effort.
In the plant identification example, we can imagine that the team is no longer in a forest, but in a botanic garden, where every useful plant has been identified and labeled. At this point, the operator can act more or less independently, walking around and snapping photo after photo, making sure that the label is captured along with each plant. The operator still checks the photos to make sure they’re clear and identifiable, but the botanist is no longer needed. The botanic garden itself is the training data set, large and robust enough to train a variety of ML models on an ongoing basis. 
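The hand-off from manual to automated classification can be sketched as a triage step: hand-written rules label the cases they are sure about, and everything else goes to a person. The feature names, thresholds, and plant rules below are made up for illustration and are not botanically authoritative. In a real project they would come from the data gathered in the earlier phases, and a simple ML model could later replace `rule_based_label`.

```python
def rule_based_label(features):
    """Return (label, confident) from hand-written rules; thresholds are made up."""
    ratio = features["leaf_length_mm"] / features["leaf_width_mm"]
    if features["serrated_edge"] and ratio > 2.0:
        return "nettle", True
    if not features["serrated_edge"] and ratio < 1.5:
        return "plantain", True
    return None, False  # no rule fired; defer to a person

def triage(samples):
    """Split samples into auto-labeled records and a manual review queue."""
    auto_labeled, review_queue = [], []
    for sample in samples:
        label, confident = rule_based_label(sample)
        if confident:
            auto_labeled.append({**sample, "label": label})
        else:
            review_queue.append(sample)
    return auto_labeled, review_queue

# Hypothetical measurements extracted from three photos.
samples = [
    {"leaf_length_mm": 90, "leaf_width_mm": 40, "serrated_edge": True},
    {"leaf_length_mm": 60, "leaf_width_mm": 50, "serrated_edge": False},
    {"leaf_length_mm": 45, "leaf_width_mm": 25, "serrated_edge": True},
]
auto_labeled, review_queue = triage(samples)
```

The share of samples that land in `review_queue` is a simple measure of progress: as rules and models improve, that queue should shrink until the manual check becomes a spot check.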


Phase 4. Fully automated collection, training, and inference - Moving to production

The final phase is reached when the ML model is trained to the point that even the operator is no longer necessary. At this point any end user can use the model to capture a photo, reading, recording, or other information source, and receive a reliable classification with no other human involvement. 
Returning to the plant analogy: the model is now an app that any tea company employee can download onto a device. They shoot a photo of a plant and learn what it is and whether or not to pick it.
Training the ML model is complicated, of course, but it’s also a well-understood process that can tap into a robust ecosystem of advice and tools for getting it right. The difference in this case is that, rather than plugging an existing, large, clean, well-classified data set into the model, we’ve connected it to a data pipeline that’s been gradually developed over weeks or months.


The more diversified your data stream, the more effectively it will train your ML model. But greater diversity also brings greater complication: in collecting a wide enough range of entries, in validating them, and in automating their classification. This is why we’ve learned it’s best to validate your data source and collection process first (Phase 1), using as little infrastructure as possible, before focusing on analysis and automation.

If you’re starting from scratch, data collection is your tightest bottleneck; leaving analysis until after your data pipeline is moving smoothly is the best way to avoid days or weeks of wasted effort. This is critical, because any time developers spend on manual entry or analysis is time they could be spending on building and training a better model.

The technology team at Smart Design has gone through this process on a variety of projects, the Gatorade Gx Sweat Patch being one recent example. If there’s one critical principle we’ve learned, it’s that effort and organization early on can streamline the process later, getting you to a useful ML model faster. If we know we’re developing a model that’s going to use real-world data, we know to budget time and expertise for the first few phases, knowing that the need for both will taper off as the model improves. Developers who really understand what it takes to pull good data from a camera or sensor can make all the difference in this kind of project, provided they’re involved from the beginning.

About Tyler Sanborn

Tyler Sanborn is a Technical Product Manager with a background spanning ad agencies, startups, and digital publishing. He brings expertise in technology and has worked across the healthcare, pharma, apparel, and financial sectors. His notable clients include Vans, Blue Cross Blue Shield of California, and Truth, the anti-big tobacco campaign. He holds a degree from Emerson College and has a passion for computer science history and all things automotive.
