MathWorks is the developer of mathematical computing software for engineers and scientists, whose flagship product is MATLAB. Recently, Heather Gorr, Ph.D., senior MATLAB product manager, MathWorks, discussed the importance of streaming high-frequency data, data preparation for machine learning, and best practices for putting it all together to implement a successful AI model.
Where do you see ML and AI being used most right now?
Heather Gorr: AI is everywhere. Machine learning models are now integrated into applications in nearly every industry, from manufacturing equipment, to medical devices, to automated vehicles and more. Each of these systems incorporates hundreds or thousands of sensors that continuously stream hardware-generated data. This data is fed into an AI-powered model, which produces an equally constant stream of predictions that are sent to a database, dashboard, or other device.
How are current processes for data prep and model development keeping up?
HG: Such high-frequency data rates frequently exacerbate existing data preparation and model development challenges. Consider temperature or pressure monitoring equipment, for example: Each sensor has a slightly different sampling rate, so the readings must be synchronized into a single data set with unified times before the streamed data can be analyzed. Knowing where to start can be difficult, but techniques exist to help.
What does it mean to “synchronize data”?
HG: Imagine trying to establish a schedule for eating meals and going to bed while working from home. One might start by synchronizing their clocks, which makes it easier to both set times and keep track of how well actual meal and sleep times compare against set goals. Now imagine someone programmed a model to identify their ideal meal and sleep times based on the times recorded over the course of a few weeks; those synchronized clocks help the model process the data. If each clock in the house were off by a few minutes, the data may need to be aligned and time differences accounted for in the modeling phase.
How does that relate to the concept of synchronizing streaming data?
HG: The concept behind synchronizing streaming data is similar, but the larger scale of the data makes the processing stage more complicated. It requires planning, with careful consideration paid to the data generated, sampling rates, and system requirements.
What are the challenges in data prep for AI and ML?
HG: When developing an AI processing model, it often makes sense to consider the desired outcome first and ensure it aligns with the data being fed into the model. However, when feeding a model streaming data, the rest of the system must be considered as well. Perhaps the data has a specific time step or sample rate (for example, hourly, or every 10 seconds). The remaining data can then be aligned with one of the datasets’ original timestamps.
Say the average sample rate for equipment sensors is 1000 Hz, with 0.001 seconds between data points. This means the streaming system will produce an average of 1,000 samples per second. To accommodate this, an engineer can create a time vector that runs from 0 to 1 second, with a target sample rate of 1000 Hz and a time step of 0.001 seconds, then “resample” the data based on the new parameters.
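The time-vector-and-resample step described here can be sketched in a few lines of Python. The sensor timestamps and readings below are invented for illustration, and a production pipeline would use a library routine (for example, MATLAB's `retime` for timetables) rather than hand-rolled interpolation:

```python
# Sketch: resample an irregularly sampled sensor onto a uniform 1000 Hz
# time vector via linear interpolation. Raw readings are hypothetical.

def resample_linear(times, values, new_times):
    """Linearly interpolate (times, values) onto new_times."""
    out = []
    j = 0
    for t in new_times:
        # Advance to the source interval [times[j], times[j+1]] containing t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        v0, v1 = values[j], values[j + 1]
        frac = (t - t0) / (t1 - t0)
        out.append(v0 + frac * (v1 - v0))
    return out

# Uniform target grid: 0 to 1 second, time step 0.001 s (1000 Hz).
fs = 1000
new_times = [i / fs for i in range(fs + 1)]

# Irregular raw readings (made-up values).
raw_t = [0.0, 0.4, 1.0]
raw_v = [10.0, 14.0, 20.0]

resampled = resample_linear(raw_t, raw_v, new_times)
print(len(resampled))   # 1001 uniformly spaced samples
print(resampled[400])   # interpolated value at t = 0.4 s: 14.0
```

Linear interpolation is only one choice of fill method; the right one depends on the data, as discussed next.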
Key to the art of data synchronization is deciding how to fill in mismatched data points. In many cases, the original data is resampled; several common methods for doing so are listed in the table below. The correct method depends on how well the data points align and on the application's requirements.
What are the steps involved?
HG: When data scientists are unsure of the alignment between multiple datasets, a common solution is to fill the gaps with missing values (as in an outer join) or with a constant value. This can be a helpful first step, especially when sensors are involved. Exploring and visualizing the resulting data will help an engineer decide how best to proceed.
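The outer-join first step can be illustrated with a minimal sketch: take the union of two sensors' timestamps and mark unmatched readings as missing, so the gaps become visible before a fill strategy is chosen. The sensor names, timestamps, and readings here are hypothetical:

```python
# Sketch: outer-join two sensor streams on timestamp, filling gaps
# with a missing-value marker (None). All data is invented.

def outer_join(a, b, fill=None):
    """a, b: dicts mapping timestamp -> reading. Returns sorted rows
    (t, a_val, b_val), with `fill` where a sensor has no reading."""
    times = sorted(set(a) | set(b))
    return [(t, a.get(t, fill), b.get(t, fill)) for t in times]

temp  = {0.0: 21.5, 1.0: 21.7, 2.0: 21.9}   # sampled every 1 s
press = {0.0: 101.2, 1.5: 101.3}            # sampled irregularly

rows = outer_join(temp, press)
for row in rows:
    print(row)
# (0.0, 21.5, 101.2)
# (1.0, 21.7, None)
# (1.5, None, 101.3)
# (2.0, 21.9, None)
```

Passing a constant as `fill` instead of `None` gives the constant-value variant mentioned above.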
If the datasets closely align, any of the resampling methods shown above can be used. When the datasets are less aligned, however, interpolating or aggregating the data is more common. Imagine converting hourly data into daily data: How can 24 hours’ worth of data be represented in a single data point? In this example, an aggregation, such as the daily mean, is an appropriate solution. For non-numeric data, the mode, count, or nearest-neighbor methods are common.
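The hourly-to-daily example can be sketched directly: 24 hourly numeric readings collapse to a daily mean, while a non-numeric channel uses the mode instead. The readings below are invented for illustration:

```python
# Sketch: aggregate one day of hourly data to a single daily value.
# Numeric channel -> mean; non-numeric channel -> mode. Data is made up.
from statistics import mean, mode

hourly_temps = [20 + 0.25 * h for h in range(24)]   # 24 hourly readings
hourly_state = ["ok"] * 20 + ["warn"] * 4           # non-numeric channel

daily_temp  = mean(hourly_temps)   # numeric: daily mean
daily_state = mode(hourly_state)   # non-numeric: most frequent value

print(daily_temp)    # 22.875
print(daily_state)   # ok
```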