DeepMind has open-sourced Perceiver IO, a general-purpose deep-learning model architecture that can handle many different types of inputs and outputs. Perceiver IO can serve as a “drop-in” replacement for Transformers that performs as well as or better than baseline models, without making domain-specific assumptions.
The model architecture was described in a post on DeepMind’s blog. Perceiver IO builds on the Transformer architecture, using attention to map inputs into a latent representation space. Unlike Transformers, however, Perceiver IO allows for longer input sequences without incurring quadratic compute and memory costs. The model can accept simultaneous inputs of multiple data types, including text, images, and sound, and can likewise decode the latent representation into any desired output data type. In a series of experiments, the DeepMind research team evaluated the model’s performance on a variety of tasks, including language understanding, image classification, optical flow, and video-game playing; the model performed “comparable to state-of-the-art models” on most tasks and achieved new state-of-the-art results on the Sintel optical flow benchmark. According to the researchers,
We hope our latest [paper] and the code available on Github help researchers and practitioners tackle problems without needing to invest the time and effort to build custom solutions using specialised systems. As we continue to learn from exploring new kinds of data, we look forward to further improving upon this general-purpose architecture and making it faster and easier to solve problems throughout science and machine learning.
Most deep-learning models are based on architectures designed for a particular type of data; for example, computer vision (CV) models typically use convolutional neural networks (CNN), while natural language processing (NLP) models are based on a sequence-learning architecture such as the Transformer. Systems that handle multi-modal data, such as Google’s combined vision-language model, are often designed to process the different input data types with domain-specific architectures, then combine them using a third module. While some researchers have used the Transformer architecture in CV problems, these typically begin by applying a CNN-based pre-processing step. Additionally, the compute and memory resources required by a Transformer increase with the square of input sequence length, making them an impractical choice for many high-dimensional data types.
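The quadratic cost mentioned above comes from the attention-score matrix itself: for a sequence of n tokens, standard self-attention compares every token with every other token, producing an n × n matrix. A minimal NumPy sketch (learned Q/K/V projections and multi-head structure omitted for clarity) makes the scaling visible:

```python
import numpy as np

def attention_weights(x):
    """Self-attention weights for n input tokens.

    Illustrative sketch only: real Transformers apply learned Q/K/V
    projections and use multiple heads, which are omitted here.
    """
    q, k = x, x  # pretend the Q and K projections are the identity
    scores = q @ k.T / np.sqrt(x.shape[-1])  # shape (n, n): quadratic in n
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# Doubling the sequence length quadruples the attention matrix.
for n in (1024, 2048):
    a = attention_weights(np.random.randn(n, 64))
    print(n, a.shape)  # (1024, 1024) then (2048, 2048)
```

A 2,048-token sequence already needs over four million attention entries per head per layer, which is why directly attending over, say, every pixel of an image is impractical.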
To solve this complexity problem, the Perceiver IO architecture uses cross-attention to project high-dimensional input arrays into a lower-dimensional latent space. This latent representation is then processed by a standard Transformer self-attention stack. Because the latent space is much smaller than the input, the Transformer module processing it can be much deeper than is practical for a Transformer that attends directly over large input arrays. Finally, the latent representation is converted to an output by applying a query array with the same number of elements as the desired output.
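The encode–process–decode flow can be sketched in a few lines of NumPy. The array sizes below are hypothetical and the cross-attention is a bare single-head version without learned projections or MLP blocks; the point is only to show that the expensive O(n²) self-attention happens over the small latent array, not over the raw inputs:

```python
import numpy as np

def cross_attend(queries, keys_values):
    """Bare single-head cross-attention; learned projections omitted."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ keys_values

# Hypothetical sizes for illustration.
n_in, n_latent, n_out, d = 10_000, 256, 10, 128

inputs = np.random.randn(n_in, d)         # large input array (e.g. pixels)
latents = np.random.randn(n_latent, d)    # small learned latent array
out_query = np.random.randn(n_out, d)     # one query per desired output element

# 1) Encode: cross-attention maps the inputs into the latent space.
z = cross_attend(latents, inputs)         # (256, 128), regardless of n_in

# 2) Process: deep self-attention over latents costs O(n_latent^2), not O(n_in^2).
for _ in range(8):
    z = cross_attend(z, z)

# 3) Decode: the output query array reads out exactly n_out elements.
out = cross_attend(out_query, z)          # (10, 128)
print(z.shape, out.shape)
```

Because the decoder output shape is set entirely by the query array, the same trained backbone can produce outputs of a different size or modality simply by supplying a different set of queries.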
The research team evaluated Perceiver IO on a variety of tasks from different domains. For NLP, the team used the GLUE benchmark; compared to a BERT model requiring the same number of FLOPs, Perceiver IO performed slightly better. For CV, the researchers used the ImageNet benchmark. They tested Perceiver IO with several variations of positional feature inputs, some of which contained 2D spatial information; with these, the model performed comparably to the ResNet-50 baseline. When using only learned positional features—that is, without giving the input any knowledge of the 2D image structure—the model performed slightly worse: 72.7% accuracy vs ResNet-50’s 78.6%. However, according to the authors, “this is the best result by any model on ImageNet without 2D architectural or feature information.” To further demonstrate its capabilities, the team used Perceiver IO as a “drop-in” replacement for the Transformer module in an AlphaStar video-game-playing AI and achieved the same level of performance as the original.
In a Twitter discussion about the work, DeepMind researcher Carl Doersch noted,
I admit I half expected it to memorize the train set, and prepared a bunch of clever ideas to force generalization, but these weren’t needed. I’m still a bit in disbelief that it worked so well out-of-the-box.
The Perceiver IO code is available on GitHub.