Meta has released an evaluation of its new model, Chameleon, which is seen as a response to advanced models from frontier labs, as competition in generative artificial intelligence shifts toward multi-modal models.
Chameleon was designed to be multi-modal from the ground up, rather than stitching together separately built components. Although Meta has not yet released the model, early experiments indicate state-of-the-art performance on tasks such as image captioning and visual question answering, while remaining competitive on text-only tasks.
The Chameleon model opens up new uses for artificial intelligence that require a deep understanding of both visual and textual information. It is built on a new approach to training multi-modal models that treats text and images as sequences of discrete tokens. Unlike earlier architectures such as Unified-IO 2, which rely on separate encoding and decoding modules for different modalities, Chameleon uses a single unified transformer architecture.
The model is designed to learn from an interleaved mixture of images, text, code, and other media. Chameleon converts images into discrete tokens, much as language models handle words, and uses a unified vocabulary spanning text, image, and code tokens.
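Meta has not published Chameleon's code, so the following Python sketch is only a hypothetical illustration of the early-fusion idea, not Meta's implementation: the names (TEXT_VOCAB_SIZE, IMAGE_CODEBOOK_SIZE, tokenize_text, quantize_image) and all sizes are assumptions. It shows how text tokens and image tokens can share one vocabulary, so that a single interleaved sequence can be fed to one autoregressive transformer.

```python
import random

TEXT_VOCAB_SIZE = 65_536          # assumed size of the text sub-vocabulary
IMAGE_CODEBOOK_SIZE = 8_192       # assumed size of a VQ-style image codebook
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE              # image codes sit after text ids
BOI = IMAGE_TOKEN_OFFSET + IMAGE_CODEBOOK_SIZE    # special "begin image" token
EOI = BOI + 1                                     # special "end image" token


def tokenize_text(text: str) -> list[int]:
    """Stand-in text tokenizer: hash each word into the text id range."""
    return [hash(word) % TEXT_VOCAB_SIZE for word in text.split()]


def quantize_image(num_patches: int = 1024) -> list[int]:
    """Stand-in image tokenizer: a real model would use a learned
    vector-quantized encoder; here we just draw random codebook indices."""
    codes = [random.randrange(IMAGE_CODEBOOK_SIZE) for _ in range(num_patches)]
    return [BOI] + [IMAGE_TOKEN_OFFSET + code for code in codes] + [EOI]


# One interleaved document: a caption followed by the image's discrete codes.
# A single transformer trained on sequences like this can read and emit both
# modalities by predicting the next id in the unified vocabulary.
sequence = tokenize_text("A photo of a chameleon on a branch") + quantize_image()
print(len(sequence), sequence[:8])
```

Because everything is reduced to one stream of token ids, image captioning, visual question answering, and mixed text-and-image generation all become the same next-token prediction problem for a single model.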
According to the researchers, Gemini is the model most similar to Chameleon, but Google's Gemini uses separate image decoders in the generation phase, whereas Chameleon processes and generates tokens end to end.
Chameleon was trained in two stages on a massive dataset containing 4.4 trillion tokens of text, images, text-image pairs, and interleaved text-image sequences. The researchers also trained a 34-billion-parameter version using 10 trillion multi-modal tokens.
Based on experiments reported in the research paper, Chameleon can effectively perform a wide range of text-only and multi-modal tasks. It shows state-of-the-art performance on visual question answering and image captioning benchmarks, surpassing models such as Flamingo, IDEFICS, and LLaVA-1.5.
Chameleon also offers new capabilities for generating mixed-modal content, especially when responses need to combine text and images. In a human evaluation, Meta found that users preferred Chameleon's responses over those of Gemini Pro and GPT-4V, citing the quality of its mixed text-and-image answers to open-ended questions.