up:: π€ Artificial Intelligence
type:: #π
status:: #π/π
tags:: #on/ai
topics:: Artificial Super Intelligence (ASI), Artificial General Intelligence (AGI)
links:: ChatGPT
Multimodal AI
Multimodal AI combines various data modalities, such as text, images, and speech, within a single model.
Traditionally, deep learning models are trained on a single data source. For example, an NLP model is trained on text, a computer vision model is trained on an image dataset, and an acoustic model is trained on speech for tasks such as wake-word detection and noise cancellation. This kind of ML is single-modal AI, because the model's output maps to one data type: text, images, or speech.
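A minimal sketch of the single-modal idea (illustrative only, in PyTorch; the class names and sizes are assumptions, not from the source): each model consumes exactly one data type.

```python
# Illustrative sketch: two separate single-modal models, each tied to one data type.
import torch.nn as nn

class TextClassifier(nn.Module):
    """Single-modal NLP model: consumes only tokenized text."""
    def __init__(self, vocab_size=10_000, embed_dim=128, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return self.fc(pooled)

class ImageClassifier(nn.Module):
    """Single-modal vision model: consumes only image tensors."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.fc = nn.Linear(16, num_classes)

    def forward(self, images):                    # (batch, 3, H, W)
        features = self.conv(images).mean(dim=(2, 3))
        return self.fc(features)
```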
Multimodal AI, on the other hand, combines several modalities, such as text, images, and speech, to create systems closer to human perception. DALL-E from OpenAI is a recent example of multimodal AI that generates images from text prompts. Google's Multitask Unified Model (MUM) improves the search experience by ranking results using contextual information drawn from 75 different languages. Another example is NVIDIA's GauGAN2 model, which uses text-to-image generation to produce photorealistic images from text inputs.
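A minimal late-fusion sketch of the multimodal idea (illustrative only; not how DALL-E, MUM, or GauGAN2 are actually built): text and image features are encoded separately and then combined into one joint prediction.

```python
# Illustrative sketch: a simple multimodal classifier that fuses text and image features.
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, num_classes=5):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        self.image_conv = nn.Conv2d(3, embed_dim, kernel_size=3, padding=1)
        # Fusion layer: concatenated text + image features feed one shared head.
        self.fusion = nn.Linear(embed_dim * 2, num_classes)

    def forward(self, token_ids, images):
        text_feat = self.text_embed(token_ids).mean(dim=1)      # (batch, embed_dim)
        image_feat = self.image_conv(images).mean(dim=(2, 3))   # (batch, embed_dim)
        joint = torch.cat([text_feat, image_feat], dim=-1)
        return self.fusion(joint)

# Usage: a single forward pass takes both modalities at once.
model = MultimodalClassifier()
tokens = torch.randint(0, 10_000, (2, 12))
images = torch.randn(2, 3, 32, 32)
logits = model(tokens, images)                                   # (2, 5)
```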