Microsoft has introduced two new additions to its Phi-4 line of artificial intelligence models, both designed to run on relatively modest hardware. One of the two is multimodal, meaning it can process multiple data formats.
The Phi-4-mini model is exclusively text-based, while Phi-4-multimodal is an enhanced version that also handles visual and audio inputs. According to Microsoft, both models outperform similarly sized alternatives on specific tasks.
Features and Capabilities
Phi-4-mini has 3.8 billion parameters, making it compact enough to run on mobile devices. It is built on a decoder-only variant of the Transformer architecture. Whereas bidirectional transformer models analyze the words both before and after a given term, a decoder-only model attends only to the preceding text. This optimization reduces computing load and increases processing speed.
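To make the decoder-only idea concrete, here is a toy PyTorch sketch (not Microsoft's code, and with illustrative dimensions) of the causal mask such models apply: every token is blocked from attending to anything that comes after it.

```python
import torch

def causal_attention_scores(q, k):
    """Toy scaled dot-product attention with a causal mask: each token
    attends only to itself and to earlier tokens, which is the
    decoder-only behavior described above."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5             # (seq, seq)
    seq = scores.size(-1)
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))      # hide later tokens
    return torch.softmax(scores, dim=-1)                    # rows sum to 1

# Row i carries zero weight on columns j > i: token i never sees the future.
q = k = torch.randn(5, 16)
print(causal_attention_scores(q, k))
```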
To further enhance performance, Microsoft integrated Grouped Query Attention, a technique in which groups of attention heads share key and value projections, cutting memory use while still letting the model prioritize the most relevant data for each task. Phi-4-mini can generate text, translate documents, and call external applications. Microsoft claims it also performs well at solving mathematical problems and writing computer code, even when complex reasoning is required, and states that its accuracy “significantly” exceeds that of other models of similar size.
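The sketch below shows the core of Grouped Query Attention. The head counts are illustrative assumptions, not Phi-4-mini's actual configuration: eight query heads share two key/value heads, so the key/value tensors (and cache) are a quarter the size of full multi-head attention.

```python
import torch

def grouped_query_attention(x, n_q_heads=8, n_kv_heads=2, head_dim=16):
    """Minimal GQA sketch: many query heads reuse a small set of
    key/value heads. Head counts here are illustrative only."""
    seq, _ = x.shape
    d = head_dim
    # Illustrative random projections; a real model learns these weights.
    wq = torch.randn(x.size(-1), n_q_heads * d)
    wk = torch.randn(x.size(-1), n_kv_heads * d)
    wv = torch.randn(x.size(-1), n_kv_heads * d)

    q = (x @ wq).view(seq, n_q_heads, d).transpose(0, 1)    # (8, seq, d)
    k = (x @ wk).view(seq, n_kv_heads, d).transpose(0, 1)   # (2, seq, d)
    v = (x @ wv).view(seq, n_kv_heads, d).transpose(0, 1)

    # Each group of 4 query heads shares one key/value head.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=0)                   # (8, seq, d)
    v = v.repeat_interleave(group, dim=0)

    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (attn @ v).transpose(0, 1).reshape(seq, n_q_heads * d)

out = grouped_query_attention(torch.randn(5, 64))
print(out.shape)  # torch.Size([5, 128])
```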
Phi-4-multimodal is an extended version of Phi-4-mini, featuring 5.6 billion parameters. It supports text, images, audio, and video as input formats. Microsoft developed this model using the Mixture of LoRAs method. Typically, adapting AI to new tasks involves modifying its weights, the parameters that determine how data is processed. The LoRA (Low-Rank Adaptation) method avoids retraining the full weight matrices by keeping them frozen and adding a small set of trainable low-rank weights on top. Mixture of LoRAs extends this technique to multimodal processing, allowing Phi-4-multimodal to incorporate specialized adapter weights for audio and video tasks.
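Here is a loose PyTorch sketch of that idea, not Microsoft's implementation: a frozen base layer plus one small low-rank adapter per modality, with the rank, scaling, and modality names chosen for illustration.

```python
import torch

class LoRAAdapter(torch.nn.Module):
    """One low-rank update: instead of retraining a full d_out x d_in
    weight, LoRA learns two small matrices B (d_out x r) and A (r x d_in)."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, r))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.scale * (x @ self.A.t() @ self.B.t())

class MixtureOfLoRALinear(torch.nn.Module):
    """Frozen shared weight plus one adapter per modality; the input's
    modality selects which adapter is applied."""
    def __init__(self, d_in, d_out, modalities=("text", "vision", "audio")):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)   # pretrained weights stay frozen
        self.adapters = torch.nn.ModuleDict(
            {m: LoRAAdapter(d_in, d_out) for m in modalities})

    def forward(self, x, modality):
        return self.base(x) + self.adapters[modality](x)

layer = MixtureOfLoRALinear(512, 512)
print(layer(torch.randn(4, 512), modality="audio").shape)  # torch.Size([4, 512])
```

The appeal of this design is that each adapter trains only r × (d_in + d_out) numbers per layer rather than d_in × d_out, so adding a new modality leaves the base model untouched.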
Performance and Availability
In visual-processing benchmarks, Phi-4-multimodal scored 72 points, slightly trailing the leading models from OpenAI and Google. In simultaneous video and audio processing, however, it significantly outperformed Google's Gemini-2.0 Flash and the open-source InternOmni, notes NIXSolutions.
Both models are available on the Hugging Face platform under the MIT license, allowing for commercial use.
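For readers who want to experiment, here is a minimal sketch of loading the text model with the Hugging Face transformers library. The model id below matches Microsoft's naming on the hub at the time of writing, but check the model card before running it, since some releases require extra loading flags such as trust_remote_code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id; verify against the current Hugging Face listing.
model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Explain grouped query attention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

We’ll keep you updated as more developments emerge.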