vLLM-Omni

Tracked

A framework for high-performance, cost-efficient inference and serving of omni-modality models across text, image, video, and audio.

Author vLLM Project Open Sourced 2025-09-11 Last Commit Unknown

vLLM-Omni is an inference and serving framework for omni-modality models that handle text, image, video, and audio inputs alongside heterogeneous outputs. Built on vLLM's proven high-performance inference engine, it extends support to non-autoregressive architectures such as Diffusion Transformers and parallel generation models. The framework targets production-grade deployment where throughput, cost efficiency, and multi-modal flexibility are critical.

Multi-Modal Inference Pipeline

  • Unified serving pipeline covering text, image, video, and audio within a single deployment
  • Low-latency and high-throughput execution powered by efficient KV cache management
  • Staged pipeline scheduling for optimal resource utilization
  • Seamless integration with Hugging Face model weights and OpenAI-compatible API

Decoupled Architecture

  • Model stages separated from inference stages through OmniConnector
  • Distributed deployment with dynamic resource allocation across nodes
  • Independent scaling of prefill and decode stages
  • Native support for non-autoregressive generation workflows and heterogeneous output formats

Scalability and Performance

  • KV cache optimization and memory-compute trade-off strategies inherited from vLLM
  • Tensor, pipeline, and expert parallelism for scaling across GPUs and nodes
  • High-throughput inference backend for large-scale image or video generation pipelines
  • Streaming outputs and low-latency execution for real-time multimedia applications