Inference & Runtime

Model serving, execution runtimes, acceleration, and routing for inference workloads.

92 Projects, 6 Subcategories, 52 Tags

Production model serving frameworks and stacks.

AIBrix

AIBrix is a cloud-native infrastructure framework for large-scale LLM inference, providing scalable and cost-efficient inference components.

BentoML

BentoML is an open-source framework for packaging, containerizing, and deploying machine learning models into production-ready services.

Beta9

An open-source serverless runtime for AI workloads providing ultrafast container startup, GPU support, and scale-to-zero capabilities.

Dynamo

Dynamo is NVIDIA's open-source framework for efficient multi-GPU inference, optimizing throughput and latency for large-scale deployments.

Golem

An open source durable computing platform that simplifies building and deploying highly reliable distributed systems.

gpustack

Open-source GPU cluster manager for efficient model training and high-performance inference orchestration.

HAMi

HAMi is middleware that simplifies AI resource management across diverse hardware, improving performance and cluster utilization in cloud-native environments.

Kaito

Kaito is a Kubernetes AI Toolchain Operator that automates deployment and management of large-model inference and tuning workflows, with built-in RAG support and node auto-provisioning.

KServe

KServe: a Kubernetes-native model inference platform for scalable predictive and generative AI deployments.

kvcached

Virtualized elastic KV cache that brings OS-style virtual memory to LLM systems, enabling demand-driven KV allocation and improved GPU utilization.

LMCache

A high-performance KV cache layer for LLM serving that reduces time-to-first-token and increases throughput, especially for long-context and multi-turn scenarios.

Modular Platform

An open, production-grade AI platform including the MAX inference server and Mojo libraries to accelerate model deployment across hardware.

NVIDIA Cloud Functions

NVIDIA Cloud Functions (NVCF) is a platform for deploying, managing, and running GPU-accelerated inference and streaming workloads at scale, powering build.nvidia.com.

Local LLM server for Apple Silicon with continuous batching and SSD caching. Native macOS menu bar app, OpenAI-compatible API, optimized for Claude Code. Supports multiple model types including text LLMs, VLMs, and embedding models.

Roboflow Inference

Roboflow Inference is a computer-vision inference and workflow platform that supports local and cloud deployment, video stream workflows, and rich model integrations.

TensorRT-LLM

NVIDIA's open-source toolbox for optimized LLM inference, designed for efficient GPU serving and enterprise deployment.

Triton Inference Server

Triton Inference Server: NVIDIA's high-performance inference server supporting multiple model formats and deployment options.

vLLM

High-throughput, memory-efficient inference and serving engine for large language models.

Runtime engines and low-level inference kernels.

Amplifier

Microsoft's tooling for development and deployment assistance, aimed at performance analysis, model deployment and pipeline support for AI projects.

Candle

Candle by Hugging Face: a minimalist, high-performance ML framework in Rust designed for serverless inference and lightweight deployments.

Chitu

A production-focused inference framework for large language models, offering high performance, multi-hardware support, and scalable deployment.

Coral NPU

Coral NPU is an energy-efficient machine learning accelerator core for edge devices provided by Google Coral.

DeepGEMM

Clean and efficient FP8 GEMM kernels with fine-grained scaling to support accurate and performant low-precision matrix computations.

exo

exo: Run your own AI cluster at home using everyday devices, supporting distributed inference and a ChatGPT-compatible API.

Flash Attention

Fast and memory-efficient exact attention implementation optimized for large Transformer training and inference.

Flash Linear Attention (fla)

A Triton-based, PyTorch library of efficient linear-attention kernels and models for scalable sequence modeling.

FlashInfer

FlashInfer is a kernel library and JIT toolset for LLM serving that implements efficient attention and sampling kernels to improve GPU throughput and latency for inference serving.

Genesis

An open-source universal physics engine and generative data platform for robotics and embodied AI.

gpt-oss

gpt-oss is an open-weight model series released by OpenAI, designed for high-reasoning and customizable developer use cases.

KAI Scheduler

A Kubernetes-native scheduler for large-scale AI workloads, providing efficient resource orchestration and optimization for containerized AI training and inference workflows.

KTransformers

A flexible framework for LLM inference optimizations, offering kernel injection, prefix caching and multi-level acceleration strategies.

KubeAI

An AI inferencing operator for Kubernetes that simplifies deploying LLMs, embeddings, and speech-to-text services.

KubeRay

KubeRay is the Ray Project's open-source Kubernetes operator for deploying and managing Ray applications on Kubernetes.

LiteRT

A high-performance, scalable lightweight deep learning inference runtime for edge devices.

llama.cpp

llama.cpp is a lightweight LLM inference library in C/C++, designed for efficient local and cloud inference across diverse hardware.

llm-d

A Kubernetes-native distributed inference stack providing well-lit paths for high-performance LLM serving across diverse accelerators.

Machine Learning Systems (MLSysBook)

An open-source textbook on engineering real-world AI systems, covering system design from edge devices to cloud deployment.

Mini-SGLang

A lightweight, high-performance inference framework for large language models that balances engineering practicality with readability.

mistral.rs

mistral.rs is a lightweight, high-performance Rust inference library for running Mistral-family models in resource-constrained environments.

MLX LM

A Python toolkit for running and fine-tuning LLMs on Apple Silicon, with support for quantization, distributed inference and Hugging Face integration.

Mooncake

Mooncake is a KVCache-centric disaggregated architecture for LLM serving, providing a high-performance Transfer Engine and distributed KVCache storage.

NCCL

High-performance collective communication primitives for GPUs, optimized for PCIe, NVLink, NVSwitch and RDMA networks.

Ollama

Local large language model runner that lets users easily run and manage open-source LLMs in local environments.

ONNX

ONNX is an open model exchange format and ecosystem that improves interoperability between frameworks, tools, and hardware.

ONNX Runtime

ONNX Runtime is a cross-platform, high-performance machine learning inference and training accelerator that runs models exported from PyTorch, TensorFlow/Keras and traditional ML libraries across diverse hardware.

OpenVINO

OpenVINO is an open-source toolkit from Intel for optimizing and deploying deep learning models for inference.

Outlines

A library for structured generation that simplifies producing and validating JSON/Pydantic outputs directly from LLMs.

RamaLama

RamaLama simplifies running and serving AI models by packaging them as OCI container images and choosing hardware-optimized images for the host automatically.

SGLang

High-performance open-source framework for LLM and VLM inference, supporting multimodal inputs, extreme concurrency, and flexible frontend programming.

Spice.ai

An open-source accelerated engine for time-series and data-grounded AI, offering SQL queries, full-text search, and LLM inference.

tinygrad

tinygrad is a minimalist deep learning library that implements tensor operations and autodiff with very little code, making it ideal for learning and experimentation.

Triton

Triton is a language and toolchain for high-performance deep learning kernels and compiler development, designed to simplify GPU kernel programming while delivering strong performance.

vLLM Production Stack

A reference system for Kubernetes-native cluster deployment and community-driven performance optimization for vLLM.

XGrammar

An efficient, flexible and portable structured generation engine that enforces syntactic correctness via constrained decoding.

Xinference (Xorbits Inference)

A model serving and inference framework that supports multiple backends, distributed deployments, and OpenAI-compatible APIs.

Sandboxes and isolated execution environments for AI workloads.

Agent Executor (AX)

Google's open source distributed agent runtime that coordinates agentic loops, manages executions with event logging, and provides native recovery and resumption for reliable agent deployments.

Agent Sandbox

An experimental sandbox project by Kubernetes SIGs aiming to provide a Kubernetes-native environment for running, orchestrating, and managing agent workloads securely and at scale.

AIO Sandbox

All-in-one sandbox environment for AI agents that combines Browser, Shell, File, MCP and VSCode Server into a single containerized runtime.

BoxLite

A lightweight runtime and container tooling for embedding, sandboxing, and shipping AI agents.

CubeSandbox

A high-performance, hardware-isolated sandbox service for AI agents built on RustVMM and KVM, with E2B SDK compatibility and sub-60ms cold starts.

Daytona

Secure and elastic infrastructure for running AI-generated code with isolated sandboxes, concurrency and persistent sandbox state.

E2B

Secure open source cloud runtime for AI apps & AI agents.

Flox

A Nix-powered, reproducible and shareable development environment and package manager.

LiteBox

A security-focused library OS that minimizes host interfaces and supports kernel- and user-mode constrained execution.

Monty

A minimal, secure Python interpreter written in Rust for safely executing LLM-generated Python code.

OM1

OpenMind's modular AI runtime for deploying multimodal agents across digital environments and physical robots.

OpenSandbox

A universal sandbox platform for AI application scenarios, providing multi-language SDKs, a unified sandbox protocol, and extensible runtimes.

OpenShell

NVIDIA OpenShell is a safe, private runtime for autonomous AI agents, providing sandboxed execution environments with declarative YAML policies to protect data, credentials, and infrastructure from unauthorized access.

Sandbox Runtime

A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the OS level, without requiring a container.

GPU-oriented acceleration and performance optimization.

CUTLASS

CUDA Templates for Linear Algebra Subroutines (CUTLASS), a high-performance matrix computation template library provided by NVIDIA.

FlashMLA

Efficient multi-head latent attention kernels designed to accelerate large-scale Transformer training and inference with reduced memory footprint.

TileLang

TileLang is a domain-specific language for high-performance AI kernels that simplifies writing GPU/CPU/accelerator operators.

Transformer Engine

Transformer Engine is an NVIDIA library focused on low-precision training and inference optimizations for Transformer models, supporting formats like FP8 to improve speed and memory efficiency.

XLA

XLA (Accelerated Linear Algebra) is a compiler for machine learning that generates optimized code for CPUs, GPUs, and accelerators to improve model execution performance.

On-device and local inference systems.

Cactus

An energy-efficient inference engine and numerical computing framework for phones, optimized for ARM CPUs to run large models with low power and memory footprint.

ggml

ggml is a lightweight tensor library for machine learning optimized for efficient model inference across hardware.

Transformers.js

Transformers.js: a JavaScript implementation of Hugging Face Transformers for the browser and Node, with WASM/ONNX backends for optimized on-device inference.

WebLLM

High-performance in-browser LLM inference engine that leverages WebGPU for hardware-accelerated, privacy-preserving inference in the browser.

Routing, policy, and gateway controls for model access.