Inference & Runtime

Model serving, execution runtimes, acceleration, and routing for inference workloads.

73 projects, 6 subcategories, 35 tags

This is a legacy category URL. Results are aggregated using the new taxonomy.

Production model serving frameworks and stacks.

BentoML

BentoML is an open-source framework for packaging, containerizing, and deploying machine learning models into production-ready services.

Beta9

An open-source serverless runtime for AI workloads providing ultrafast container startup, GPU support, and scale-to-zero capabilities.

Golem

An open-source durable computing platform that simplifies building and deploying highly reliable distributed systems.

HAMi

HAMi is middleware that simplifies AI resource management across diverse hardware, improving performance and cluster utilization in cloud-native environments.

KServe

KServe is a Kubernetes-native model inference platform for scalable predictive and generative AI deployments.

kvcached

Virtualized elastic KV cache that brings OS-style virtual memory to LLM systems, enabling demand-driven KV allocation and improved GPU utilization.

Motia

Motia is a backend-first framework for APIs, backend workflows, event-driven orchestration, and AI workforce scheduling, designed to provide a React-like developer experience for server-side systems.

Local LLM server for Apple Silicon with continuous batching and SSD caching. Native macOS menu bar app, OpenAI-compatible API, optimized for Claude Code. Supports multiple model types including text LLMs, VLMs, and embedding models.

Triton Inference Server

Triton Inference Server: NVIDIA's high-performance inference server supporting multiple model formats and deployment options.

Runtime engines and low-level inference kernels.

Amplifier

Microsoft's tooling for development and deployment assistance, aimed at performance analysis, model deployment and pipeline support for AI projects.

Apache Spark

A unified analytics engine for large-scale data processing, supporting batch, streaming and machine learning workloads.

Chitu

A production-focused inference framework for large language models, offering high performance, multi-hardware support, and scalable deployment.

Compounding Engineering Plugin

An open-source plugin for engineering compounding scenarios that integrates with Claude Code.

Coral NPU

Coral NPU is an energy-efficient machine learning accelerator core for edge devices provided by Google Coral.

DeepGEMM

Clean and efficient FP8 GEMM kernels with fine-grained scaling to support accurate and performant low-precision matrix computations.

DeepSpeed

A high-performance library for training and inference that dramatically speeds up large-scale deep learning while reducing cost.

exo

exo: Run your own AI cluster at home using everyday devices, supporting distributed inference and a ChatGPT-compatible API.

Flash Linear Attention (fla)

A Triton-based PyTorch library of efficient linear-attention kernels and models for scalable sequence modeling.

FlashInfer

FlashInfer is a kernel library and JIT toolset for LLM serving that implements efficient attention and sampling kernels to improve GPU throughput and latency for inference serving.

Genesis

An open-source universal physics engine and generative data platform for robotics and embodied AI.

ggml

ggml is a lightweight tensor library for machine learning optimized for efficient model inference across hardware.

gpt-oss

gpt-oss is an open-weight model series released by OpenAI, designed for strong reasoning and customizable developer use cases.

KAI Scheduler

A Kubernetes-native scheduler for large-scale AI workloads, providing efficient resource orchestration and optimization for containerized AI training and inference workflows.

KTransformers

A flexible framework for LLM inference optimizations, offering kernel injection, prefix caching and multi-level acceleration strategies.

KubeAI

An AI inferencing operator for Kubernetes that simplifies deploying LLMs, embeddings, and speech-to-text services.

KubeRay

KubeRay is the Ray Project's open-source Kubernetes operator for deploying and managing Ray applications on Kubernetes.

LiteRT

A high-performance, scalable lightweight deep learning inference runtime for edge devices.

llm-d

A Kubernetes-native distributed inference stack providing well‑lit paths for high-performance LLM serving across diverse accelerators.

Machine Learning Systems (MLSysBook)

An open-source textbook on engineering real-world AI systems, covering system design from edge devices to cloud deployment.

Mini-SGLang

A lightweight, high-performance inference framework for large language models that balances engineering practicality with readability.

mistral.rs

mistral.rs is a lightweight, high-performance Rust inference library for running LLMs, including but not limited to Mistral-family models, in resource-constrained environments.

Mooncake

Mooncake is a KVCache-centric disaggregated architecture for LLM serving, providing a high-performance Transfer Engine and distributed KVCache storage.

NCCL

High-performance collective communication primitives for GPUs, optimized for PCIe, NVLink, NVSwitch and RDMA networks.

Newton

A GPU-accelerated physics simulation engine built on NVIDIA Warp, targeting robotics and simulation research.

NVIDIA GPU Operator

NVIDIA GPU Operator automates deployment, configuration, and management of GPU components and drivers in Kubernetes.

ONNX

ONNX is an open model exchange format and ecosystem that improves interoperability between frameworks, tools, and hardware.

ONNX Runtime

ONNX Runtime is a cross-platform, high-performance machine learning inference and training accelerator that runs models exported from PyTorch, TensorFlow/Keras and traditional ML libraries across diverse hardware.

OpenVINO

OpenVINO is an open-source toolkit from Intel for optimizing and deploying deep learning models for inference.

RamaLama

RamaLama simplifies running and serving AI models by packaging them as OCI container images and choosing hardware-optimized images for the host automatically.

Spice.ai

An open-source accelerated engine for time-series and data-grounded AI, offering SQL queries, full-text search, and LLM inference.

tinygrad

tinygrad is a minimalist deep learning library that implements tensor operations and autodiff with very little code, making it ideal for learning and experimentation.

Triton

Triton is a language and toolchain for high-performance deep learning kernels and compiler development, designed to simplify GPU kernel programming while delivering strong performance.

vLLM Production Stack

A reference system for Kubernetes-native cluster deployment and community-driven performance optimization for vLLM.

XGrammar

An efficient, flexible and portable structured generation engine that enforces syntactic correctness via constrained decoding.

Xinference (Xorbits Inference)

A model serving and inference framework that supports multiple backends, distributed deployments, and OpenAI-compatible APIs.

Sandboxes and isolated execution environments for AI workloads.

Agent Sandbox

An experimental sandbox project by Kubernetes SIGs aiming to provide a Kubernetes-native environment for running, orchestrating, and managing agent workloads securely and at scale.

BoxLite

A lightweight runtime and container tooling for embedding, sandboxing, and shipping AI agents.

E2B

A secure, open-source cloud runtime for AI apps and AI agents.

Flox

A Nix-powered, reproducible and shareable development environment and package manager.

OM1

OpenMind's modular AI runtime for deploying multimodal agents across digital environments and physical robots.

OpenSandbox

A universal sandbox platform for AI application scenarios, providing multi-language SDKs, a unified sandbox protocol, and extensible runtimes.

Sandbox Runtime

A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the OS level, without requiring a container.

GPU-oriented acceleration and performance optimization.

CUTLASS

CUDA Templates for Linear Algebra Subroutines (CUTLASS), a high-performance matrix computation template library provided by NVIDIA.

LightGBM

A fast, distributed, high-performance gradient boosting framework for decision tree algorithms, widely used for ranking, classification, and large-scale ML tasks.

XLA

XLA (Accelerated Linear Algebra) is a compiler for machine learning that generates optimized code for CPUs, GPUs, and accelerators to improve model execution performance.

On-device and local inference systems.

Transformers.js

Transformers.js: a JavaScript implementation of Hugging Face Transformers for the browser and Node, with WASM/ONNX backends for optimized on-device inference.

Routing, policy, and gateway controls for model access.