Model serving, execution runtimes, acceleration, and routing for inference workloads.
Production model serving frameworks and stacks.
AI AIBrix is a cloud-native infrastructure framework for large-scale LLM inference, providing scalable and cost-efficient inference components.
BE BentoML is an open-source framework for packaging, containerizing, and deploying machine learning models into production-ready services.
BE An open-source serverless runtime for AI workloads providing ultrafast container startup, GPU support, and scale-to-zero capabilities.
DY Dynamo by NVIDIA is an open-source framework for efficient multi-GPU inference, optimizing throughput and latency for large-scale deployments.
GO An open source durable computing platform that simplifies building and deploying highly reliable distributed systems.
GP Open-source GPU cluster manager for efficient model training and high-performance inference orchestration.
HA HAMi is middleware that simplifies AI resource management across diverse hardware, improving performance and cluster utilization in cloud-native environments.
KA Kaito is a Kubernetes AI Toolchain Operator that automates deployment and management of large-model inference and tuning workflows, with built-in RAG support and node auto-provisioning.
KS KServe: a Kubernetes-native model inference platform for scalable predictive and generative AI deployments.
KV Virtualized elastic KV cache that brings OS-style virtual memory to LLM systems, enabling demand-driven KV allocation and improved GPU utilization.
LM A high-performance KV cache layer for LLM serving that reduces time-to-first-token and increases throughput, especially for long-context and multi-turn scenarios.
MO An open, production-grade AI platform including the MAX inference server and Mojo libraries to accelerate model deployment across hardware.
NV NVIDIA Cloud Functions (NVCF) is a platform for deploying, managing, and running GPU-accelerated inference and streaming workloads at scale, powering build.nvidia.com.
OM Local LLM server for Apple Silicon with continuous batching and SSD caching. Native macOS menu bar app, OpenAI-compatible API, optimized for Claude Code. Supports multiple model types including text LLMs, VLMs, and embedding models.
RO Roboflow Inference is a computer-vision inference and workflow platform that supports local and cloud deployment, video stream workflows, and rich model integrations.
TE NVIDIA's open-source toolbox for optimized LLM inference, designed for efficient GPU serving and enterprise deployment.
TR Triton Inference Server: NVIDIA's high-performance inference server supporting multiple model formats and deployment options.
VL High-throughput, memory-efficient inference and serving engine for large language models.
Runtime engines and low-level inference kernels.
AM Microsoft's tooling for development and deployment assistance, aimed at performance analysis, model deployment and pipeline support for AI projects.
CA Candle by Hugging Face: a minimalist, high-performance ML framework in Rust designed for serverless inference and lightweight deployments.
CH A production-focused inference framework for large language models, offering high performance, multi-hardware support, and scalable deployment.
CO Coral NPU, from Google Coral, is an energy-efficient machine learning accelerator core for edge devices.
DE Clean and efficient FP8 GEMM kernels with fine-grained scaling to support accurate and performant low-precision matrix computations.
EX exo: Run your own AI cluster at home using everyday devices, supporting distributed inference and a ChatGPT-compatible API.
FL Fast and memory-efficient exact attention implementation optimized for large Transformer training and inference.
FL A Triton-based, PyTorch library of efficient linear-attention kernels and models for scalable sequence modeling.
FL FlashInfer is a kernel library and JIT toolset for LLM serving that implements efficient attention and sampling kernels to improve GPU throughput and latency for inference serving.
GE An open-source physics engine that serves as a universal physics simulation and generative data platform for robotics and embodied AI.
GP gpt-oss is an open-weight model series released by OpenAI, designed for strong reasoning and customizable developer use cases.
KA A Kubernetes-native scheduler for large-scale AI workloads, providing efficient resource orchestration and optimization for containerized AI training and inference workflows.
KT A flexible framework for LLM inference optimizations, offering kernel injection, prefix caching and multi-level acceleration strategies.
KU An AI inferencing operator for Kubernetes that simplifies deploying LLMs, embeddings, and speech-to-text services.
KU KubeRay is the Ray Project's open-source Kubernetes operator for deploying and managing Ray applications on Kubernetes.
LI A high-performance, scalable lightweight deep learning inference runtime for edge devices.
LL llama.cpp is a lightweight LLM inference library in C/C++, designed for efficient local and cloud inference across diverse hardware.
LL A Kubernetes-native distributed inference stack providing well‑lit paths for high-performance LLM serving across diverse accelerators.
MA An open-source textbook on engineering real-world AI systems, covering system design from edge devices to cloud deployment.
MI A lightweight, high-performance inference framework for large language models that balances engineering practicality with readability.
MI mistral.rs is a lightweight, high-performance Rust inference library for running Mistral-family models in resource-constrained environments.
ML A Python toolkit for running and fine-tuning LLMs on Apple Silicon, with support for quantization, distributed inference and Hugging Face integration.
MO Mooncake is a KVCache-centric disaggregated architecture for LLM serving, providing a high-performance Transfer Engine and distributed KVCache storage.
NC High-performance collective communication primitives for GPUs, optimized for PCIe, NVLink, NVSwitch and RDMA networks.
OL Local large language model runner that enables users to easily run and manage various open-source LLM models in local environments.
ON ONNX is an open model exchange format and ecosystem that improves interoperability between frameworks, tools, and hardware.
ON ONNX Runtime is a cross-platform, high-performance machine learning inference and training accelerator that runs models exported from PyTorch, TensorFlow/Keras and traditional ML libraries across diverse hardware.
OP OpenVINO is an open-source toolkit from Intel for optimizing and deploying deep learning models for inference.
OU A library for structured generation that simplifies producing and validating JSON/Pydantic outputs directly from LLMs.
RA RamaLama simplifies running and serving AI models by packaging them as OCI container images and choosing hardware-optimized images for the host automatically.
SG High-performance open-source framework for LLM and VLM inference, supporting multimodal models, very high concurrency, and flexible frontend programming.
SP An open-source accelerated engine for time-series and data-grounded AI, offering SQL queries, full-text search, and LLM inference.
TI tinygrad is a minimalist deep learning library that implements tensor operations and autodiff with very little code, making it ideal for learning and experimentation.
TR Triton is a language and toolchain for high-performance deep learning kernels and compiler development, designed to simplify GPU kernel programming while delivering strong performance.
VL A reference system for Kubernetes-native cluster deployment and community-driven performance optimization for vLLM.
XG An efficient, flexible and portable structured generation engine that enforces syntactic correctness via constrained decoding.
XI A model serving and inference framework that supports multiple backends, distributed deployments, and OpenAI-compatible APIs.
Sandboxes and isolated execution environments for AI workloads.
AG Google's open source distributed agent runtime that coordinates agentic loops, manages executions with event logging, and provides native recovery and resumption for reliable agent deployments.
AG An experimental sandbox project by Kubernetes SIGs aiming to provide a Kubernetes-native environment for running, orchestrating, and managing agent workloads securely and at scale.
AI All-in-one sandbox environment for AI agents that combines Browser, Shell, File, MCP and VSCode Server into a single containerized runtime.
BO A lightweight runtime and container tooling for embedding, sandboxing, and shipping AI agents.
CU A high-performance, hardware-isolated sandbox service for AI agents built on RustVMM and KVM, with E2B SDK compatibility and sub-60ms cold starts.
DA Secure and elastic infrastructure for running AI-generated code with isolated sandboxes, concurrency and persistent sandbox state.
E2 Secure open source cloud runtime for AI apps & AI agents.
FL A Nix-powered, reproducible and shareable development environment and package manager.
LI A security-focused library OS that minimizes host interfaces and supports kernel- and user-mode constrained execution.
MO A minimal, secure Python interpreter written in Rust for safely executing LLM-generated Python code.
OM OpenMind's modular AI runtime for deploying multimodal agents across digital environments and physical robots.
OP A universal sandbox platform for AI application scenarios, providing multi-language SDKs, a unified sandbox protocol, and extensible runtimes.
OP NVIDIA OpenShell is a safe, private runtime for autonomous AI agents, providing sandboxed execution environments with declarative YAML policies to protect data, credentials, and infrastructure from unauthorized access.
SA A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the OS level, without requiring a container.
GPU-oriented acceleration and performance optimization.
CU CUDA Templates for Linear Algebra Subroutines (CUTLASS), a high-performance matrix computation template library provided by NVIDIA.
FL Efficient multi-head latent attention kernels designed to accelerate large-scale Transformer training and inference with reduced memory footprint.
TI TileLang is a domain-specific language for high-performance AI kernels that simplifies writing GPU/CPU/accelerator operators.
TR Transformer Engine is an NVIDIA library focused on low-precision training and inference optimizations for Transformer models, supporting formats like FP8 to improve speed and memory efficiency.
XL XLA (Accelerated Linear Algebra) is a compiler for machine learning that generates optimized code for CPUs, GPUs, and accelerators to improve model execution performance.
On-device and local inference systems.
CA An energy-efficient inference engine and numerical computing framework for phones, optimized for ARM CPUs to run large models with low power and memory footprint.
GG ggml is a lightweight tensor library for machine learning optimized for efficient model inference across hardware.
TR Transformers.js: a JavaScript implementation of Hugging Face Transformers for the browser and Node, with WASM/ONNX backends for optimized on-device inference.
WE High-performance in-browser LLM inference engine that leverages WebGPU for hardware-accelerated, privacy-preserving inference in the browser.
Routing, policy, and gateway controls for model access.
AG A high-performance proxy data plane for agents, providing security, observability, and governance capabilities for agent-to-agent and agent-to-tool communication.
AI Portkey's AI Gateway is a high-performance, enterprise-ready LLM routing and governance platform that supports many model providers and rich guardrail policies.
AR ArchGW is a model-native proxy server for agents that provides routing, guardrails, tool calling and end-to-end observability.
CL A routing tool for coding workflows that optimizes how Claude requests are distributed and how responses are handled.
CL ClawRouter is an agent-native LLM router empowering OpenClaw with smart routing, cost optimization, and micropayments support.
CL CloudBase AI ToolKit provides out-of-the-box AI IDE, frontend and backend templates, and deployment pipelines to help developers quickly generate, deploy and host full-stack AI applications.
EN AI API gateway based on Envoy Proxy, providing high-performance routing, load balancing, and security management for AI services.
GA Combines Gateway API with Envoy External Processing to provide a Kubernetes-native inference gateway for optimizing GenAI inference deployments.
HI A cloud-native API gateway based on Istio and Envoy that supports Wasm plugins and AI Gateway features including MCP hosting and multi-model integrations.
LI LiteLLM is a lightweight LLM gateway and proxy framework providing a unified OpenAI-format API, routing, rate-limits, and pluggable provider integrations for production deployments.
LL Lightweight multi-provider LLM client with an OpenAI-compatible server API and optional chat UI.
PL Plano is an open-source AI gateway and policy runtime for routing, securing, and observing production LLM/API traffic.
SU A secure proxy between apps, models and tools that enforces runtime protections and validates tool calls.
VL An intelligent Mixture-of-Models router that directs requests to the most suitable models to improve inference accuracy and efficiency.