Model serving, execution runtimes, acceleration, and routing for inference workloads.
Production model serving frameworks and stacks.
BE BentoML is an open-source framework for packaging, containerizing, and deploying machine learning models into production-ready services.
BE An open-source serverless runtime for AI workloads providing ultrafast container startup, GPU support, and scale-to-zero capabilities.
GO An open-source durable computing platform that simplifies building and deploying highly reliable distributed systems.
HA HAMi is middleware that simplifies AI resource management across heterogeneous hardware, improving performance and cluster utilization in cloud-native environments.
KS KServe: a Kubernetes-native model inference platform for scalable predictive and generative AI deployments.
KV Virtualized elastic KV cache that brings OS-style virtual memory to LLM systems, enabling demand-driven KV allocation and improved GPU utilization.
MO Motia is a backend-first framework for APIs, backend workflows, event-driven orchestration, and AI workforce scheduling, designed to provide a React-like developer experience for server-side systems.
OM Local LLM server for Apple Silicon with continuous batching and SSD caching. Native macOS menu bar app, OpenAI-compatible API, optimized for Claude Code. Supports multiple model types including text LLMs, VLMs, and embedding models.
TR Triton Inference Server: NVIDIA's high-performance inference server supporting multiple model formats and deployment options.
Runtime engines and low-level inference kernels.
AM Microsoft tooling that assists with AI development and deployment, covering performance analysis, model deployment, and pipeline support.
AP A unified analytics engine for large-scale data processing, supporting batch, streaming and machine learning workloads.
CH A production-focused inference framework for large language models, offering high performance, multi-hardware support, and scalable deployment.
CO An open-source plugin for engineering compounding scenarios that integrates with Claude Code.
CO Coral NPU is an energy-efficient machine learning accelerator core for edge devices provided by Google Coral.
DE Clean and efficient FP8 GEMM kernels with fine-grained scaling to support accurate and performant low-precision matrix computations.
DE A high-performance library for training and inference that dramatically speeds up large-scale deep learning while reducing cost.
EX exo: Run your own AI cluster at home using everyday devices, supporting distributed inference and a ChatGPT-compatible API.
FL A Triton-based, PyTorch library of efficient linear-attention kernels and models for scalable sequence modeling.
FL FlashInfer is a kernel library and JIT toolset for LLM serving that implements efficient attention and sampling kernels to improve GPU throughput and latency for inference serving.
GE Universal physics simulation and generative data platform for robotics and embodied AI, open-source physics engine.
GG ggml is a lightweight tensor library for machine learning optimized for efficient model inference across hardware.
GP gpt-oss is an open-weight model series released by OpenAI, designed for high-reasoning and customizable developer use cases.
KA A Kubernetes-native scheduler for large-scale AI workloads, providing efficient resource orchestration and optimization for containerized AI training and inference workflows.
KT A flexible framework for LLM inference optimizations, offering kernel injection, prefix caching and multi-level acceleration strategies.
KU An AI inferencing operator for Kubernetes that simplifies deploying LLMs, embeddings, and speech-to-text services.
KU KubeRay is the Ray Project's open-source Kubernetes operator for deploying and managing Ray applications on Kubernetes.
LI A high-performance, scalable lightweight deep learning inference runtime for edge devices.
LL A Kubernetes-native distributed inference stack providing well-lit paths for high-performance LLM serving across diverse accelerators.
MA An open-source textbook on engineering real-world AI systems, covering system design from edge devices to cloud deployment.
MI A lightweight, high-performance inference framework for large language models that balances engineering practicality with readability.
MI mistral.rs is a lightweight, high-performance Rust inference library for running Mistral-family models in resource-constrained environments.
MO Mooncake is a KVCache-centric disaggregated architecture for LLM serving, providing a high-performance Transfer Engine and distributed KVCache storage.
NC High-performance collective communication primitives for GPUs, optimized for PCIe, NVLink, NVSwitch and RDMA networks.
NE A GPU-accelerated physics simulation engine built on NVIDIA Warp, targeting robotics and simulation research.
NV NVIDIA GPU Operator automates deployment, configuration, and management of GPU components and drivers in Kubernetes.
ON ONNX is an open model exchange format and ecosystem that improves interoperability between frameworks, tools, and hardware.
ON ONNX Runtime is a cross-platform, high-performance machine learning inference and training accelerator that runs models exported from PyTorch, TensorFlow/Keras and traditional ML libraries across diverse hardware.
OP OpenVINO is an open-source toolkit from Intel for optimizing and deploying deep learning models for inference.
RA RamaLama simplifies running and serving AI models by packaging them as OCI container images and choosing hardware-optimized images for the host automatically.
SP An open-source accelerated engine for time-series and data-grounded AI, offering SQL queries, full-text search, and LLM inference.
TI tinygrad is a minimalist deep learning library that implements tensor operations and autodiff with very little code, making it ideal for learning and experimentation.
TR Triton is a language and toolchain for high-performance deep learning kernels and compiler development, designed to simplify GPU kernel programming while delivering strong performance.
VL A reference system for Kubernetes-native cluster deployment and community-driven performance optimization for vLLM.
XG An efficient, flexible and portable structured generation engine that enforces syntactic correctness via constrained decoding.
XI A model serving and inference framework that supports multiple backends, distributed deployments, and OpenAI-compatible APIs.
Sandboxes and isolated execution environments for AI workloads.
AG An experimental sandbox project by Kubernetes SIGs aiming to provide a Kubernetes-native environment for running, orchestrating, and managing agent workloads securely and at scale.
BO A lightweight runtime and container tooling for embedding, sandboxing, and shipping AI agents.
E2 Secure open-source cloud runtime for AI apps and AI agents.
FL A Nix-powered, reproducible and shareable development environment and package manager.
OM OpenMind's modular AI runtime for deploying multimodal agents across digital environments and physical robots.
OP A universal sandbox platform for AI application scenarios, providing multi-language SDKs, a unified sandbox protocol, and extensible runtimes.
SA A lightweight sandboxing tool for enforcing filesystem and network restrictions on arbitrary processes at the OS level, without requiring a container.
GPU-oriented acceleration and performance optimization.
CU CUDA Templates for Linear Algebra Subroutines (CUTLASS), a high-performance matrix computation template library provided by NVIDIA.
LI A fast, distributed, high-performance gradient boosting framework for decision tree algorithms, widely used for ranking, classification, and large-scale ML tasks.
XL XLA (Accelerated Linear Algebra) is a compiler for machine learning that generates optimized code for CPUs, GPUs, and accelerators to improve model execution performance.
On-device and local inference systems.
TR Transformers.js: a JavaScript implementation of Hugging Face Transformers for the browser and Node, with WASM/ONNX backends for optimized on-device inference.
Routing, policy, and gateway controls for model access.
AG A high-performance proxy data plane for agents, providing security, observability, and governance capabilities for agent-to-agent and agent-to-tool communication.
AI Portkey's AI Gateway is a high-performance, enterprise-ready LLM routing and governance platform that supports many model providers and rich guardrail policies.
AR ArchGW is a model-native proxy server for agents that provides routing, guardrails, tool calling and end-to-end observability.
CL An intelligent routing tool that optimizes how Claude AI requests are distributed and how responses are handled during code development.
CL ClawRouter is an agent-native LLM router empowering OpenClaw with smart routing, cost optimization, and micropayments support.
CL CloudBase AI ToolKit provides out-of-the-box AI IDE, frontend and backend templates, and deployment pipelines to help developers quickly generate, deploy and host full-stack AI applications.
CS An open-source platform for LLM asset and lifecycle management, offering SaaS and on-premise deployment with Python SDK compatibility.
EN AI API gateway based on Envoy Proxy, providing high-performance routing, load balancing, and security management for AI services.
GA Combines Gateway API with Envoy External Processing to provide a Kubernetes-native inference gateway for optimizing GenAI inference deployments.
HI A cloud-native API gateway based on Istio and Envoy that supports Wasm plugins and AI Gateway features including MCP hosting and multi-model integrations.
LI LiteLLM is a lightweight LLM gateway and proxy framework providing a unified OpenAI-format API, routing, rate-limits, and pluggable provider integrations for production deployments.
LL Lightweight multi-provider LLM client with an OpenAI-compatible server API and optional chat UI.
LO LocalAGI is a self-hostable agent platform focused on privacy, local execution, and extensibility.
OB An open-source MCP gateway and AI platform for self-hosted or cloud deployments, providing chat, administration, and audit capabilities.
PL Plano is an open-source AI gateway and policy runtime for routing, securing, and observing production LLM/API traffic.
TO An enterprise platform for deploying and governing MCP servers, featuring registry, runtime, gateway, and portal components.
VL An intelligent Mixture-of-Models router that directs requests to the most suitable models to improve inference accuracy and efficiency.