ClearML
ClearML is an open-source MLOps platform providing experiment tracking, data management, pipelines and model serving.
Training frameworks, fine-tuning, evaluation, observability, and optimization workflows.
Training frameworks and distributed training systems.
ClearML is an open-source MLOps platform providing experiment tracking, data management, pipelines and model serving.
EasyR1 is an efficient, scalable RL training framework for multimodal models, based on veRL and optimized for large-model training.
GY An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).
JA High-performance Python library for accelerator-oriented array computation and composable program transformations.
NA A minimal, fast repository for training and fine-tuning medium-sized GPT models, suitable for teaching and experiments.
NeMo RL is a scalable post-training reinforcement learning library for large models, supporting high-performance distributed training and multiple backends.
OP An end-to-end framework for creating, deploying and using isolated execution environments, aimed at agentic RL training and environment development.
PyTorch Lightning is an open-source framework that streamlines PyTorch training, enabling efficient model development, training, and deployment.
RLinf is a flexible and scalable open-source RL infrastructure designed for Embodied and Agentic AI, supporting PPO, GRPO, SAC and more, with seamless scaling to large GPU clusters.
SK A modular full-stack reinforcement learning (RL) library for large language models (LLMs), designed for long-horizon, real-world tasks.
xLLM is an open-source framework for vision-language models, providing tools and documentation for training and inference.
SFT, RLHF, preference optimization, and alignment.
AR A fully asynchronous reinforcement learning system for large reasoning and agentic models that emphasizes scalability and reproducibility.
AX A free and open-source LLM post-training and fine-tuning framework that supports multiple models, training methods, and distributed optimizations.
Heretic is a fully automated tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration," with a TPE-based parameter optimizer powered by Optuna to automatically find high-quality ablation parameters by co-minimizing refusals and KL divergence from the original model.
LM An extensible, convenient, and efficient toolbox for fine-tuning and inference of large foundation models.
ML A local-first toolkit for inference and fine-tuning of vision-language and omni models using MLX, optimized for macOS and general hardware.
OP An easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM and DeepSpeed, supporting distributed and hybrid-engine training.
Petri is an alignment auditing agent designed to quickly explore alignment hypotheses and help researchers automate evaluation workflows.
TRL is an open-source toolkit from Hugging Face for reinforcement learning training on transformer models.
Tunix is a JAX-native post-training library for LLMs providing efficient fine-tuning, RL training, and distillation tools.
UN High-performance toolkit for fine-tuning and reinforcement learning of large models, with memory-efficient kernels and wide model support.
Evaluation frameworks, benchmark suites, and datasets.
Agenta is an open-source LLMOps platform that combines prompt management, evaluation, and observability to help teams ship reliable LLM applications faster.
DeepEval is an open-source LLM evaluation framework that provides modular metrics and tooling for testing LLM systems and RAG pipelines.
DE An open-source framework for red-teaming large language models and LLM systems, focused on security and robustness evaluation.
DI A tool for automated data quality evaluation that combines rule-based and model-based assessments.
EA An easy-to-use knowledge editing framework providing multiple editing methods, evaluation metrics and datasets; supports LLMs and some multimodal editing scenarios.
EV An open-source framework for evaluating, testing, and monitoring ML and LLM systems from experiments to production.
FU A static analysis tool that assesses codebase 'legacy-mess' and generates readable Markdown reports.
GI An open-source evaluation and testing framework to detect performance, bias, and security issues in AI systems.
Holistic Evaluation of Language Models (HELM) from Stanford CRFM: an open framework for reproducible, transparent model evaluation and benchmark management.
Inspector is a visual testing tool for MCP (Model Context Protocol) servers that helps developers validate and visualize server behavior and responses.
LI A lightweight toolkit from Hugging Face for fast, flexible LLM evaluation across multiple backends.
LiveBench is a contamination-aware, objective LLM benchmark suite that provides reproducible question sets, automatic scoring, and an online leaderboard.
The Language Model Evaluation Harness is a framework for large-scale, reproducible evaluation of generative language models across many tasks and datasets.
SWIFT from ModelScope: a scalable, lightweight infrastructure for fine-tuning, evaluating and deploying large and multimodal models, with training, quantization and inference acceleration support.
OP A one-stop platform for evaluating large models, providing benchmarks, evaluation toolkits and leaderboards to reproduce and compare model capabilities.
OpenLIT is an open-source platform for AI engineering that provides LLM observability, prompt management, evaluations and guardrails.
Opik is an open-source LLM evaluation and observability platform that helps teams build, evaluate and optimize LLM applications.
Promptfoo is a developer-first, local LLM testing and red-teaming tool for automated evaluations, vulnerability scanning, and CI integration.
Ragas is an open-source toolkit for evaluating and optimizing LLM applications, offering objective metrics, test data generation, and production feedback loops.
ReLE (chinese-llm-benchmark) is a continuously updated Chinese LLM evaluation and leaderboard project covering education, medical, finance, legal, reasoning and other capability dimensions.
TileLang is a domain-specific language for high-performance AI kernels that simplifies writing GPU/CPU/accelerator operators.
TO A PyTorch-native post-training and fine-tuning toolkit providing reusable recipes, optimizations, and quantization support for LLM training and evaluation.
Tracing, logging, and runtime observability.
Helicone is an open-source LLM observability and analytics platform that captures requests, traces and sessions to help developers debug, evaluate and optimize model usage.
OP An OpenTelemetry-inspired observability toolkit for LLM/AI, providing request tracing and metrics aggregation for diagnostics and monitoring.
Polyaxon is an MLOps platform for managing, training and monitoring large-scale machine learning workloads.
Prompt management, versioning, and quality tooling.
CO A tool that converts codebases into a single LLM prompt for code analysis, generation, and automation workflows.
PR An open-source AI-powered code review and PR assistant that runs locally, in CI, or self-hosted; supports multi-platform integrations and customizable prompts.
RE A tool that packs an entire repository into an AI-friendly file, making it easy to provide structured code context to large models.
Safety filters, guardrails, and risk controls.
AN 754 structured cybersecurity skills for AI agents mapped to 5 frameworks including MITRE ATT&CK, NIST CSF 2.0, and MITRE ATLAS.
LI A security-focused library OS that minimizes host interfaces and supports kernel- and user-mode constrained execution.
MC A tool to scan MCP servers and tools for potential security issues, using multi-engine analysis and customizable reporting.
NVIDIA NemoClaw is an open-source reference stack for running OpenClaw always-on assistants more safely inside NVIDIA OpenShell, providing guided onboarding, hardened blueprints, state management, and routed inference.
NVIDIA OpenShell is a safe, private runtime for autonomous AI agents, providing sandboxed execution environments with declarative YAML policies to protect data, credentials, and infrastructure from unauthorized access.
SU A secure proxy between apps, models and tools that enforces runtime protections and validates tool calls.