Training, Evaluation & Optimization

Training frameworks, fine-tuning, evaluation, observability, and optimization workflows.

55 Projects, 6 Subcategories, 36 Tags

Training frameworks and distributed training systems.

ClearML

ClearML is an open-source MLOps platform providing experiment tracking, data management, pipelines and model serving.

EasyR1

EasyR1 is an efficient, scalable RL training framework for multimodal models, based on veRL and optimized for large-model training.

Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).

JAX

High-performance Python library for accelerator-oriented array computation and composable program transformations.

nanoGPT

A minimal, fast repository for training and fine-tuning medium-sized GPT models, suitable for teaching and experiments.

NeMo RL

NeMo RL is a scalable post-training reinforcement learning library for large models, supporting high-performance distributed training and multiple backends.

OpenEnv — Agentic Execution Environments

An end-to-end framework for creating, deploying and using isolated execution environments, aimed at agentic RL training and environment development.

PyTorch Lightning

PyTorch Lightning is an open-source framework that streamlines PyTorch training, enabling efficient model development, training, and deployment.

RLinf

RLinf is a flexible and scalable open-source RL infrastructure designed for Embodied and Agentic AI, supporting PPO, GRPO, SAC and more, with seamless scaling to large GPU clusters.

SkyRL

A modular full-stack reinforcement learning (RL) library for large language models (LLMs), designed for long-horizon, real-world tasks.

xLLM

xLLM is an open-source framework for vision-language models, providing tools and documentation for training and inference.

SFT, RLHF, preference optimization, and alignment.

AReaL

A fully asynchronous reinforcement learning system for large reasoning and agentic models that emphasizes scalability and reproducibility.

Axolotl

A free and open-source LLM post-training and fine-tuning framework that supports multiple models, training methods, and distributed optimizations.

Heretic

Heretic is a fully automated tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration," with a TPE-based parameter optimizer powered by Optuna to automatically find high-quality ablation parameters by co-minimizing refusals and KL divergence from the original model.

LMFlow

An extensible, convenient, and efficient toolbox for fine-tuning and inference of large foundation models.

MLX-VLM

A local-first toolkit for inference and fine-tuning of vision-language and omni models using MLX, optimized for macOS and general hardware.

OpenRLHF

An easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM and DeepSpeed, supporting distributed and hybrid-engine training.

Petri

Petri is an alignment auditing agent designed to quickly explore alignment hypotheses and help researchers automate evaluation workflows.

TRL

TRL is an open-source library from Hugging Face for post-training transformer models with methods such as supervised fine-tuning, DPO, and reinforcement learning.

Tunix

Tunix is a JAX-native post-training library for LLMs providing efficient fine-tuning, RL training, and distillation tools.

Unsloth

High-performance toolkit for fine-tuning and reinforcement learning of large models, with memory-efficient kernels and wide model support.

Evaluation frameworks, benchmark suites, and datasets.

Agenta

Agenta is an open-source LLMOps platform that combines prompt management, evaluation, and observability to help teams ship reliable LLM applications faster.

DeepEval

DeepEval is an open-source LLM evaluation framework that provides modular metrics and tooling for testing LLM systems and RAG pipelines.

DeepTeam

An open-source framework for red-teaming large language models and LLM systems, focused on security and robustness evaluation.

Dingo

A tool for automated data quality evaluation that combines rule-based and model-based assessments.

EasyEdit

An easy-to-use knowledge editing framework providing multiple editing methods, evaluation metrics and datasets; supports LLMs and some multimodal editing scenarios.

Evidently

An open-source framework for evaluating, testing, and monitoring ML and LLM systems from experiments to production.

fuck-u-code

A static analysis tool that scores how much of a "legacy mess" a codebase is and generates readable Markdown reports.

Giskard OSS

An open-source evaluation and testing framework to detect performance, bias, and security issues in AI systems.

HELM

Holistic Evaluation of Language Models (HELM) from Stanford CRFM: an open framework for reproducible, transparent model evaluation and benchmark management.

Inspector

Inspector is a visual testing tool for MCP (Model Context Protocol) servers that helps developers validate and visualize server behavior and responses.

LightEval

A lightweight toolkit from Hugging Face for fast, flexible LLM evaluation across multiple backends.

LiveBench

LiveBench is a contamination-aware, objective LLM benchmark suite that provides reproducible question sets, automatic scoring, and an online leaderboard.

lm-evaluation-harness

The Language Model Evaluation Harness is a framework for large-scale, reproducible evaluation of generative language models across many tasks and datasets.

MS-SWIFT

SWIFT from ModelScope: a scalable, lightweight infrastructure for fine-tuning, evaluating and deploying large and multimodal models, with training, quantization and inference acceleration support.

OpenCompass

A one-stop platform for evaluating large models, providing benchmarks, evaluation toolkits and leaderboards to reproduce and compare model capabilities.

OpenLIT

OpenLIT is an open-source platform for AI engineering that provides LLM observability, prompt management, evaluations and guardrails.

Opik

Opik is an open-source LLM evaluation and observability platform that helps teams build, evaluate and optimize LLM applications.

Promptfoo

Promptfoo is a developer-first, local LLM testing and red-teaming tool for automated evaluations, vulnerability scanning, and CI integration.

Ragas

Ragas is an open-source toolkit for evaluating and optimizing LLM applications, offering objective metrics, test data generation, and production feedback loops.

ReLE Chinese LLM Benchmark

ReLE (chinese-llm-benchmark) is a continuously updated Chinese LLM evaluation and leaderboard project covering education, medical, finance, legal, reasoning and other capability dimensions.

TileLang

TileLang is a domain-specific language for high-performance AI kernels that simplifies writing GPU/CPU/accelerator operators.

Torchtune

A PyTorch-native post-training and fine-tuning toolkit providing reusable recipes, optimizations, and quantization support for LLM training and evaluation.

Tracing, logging, and runtime observability.

Helicone

Helicone is an open-source LLM observability and analytics platform that captures requests, traces and sessions to help developers debug, evaluate and optimize model usage.

OpenLLMetry

An OpenTelemetry-based observability toolkit for LLM/AI applications, providing request tracing and metrics aggregation for diagnostics and monitoring.

Polyaxon

Polyaxon is an MLOps platform for managing, training and monitoring large-scale machine learning workloads.

Prompt management, versioning, and quality tooling.

Code2Prompt

A tool that converts codebases into a single LLM prompt for code analysis, generation, and automation workflows.

PR-Agent

An open-source AI-powered code review and PR assistant that runs locally, in CI, or self-hosted; supports multi-platform integrations and customizable prompts.

Repomix

A tool that packs an entire repository into an AI-friendly file, making it easy to provide structured code context to large models.

Safety filters, guardrails, and risk controls.