Training, Evaluation & Optimization

Training frameworks, fine-tuning, evaluation, observability, and optimization workflows.

81 Projects, 8 Subcategories, 49 Tags

Training frameworks and distributed training systems.

AXLearn

An extensible deep learning library built on JAX/XLA, designed for developing, training and deploying large-scale models.

ClearML

ClearML is an open-source MLOps platform providing experiment tracking, data management, pipelines and model serving.

Colossal-AI

An open-source system for efficient large-scale training and inference, featuring advanced parallelism and memory management.

DeepEP

An efficient expert-parallel communication library that provides low-overhead communication primitives for large-scale distributed training.

DeepSpeed

A high-performance library for training and inference that dramatically speeds up large-scale deep learning while reducing cost.

DLRover

DLRover is an automatic distributed deep learning system that provides elastic scheduling, flash checkpointing and auto-scaling to simplify large-scale model training on Kubernetes and Ray.

EasyR1

EasyR1 is an efficient, scalable RL training framework for multimodal models, based on veRL and optimized for large-model training.

Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).

JAX

High-performance Python library for accelerator-oriented array computation and composable program transformations.

LightGBM

A fast, distributed, high-performance gradient boosting framework for decision tree algorithms, widely used for ranking, classification, and large-scale ML tasks.

LitGPT

A high-performance, engineering-focused LLM toolkit that provides end-to-end recipes and practical tutorials for training and deploying large models.

MaxText

A high-performance, highly scalable open-source LLM library and reference implementation built with Python and JAX, targeting Google Cloud TPUs and GPUs.

Megatron-LM

Reference implementation from NVIDIA for large-scale model training and inference with distributed optimizations.

MONAI

An AI toolkit for healthcare imaging focused on deep learning workflows for medical image processing and analysis.

nanoGPT

A minimal, fast repository for training and fine-tuning medium-sized GPT models, suitable for teaching and experiments.

NeMo

NVIDIA's NeMo framework for speech, TTS, multimodal and LLM training & fine-tuning.

NeMo RL

NeMo RL is a scalable post-training reinforcement learning library for large models, supporting high-performance distributed training and multiple backends.

OpenEnv — Agentic Execution Environments

An end-to-end framework for creating, deploying and using isolated execution environments, aimed at agentic RL training and environment development.

PyTorch

An open-source deep learning framework for fast, flexible research and production, featuring dynamic computation graphs and strong GPU acceleration.

PyTorch Lightning

PyTorch Lightning is an open-source framework that streamlines PyTorch training, enabling efficient model development, training, and deployment.

RLinf

RLinf is a flexible and scalable open-source RL infrastructure designed for Embodied and Agentic AI, supporting PPO, GRPO, SAC and more, with seamless scaling to large GPU clusters.

ROLL

Reinforcement Learning Optimization platform for large-scale training and pipelines.

SkyRL

A modular full-stack reinforcement learning (RL) library for large language models (LLMs), designed for long-horizon, real-world tasks.

TensorFlow

Google's open-source end-to-end machine learning platform for building and training deep learning models.

TorchTitan

A PyTorch-native platform for generative model pretraining and distributed optimization.

verl

A reinforcement learning training framework for large models, designed for scalable RLHF and agent training.

xLLM

xLLM is an open-source framework for vision-language models, providing tools and documentation for training and inference.

SFT, RLHF, preference optimization, and alignment.

AReaL

A fully asynchronous reinforcement learning system for large reasoning and agentic models that emphasizes scalability and reproducibility.

Axolotl

A free and open-source LLM post-training and fine-tuning framework that supports multiple models, training methods, and distributed optimizations.

Heretic

Heretic is a fully automated tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration," with a TPE-based parameter optimizer powered by Optuna to automatically find high-quality ablation parameters by co-minimizing refusals and KL divergence from the original model.

LLaMA Factory

A comprehensive framework for fine-tuning LLaMA models with multiple training methods, efficient algorithms, and an easy-to-use interface for both research and production environments.

LMFlow

An extensible, convenient, and efficient toolbox for fine-tuning and inference of large foundation models.

MLX-VLM

A local-first toolkit for inference and fine-tuning of vision-language and omni models using MLX, optimized for macOS and general hardware.

MS-SWIFT

SWIFT from ModelScope: a scalable, lightweight infrastructure for fine-tuning, evaluating and deploying large and multimodal models, with training, quantization and inference acceleration support.

OpenRLHF

An easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM and DeepSpeed, supporting distributed and hybrid-engine training.

PEFT

State-of-the-art parameter-efficient fine-tuning methods for large language models, enabling adapter-based training with minimal GPU resources.

Torchtune

A PyTorch-native post-training and fine-tuning toolkit providing reusable recipes, optimizations, and quantization support for LLM training and evaluation.

Transformer Lab

An open-source app for downloading and fine-tuning large models locally or in the cloud, with multi-engine support.

TRL

TRL is an open-source toolkit from Hugging Face for reinforcement learning training on transformer models.

Tunix

Tunix is a JAX-native post-training library for LLMs providing efficient fine-tuning, RL training, and distillation tools.

Unsloth

High-performance toolkit for fine-tuning and reinforcement learning of large models, with memory-efficient kernels and wide model support.

Experiment tracking, model ops, and training pipelines.

Google Research

Google Research aggregates open-source research code and datasets from Google, covering machine learning, vision, NLP and other research areas.

SkyPilot

SkyPilot is an open-source tool to automate distributed training and inference across cloud and on-premises clusters, simplifying resource provisioning and environment setup.

Slurm

Slurm is an open-source cluster resource management and job scheduling system that is simple, scalable, portable, fault-tolerant, and interconnect agnostic, widely used in high-performance computing and AI training clusters.

SwanLab

SwanLab is an open-source, modern training tracking and visualization tool that supports cloud and self-hosted deployment.

Weights & Biases (W&B)

A machine learning development and observability platform for tracking experiments, managing models and artifacts, and visualizing results across the ML lifecycle.

ZenML

A unified MLOps framework to develop, evaluate and deploy everything from classical models to multi-agent AI systems.

Evaluation frameworks, benchmark suites, and datasets.

Agenta

Agenta is an open-source LLMOps platform that combines prompt management, evaluation, and observability to help teams ship reliable LLM applications faster.

DeepEval

DeepEval is an open-source LLM evaluation framework that provides modular metrics and tooling for testing LLM systems and RAG pipelines.

DeepTeam

An open-source framework for red-teaming large language models and LLM systems, focused on security and robustness evaluation.

Dingo

A tool for automated data quality evaluation that combines rule-based and model-based assessments.

EasyEdit

An easy-to-use knowledge editing framework providing multiple editing methods, evaluation metrics and datasets; supports LLMs and some multimodal editing scenarios.

Evidently

An open-source framework for evaluating, testing, and monitoring ML and LLM systems from experiments to production.

fuck-u-code

A static analysis tool that scores how much of a codebase is a "legacy mess" and generates readable Markdown reports.

Giskard OSS

An open-source evaluation and testing framework to detect performance, bias, and security issues in AI systems.

HELM

Holistic Evaluation of Language Models (HELM) from Stanford CRFM: an open framework for reproducible, transparent model evaluation and benchmark management.

Inspector

Inspector is a visual testing tool for MCP (Model Context Protocol) servers that helps developers validate and visualize server behavior and responses.

Keploy

A developer-centric API and integration testing tool that auto-generates tests and data mocks from real traffic, supporting record-and-replay of API calls, database operations, and streaming events.

LightEval

A lightweight toolkit from Hugging Face for fast, flexible LLM evaluation across multiple backends.

LiveBench

LiveBench is a contamination-aware, objective LLM benchmark suite that provides reproducible question sets, automatic scoring, and an online leaderboard.

lm-evaluation-harness

The Language Model Evaluation Harness is a framework for large-scale, reproducible evaluation of generative language models across many tasks and datasets.

OpenCompass

A one-stop platform for evaluating large models, providing benchmarks, evaluation toolkits and leaderboards to reproduce and compare model capabilities.

OpenLIT

OpenLIT is an open-source platform for AI engineering that provides LLM observability, prompt management, evaluations and guardrails.

Opik

Opik is an open-source LLM evaluation and observability platform that helps teams build, evaluate and optimize LLM applications.

Petri

Petri is an alignment auditing agent designed to quickly explore alignment hypotheses and help researchers automate evaluation workflows.

Promptfoo

Promptfoo is a developer-first, local LLM testing and red-teaming tool for automated evaluations, vulnerability scanning, and CI integration.

Ragas

Ragas is an open-source toolkit for evaluating and optimizing LLM applications, offering objective metrics, test data generation, and production feedback loops.

ReLE Chinese LLM Benchmark

ReLE (chinese-llm-benchmark) is a continuously updated Chinese LLM evaluation and leaderboard project covering education, medical, finance, legal, reasoning and other capability dimensions.

Shapash

Generates interactive visual reports to explain machine learning model predictions for stakeholders.

Tracing, logging, and runtime observability.

Helicone

Helicone is an open-source LLM observability and analytics platform that captures requests, traces and sessions to help developers debug, evaluate and optimize model usage.

Langfuse

Langfuse is an open-source platform for LLM development that supports collaboration, monitoring, and debugging of AI applications.

OpenLLMetry

An OpenTelemetry-inspired observability toolkit for LLM/AI, providing request tracing and metrics aggregation for diagnostics and monitoring.

Phoenix

Phoenix (from Arize AI) is an open-source AI observability platform for tracing, evaluating, and troubleshooting LLM applications.

Polyaxon

Polyaxon is an MLOps platform for managing, training and monitoring large-scale machine learning workloads.

Prompt management, versioning, and quality tooling.

Code2Prompt

A tool that converts codebases into a single LLM prompt for code analysis, generation, and automation workflows.

GuideLLM

GuideLLM offers tooling for guiding, interpreting, and controlling large language models (LLMs), enabling better controllability in interactive applications.

PR-Agent

An open-source AI-powered code review and PR assistant that runs locally, in CI, or self-hosted; supports multi-platform integrations and customizable prompts.

Safety filters, guardrails, and risk controls.

Compiler optimization, autotuning, and simulation.

Newton

A GPU-accelerated physics simulation engine built on NVIDIA Warp, targeting robotics and simulation research.
