DeepEP is a communication library optimized for Mixture-of-Experts (MoE) and expert parallelism (EP), providing high-throughput and low-latency all-to-all GPU communication primitives for large-scale training and inference.
Provides high-throughput all-to-all GPU kernels for efficient Mixture-of-Experts dispatch and combine operations.
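The dispatch/combine pattern can be illustrated with a toy single-process sketch in plain Python. This is not DeepEP's API or kernel logic; the expert transform, shapes, and routing here are invented for illustration only:

```python
import random

random.seed(0)
num_tokens, hidden, num_experts, top_k = 6, 4, 3, 2

# Toy token activations and top-k routing decisions with normalized gate weights.
x = [[random.random() for _ in range(hidden)] for _ in range(num_tokens)]
topk_idx = [random.sample(range(num_experts), top_k) for _ in range(num_tokens)]
raw_w = [[random.random() for _ in range(top_k)] for _ in range(num_tokens)]
topk_w = [[w / sum(ws) for w in ws] for ws in raw_w]

# Dispatch: send a copy of each token to every expert it is routed to.
recv = {e: [] for e in range(num_experts)}  # expert -> list of (src_token, vector)
for t in range(num_tokens):
    for e in topk_idx[t]:
        recv[e].append((t, x[t]))

# Each expert applies its own transform (a simple per-expert scale here).
expert_out = {e: [(t, [(e + 1) * v for v in vec]) for t, vec in pairs]
              for e, pairs in recv.items()}

# Combine: weighted-sum each token's expert outputs back into source order.
y = [[0.0] * hidden for _ in range(num_tokens)]
for e, pairs in expert_out.items():
    for t, vec in pairs:
        w = topk_w[t][topk_idx[t].index(e)]
        for j in range(hidden):
            y[t][j] += w * vec[j]
```

In the real library these two steps become all-to-all exchanges across GPUs, with the receive buckets living on whichever rank owns each expert.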
Includes specialized low-latency kernels for latency-sensitive inference decoding, minimizing delays through pure RDMA transfers.
Supports FP8 and other low-precision formats to reduce communication volume and increase efficiency with minimal loss of model quality.
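The core idea behind low-precision communication is to transmit small quantized values plus a scale, then dequantize on the receiver. A minimal sketch, using per-row symmetric int8-style quantization as a stand-in (this is not DeepEP's FP8 format or layout):

```python
def quantize(row, bits=8):
    """Per-row symmetric quantization: pick a scale so the largest
    magnitude in the row maps to the largest representable integer."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / qmax
    q = [round(v / scale) for v in row]   # small integers, ~1/4 the bytes of fp32
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values on the receiving side."""
    return [v * scale for v in q]

row = [0.5, -1.25, 3.0, 0.0]
q, scale = quantize(row)
restored = dequantize(q, scale)
```

Sending `q` plus one scale per row cuts bytes on the wire substantially, and the rounding error is bounded by half a quantization step.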
Introduces a hook-based method for overlapping communication and computation that occupies no Streaming Multiprocessor (SM) resources.
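The hook pattern can be sketched in a few lines of Python: start an asynchronous transfer, get back a hook, do useful computation, and call the hook only when the data is actually needed. A background thread stands in for the transfer engine here; DeepEP implements this on the GPU without consuming SMs, so everything below is illustrative:

```python
import threading
import time

def async_dispatch(payload, mailbox):
    """Start a background transfer (a stand-in for an RDMA engine that
    runs without compute cores) and return a hook that waits for it."""
    done = threading.Event()

    def transfer():
        time.sleep(0.05)          # simulated network latency
        mailbox.append(payload)
        done.set()

    threading.Thread(target=transfer, daemon=True).start()
    return done.wait              # the hook: call it when the data is needed

mailbox = []
hook = async_dispatch({"tokens": [1, 2, 3]}, mailbox)

# Overlap: do useful computation while the transfer is in flight.
partial = sum(i * i for i in range(1000))

hook()                            # block only at the point the data is required
received = mailbox[0]
```

The design point is that the caller, not the library, chooses where to place the synchronization, so compute can fill the entire communication window.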
Supports configurations combining NVLink for intranode communication with RDMA for internode communication, enabling deployment across a wide range of cluster topologies.