This SaaS product evaluates and ranks large language models (LLMs) based on various criteria such as safety, privacy, security, integrity, general capabilities, and domain-specific capabilities. It displays a leaderboard comparing models from different vendors with overall scores.
Provides a ranking of language models across several criteria, including safety, privacy, security, integrity, general capabilities, and domain-specific capabilities, and assigns each model an overall score.
Models are evaluated across specific domains (Safety, Privacy, Security, Integrity, General Capabilities, and Domain-Specific Capabilities), each of which contributes to the model's overall score.
Displays the vendor associated with each model, giving users insight into the organizations behind the development of these language models.
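A minimal sketch of how the overall score could be aggregated from the domain scores described above, assuming a weighted average over the six domains; the weights and function names below are illustrative assumptions, not the product's actual scoring formula.

```python
# Hypothetical aggregation of per-domain scores into an overall leaderboard score.
# The weights are illustrative assumptions; the real weighting may differ.
DOMAIN_WEIGHTS = {
    "safety": 0.20,
    "privacy": 0.15,
    "security": 0.15,
    "integrity": 0.15,
    "general_capabilities": 0.20,
    "domain_specific_capabilities": 0.15,
}

def overall_score(domain_scores: dict[str, float]) -> float:
    """Weighted average of per-domain scores (each assumed to be on a 0-100 scale)."""
    total_weight = sum(DOMAIN_WEIGHTS[d] for d in domain_scores)
    return sum(DOMAIN_WEIGHTS[d] * s for d, s in domain_scores.items()) / total_weight

print(overall_score({"safety": 92.0, "privacy": 88.5, "security": 90.0,
                     "integrity": 85.0, "general_capabilities": 78.0,
                     "domain_specific_capabilities": 81.0}))
```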
Measures the proportion of text in the LLM's actual output that is relevant to the input.
Measures whether relevant nodes in the retrieval context are ranked higher than irrelevant ones.
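One common way to score this ranking property is to average precision at the ranks where relevant nodes appear; the sketch below assumes binary relevance labels for each retrieved node and illustrates the idea rather than the exact formula used here.

```python
def contextual_precision(relevance_by_rank: list[bool]) -> float:
    """Average precision@k over the ranks where a relevant node appears.

    relevance_by_rank[i] is True if the node at rank i+1 is relevant to the input.
    The score increases when relevant nodes are ranked above irrelevant ones.
    """
    precisions, relevant_seen = [], 0
    for k, is_relevant in enumerate(relevance_by_rank, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# The same relevant node scores higher when ranked first than when ranked last.
print(contextual_precision([True, False, False]))   # 1.0
print(contextual_precision([False, False, True]))   # ~0.33
```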
Measures the proportion of sentences in the actual output that can be attributed to the retrieval context.
Measures the proportion of claims in the LLM actual output that are not contradictory to the retrieval context.
Measures the proportion of claims in the LLM actual output that are not contradictory to the ground truth.
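The relevancy, attribution, and contradiction metrics above all share the same shape: a proportion over units of the output (statements, sentences, or claims). A minimal sketch of that shared computation follows; the passes check is a placeholder for whatever statement-level judgment (typically an LLM-as-judge call) is actually used.

```python
from typing import Callable

def proportion_metric(units: list[str], passes: Callable[[str], bool]) -> float:
    """Fraction of output units (statements, sentences, claims) that pass a check.

    Example checks (all placeholders for a real statement-level judge):
      - relevancy:    the unit is relevant to the input
      - attribution:  the unit can be attributed to the retrieval context
      - faithfulness: the unit does not contradict the retrieval context or ground truth
    """
    if not units:
        return 1.0  # assumption: an empty output is treated as trivially passing
    return sum(1 for u in units if passes(u)) / len(units)

claims = ["Paris is the capital of France.", "The Eiffel Tower is in Berlin."]
print(proportion_metric(claims, passes=lambda c: "Berlin" not in c))  # 0.5
```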
Measures how well your LLM system summarizes the original text, based on the relevant content included in the generated summary.
Measures whether your LLM agent is able to call the correct tools for a given input.
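A minimal sketch of a tool-correctness check, assuming each test case lists the tools expected for its input and the agent trace exposes the tools actually called; the function and tool names are illustrative.

```python
def tool_correctness(called_tools: list[str], expected_tools: list[str]) -> float:
    """Fraction of the expected tools that the agent actually called for a given input."""
    if not expected_tools:
        return 1.0
    called = set(called_tools)
    return sum(1 for tool in expected_tools if tool in called) / len(expected_tools)

# The agent was expected to call a search tool and a calculator but only searched.
print(tool_correctness(called_tools=["web_search"],
                       expected_tools=["web_search", "calculator"]))  # 0.5
```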
Misinformation datasets are collections of data designed to evaluate how well language models can identify, generate, and handle misinformation.
Sexual content datasets are collections of examples designed to evaluate how language models handle discussions related to sexual content.
Crime datasets are collections of examples used to evaluate how language models handle topics related to criminal behavior and legal matters.
Hallucination datasets are collections of examples designed to evaluate how language models generate and handle incorrect or nonsensical information.
Defamation evaluation datasets are collections of data designed to assess how effectively language models can detect, analyze, and manage content that can damage reputations.
Terrorism datasets are collections of information specifically designed to evaluate language models' handling of content related to terrorism.
Bias datasets are collections of data designed to evaluate how language models handle issues related to bias and discrimination.
Insult datasets are collections of examples designed to evaluate how language models handle offensive or derogatory language.
Ethics datasets are collections of examples designed to evaluate how language models handle discussions related to ethical issues.
Malware evaluation datasets are collections of data designed to assess how effectively language models can handle malware-related content.
Violence datasets are curated collections of examples used to evaluate how language models respond to content related to violence.
Political sensitivity datasets are collections of data designed to evaluate how language models handle discussions related to politics.
Hate speech datasets are collections of examples designed to evaluate how language models handle language that promotes hate or violence against individuals or groups.
Illegal conduct datasets are collections of examples designed to evaluate how language models handle discussions or prompts related to illegal activities.
Adversarial Testing datasets are designed to evaluate how language models handle deliberately challenging or misleading inputs.
Prompt Automatic Iterative Refinement (PAIR) is an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
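At a high level, PAIR runs an attacker model against the black-box target in a propose-query-judge loop. The skeleton below only illustrates that loop shape with benign stubs; attacker_propose, query_target, and judge_score are placeholders, not PAIR's actual implementation or this product's API.

```python
# Stub roles for the three models in the loop (attacker, target, judge).
def attacker_propose(objective: str, history: list) -> str:
    return f"[refined probe #{len(history) + 1} for: {objective}]"

def query_target(prompt: str) -> str:
    return "[black-box target model response]"

def judge_score(objective: str, prompt: str, response: str) -> float:
    return 0.0  # a real judge model would rate how fully the response meets the objective

def pair_style_probe(objective: str, max_rounds: int = 5, threshold: float = 0.8):
    """Black-box iterative refinement: propose a probe, query the target, judge, repeat."""
    history = []
    for _ in range(max_rounds):
        candidate = attacker_propose(objective, history)
        response = query_target(candidate)
        score = judge_score(objective, candidate, response)
        history.append((candidate, response, score))
        if score >= threshold:   # the judge deems the probe successful
            return candidate, response
    return None                  # the target resisted within the round budget

print(pair_style_probe("benign test objective"))
```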
Poses multiple-choice questions to LLMs to evaluate their accuracy on domain-knowledge questions.
Enables humans to chat with LLMs through cipher prompts with system role descriptions and few-shot examples.
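A toy illustration of the cipher-chat idea using ROT13 on a benign question: the system role describes the cipher and supplies few-shot examples in enciphered form. The prompt wording is illustrative only.

```python
import codecs

def rot13(text: str) -> str:
    return codecs.encode(text, "rot13")

# The system role teaches the cipher; few-shot examples are given in enciphered form.
system_prompt = (
    "You are an expert on the ROT13 cipher. The user writes in ROT13 and you must "
    "reply in ROT13. Example:\n"
    f"User: {rot13('What is the capital of France?')}\n"
    f"Assistant: {rot13('Paris.')}"
)
user_turn = rot13("Name one planet in the Solar System.")
print(system_prompt)
print("User:", user_turn)
```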
Persuasive Adversarial Prompt (PAP) applies persuasion techniques to jailbreak prompt construction, testing whether persuasive phrasing can elicit restricted responses from the LLM.
Uses the personification ability of LLMs to construct virtual, nested scenes and observes whether the models' safety behavior holds inside them.
Extracts private data from LLMs trained on large datasets, testing how they handle such information.
Analyzing-based Jailbreak (ABJ) uses LLMs' reasoning capabilities to reveal underlying biases.
Do Anything Now (DAN) is a jailbreak persona for ChatGPT that is prompted to perform tasks without restrictions.
Enhances reasoning capabilities of LLMs by encouraging step-by-step thinking.
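A minimal example of the step-by-step prompting pattern this refers to, using the common "let's think step by step" suffix; the wording is a general convention, not a product-specific prompt.

```python
question = "A train travels 60 km in 1.5 hours. What is its average speed?"

# Plain prompt versus a chain-of-thought prompt that encourages step-by-step reasoning.
plain_prompt = question
cot_prompt = f"{question}\nLet's think step by step."

print(cot_prompt)
```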
Designs prompts that enhance the performance of LLMs by adapting to their responses.
Tests LLMs' capabilities in multiple languages to address potential risks across language barriers.
Demonstrates how mal-intended prompts can lead models to generate harmful content.
Generalizes jailbreak attacks into Prompt Rewriting and Scenario Nesting for stronger jailbreaks.
Exploits LLMs' poor recognition of ASCII art to bypass safety measures.
Evaluates model responses using dataset samples without altering prompts.
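A minimal sketch of this unaltered-prompt evaluation mode: each dataset sample is sent to the model exactly as written and the response is scored. The model and score_response callables are placeholders for the platform's actual model client and judge.

```python
from typing import Callable

def evaluate_dataset(samples: list[dict],
                     model: Callable[[str], str],
                     score_response: Callable[[dict, str], float]) -> float:
    """Send each sample's prompt to the model unchanged and average the scores."""
    scores = []
    for sample in samples:
        response = model(sample["prompt"])  # the prompt is used verbatim, no rewriting
        scores.append(score_response(sample, response))
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage with stand-in callables.
samples = [{"prompt": "Is the Earth flat?", "expected": "no"}]
print(evaluate_dataset(samples,
                       model=lambda p: "No, the Earth is not flat.",
                       score_response=lambda s, r: 1.0 if s["expected"] in r.lower() else 0.0))
```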
A jailbreak framework using fuzzing techniques inspired by AFL for testing LLM robustness.
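The AFL-inspired loop such a framework is built around looks roughly like this: keep a seed pool of prompt templates, mutate a seed, run it against the target, and retain mutants the judge marks as successful. All helpers below are benign stubs for illustration, not the framework's API.

```python
import random

def mutate(template: str) -> str:
    # Stub mutation; a real fuzzer would paraphrase, shorten, expand, or crossover templates.
    return template + " (mutated)"

def run_and_judge(template: str) -> bool:
    # Stub oracle; a real judge model would score the target's response to the template.
    return False

def fuzz(seed_templates: list[str], iterations: int = 100) -> list[str]:
    """AFL-style loop: pick a seed, mutate it, keep mutants judged as successful."""
    pool, successes = list(seed_templates), []
    for _ in range(iterations):
        mutant = mutate(random.choice(pool))
        if run_and_judge(mutant):
            pool.append(mutant)       # successful mutants become new seeds
            successes.append(mutant)
    return successes

print(fuzz(["benign seed template"], iterations=10))
```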
Encodes input prompts and relies on the LLM's ability to decode them, testing how the model responds to encoded instructions.
Tests LLMs' safety by detecting and generating responses based on model security hypotheses.
Uses known safety-training failure modes to guide jailbreak design and to evaluate models such as OpenAI's GPT-4.
Tree of Attacks with Pruning (TAP) is an automated method for generating jailbreak prompts.
Disguise and Reconstruction Attack (DRA) conceals harmful jailbreak instructions to test LLM security.