2026.1.5

Why We Built VIBE Bench: Rethinking Evaluation for Real Workloads

A full-stack application benchmark focused on real user experience


VIBE Bench: A Full-Stack Application Evaluation Benchmark for Real User Experience

To measure a model’s full-stack capability to build complete, runnable applications from zero to one, MiniMax introduces a new benchmark: VIBE (Visual & Interactive Benchmark for Execution).
Unlike traditional benchmarks, VIBE automatically evaluates the interaction logic and visual presentation of generated applications in a real execution environment, providing a more faithful assessment of real user experience.

Background & Motivation

When we evaluate large language models today, most widely used benchmarks, such as SWE-bench and Terminal-bench, focus on static code correctness or command-line–level task completion.
These benchmarks have been extremely valuable for measuring coding ability. But they also rely on an implicit assumption:
If the generated code is logically correct and passes tests, the task is considered complete.

In real-world usage, that assumption often falls short.
What users actually care about is whether the application works end to end when they open and use it.
In other words, code that runs is not the same as a product that users can actually use.

Despite this gap, there has been no benchmark that systematically evaluates whether model-generated applications truly deliver a usable end-to-end experience.
This is the motivation behind VIBE, a full-stack evaluation benchmark designed around real user experience.

VIBE Bench Overview

VIBE is built to evaluate the entire lifecycle of an application, focusing on how model-generated apps perform in real execution environments.
Unlike existing benchmarks that mainly target Web or backend development, VIBE deliberately includes several critical but often overlooked technical domains.
To reflect the diversity of real-world development, the VIBE dataset is organized into subsets by technology stack.

Core Approach: Agent-as-a-Verifier (AaaV)

At the core of VIBE is a new verification paradigm we call Agent-as-a-Verifier (AaaV).
VIBE uses a vision-enabled agent to act as an automated QA tester, rather than relying on hand-written rules or static tests. This agent interacts with model-generated applications directly, observing both their behavior and visual output.
Running inside a sandboxed environment, the agent performs end-to-end evaluation of each application, from launch and functional interaction to layout, rendering, and overall visual presentation.
By shifting verification from predefined rules to agent-driven interaction, AaaV allows VIBE to evaluate applications in a way that more closely mirrors how real users experience software.
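
To make the paradigm concrete, here is a minimal sketch of an agent-as-a-verifier control loop. All names (`Verdict`, `verify`, the callables) are hypothetical illustrations, not VIBE's actual harness, which this post does not describe in code.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Agent-as-a-Verifier loop -- illustrative only.

@dataclass
class Verdict:
    launched: bool = False
    interactions_passed: int = 0
    interactions_total: int = 0
    visual_notes: list = field(default_factory=list)

def verify(launch, observe, act_and_check, steps, judge_visuals):
    """Launch the app, drive it step by step, then judge its visuals."""
    verdict = Verdict()
    app = launch()                      # execution layer: can it start at all?
    if app is None:
        return verdict                  # failed before any interaction
    verdict.launched = True
    for step in steps:                  # interaction layer
        verdict.interactions_total += 1
        screenshot = observe(app)       # vision-enabled: act on what is seen
        if act_and_check(app, screenshot, step):
            verdict.interactions_passed += 1
    verdict.visual_notes = judge_visuals(observe(app))  # visual layer
    return verdict
```

In a real harness, `launch` would start the app in a sandbox, `observe` would capture a screenshot, and `act_and_check`/`judge_visuals` would be backed by a vision-language model; here they are plain callables so the control flow itself is visible.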

The Three Evaluation Layers of VIBE Bench (1)

Execution Layer

The first question VIBE asks is a very basic one:
Can the application actually survive?
At the execution layer, we check whether the generated app can make it through the most fundamental hurdles: launching without crashing and staying alive long enough to be used.
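
As a rough illustration of this kind of survival check (a simplified sketch, not VIBE's implementation), one can start the app's process and see whether it is still alive after a grace period:

```python
import subprocess
import sys
import time

def survives_launch(cmd, grace_seconds=1.0):
    """Start a process and report whether it survives its launch window.

    Still running after the grace period (e.g. a server) counts as surviving,
    as does a clean exit; a nonzero early exit is treated as a crash.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    time.sleep(grace_seconds)
    code = proc.poll()
    if code is None:            # still alive: survived launch; clean up
        proc.terminate()
        proc.wait()
        return True
    return code == 0            # exited within the window: clean exit only

# A long-running "app" survives; one that crashes on startup does not.
# survives_launch([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)
# survives_launch([sys.executable, "-c", "raise SystemExit(1)"], 0.5)
```

A real execution layer would layer more on top (dependency install, build step, port/health checks), but "does the process stay up" is the first gate everything else depends on.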

The Three Evaluation Layers of VIBE Bench (2)

Interaction Layer

Validates whether core functions are actually usable.
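
At this layer, the verifying agent effectively walks the app through a scripted user flow and checks each observable effect. A toy version of that idea (the `TodoApp` stand-in and `run_flow` helper are invented for illustration):

```python
# Interaction-layer sketch: drive an app through a scripted user flow
# and verify each observable effect. Illustrative only, not VIBE's harness.

class TodoApp:
    """Toy stand-in for a generated application under test."""
    def __init__(self):
        self.items = []
    def add(self, text):
        self.items.append({"text": text, "done": False})
    def toggle(self, index):
        self.items[index]["done"] = not self.items[index]["done"]

def run_flow(app, flow):
    """Execute (action, check) pairs; return (checks passed, total)."""
    passed = 0
    for action, check in flow:
        action(app)
        if check(app):
            passed += 1
    return passed, len(flow)

flow = [
    (lambda a: a.add("write report"), lambda a: len(a.items) == 1),
    (lambda a: a.toggle(0),           lambda a: a.items[0]["done"]),
]
result = run_flow(TodoApp(), flow)  # -> (2, 2)
```

In VIBE the "actions" are real UI interactions performed by the vision-enabled agent rather than direct method calls, but the pass/fail accounting per step is the same shape.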

The Three Evaluation Layers of VIBE Bench (3)

Visual & Aesthetics Layer

Validates whether the interface has production-grade presentation.
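
Judgments at this layer still need to collapse into a comparable number. One simple way (hypothetical; the criteria and weights below are not VIBE's actual rubric) is a weighted mean over agent-rated criteria:

```python
# Hypothetical visual rubric: an agent rates each criterion in [0, 1],
# and a weighted mean yields one comparable score. Weights are invented.

WEIGHTS = {"layout": 0.4, "rendering": 0.3, "polish": 0.3}

def visual_score(ratings, weights=WEIGHTS):
    """Weighted mean of per-criterion ratings, each in [0, 1]."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"unrated criteria: {sorted(missing)}")
    return sum(weights[k] * ratings[k] for k in weights)

score = visual_score({"layout": 1.0, "rendering": 0.5, "polish": 0.5})  # ~0.7
```

The interesting design question is less the arithmetic than the rubric itself: an agent-as-a-verifier can rate subjective qualities like polish that rule-based checks cannot express at all.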

From "Is the code correct?" to "Is the application usable and deliverable?"

VIBE Bench reflects a critical phase transition in the evolution of model capabilities: from writing correct code to shipping usable applications.
It provides a unified, scalable standard for evaluating and training full-stack generative models oriented toward real-world scenarios, advancing model capabilities from "technical correctness" to "practical deployment value."
https://huggingface.co/datasets/MiniMaxAI/VIBE/blob/main/README.md?code=true