2026.1.5

Why We Built VIBE Bench: Rethinking Evaluation for Real Workloads

A full-stack application benchmark focused on real user experience


VIBE Bench: A Full-Stack Application Evaluation Benchmark for Real User Experience

To measure a model’s full-stack capability to build complete, runnable applications from zero to one, MiniMax introduces a new benchmark: VIBE (Visual & Interactive Benchmark for Execution).
Unlike traditional benchmarks, VIBE automatically evaluates the interaction logic and visual presentation of generated applications in a real execution environment, providing a more faithful assessment of real user experience.

Background & Motivation

When we evaluate large language models today, most widely used benchmarks, such as SWE-bench and Terminal-bench, focus on static code correctness or command-line–level task completion.
These benchmarks have been extremely valuable for measuring coding ability. But they also rely on an implicit assumption:
If the generated code is logically correct and passes tests, the task is considered complete.

In real-world usage, that assumption often falls short.
What users actually care about is whether:
  • the application can be successfully built and launched;
  • core features work through real user interaction;
  • interactions behave as expected;
  • the interface looks modern, polished, and professional.
In other words, code that runs is not the same as a product that users can actually use.

Despite this gap, there has been no benchmark that systematically evaluates whether model-generated applications truly deliver a usable end-to-end experience.
This is the motivation behind VIBE, a full-stack evaluation benchmark designed around real user experience.

VIBE Bench Overview

VIBE is built to evaluate the entire lifecycle of an application, focusing on how model-generated apps perform in real execution environments.
Unlike existing benchmarks that mainly target Web or backend development, VIBE deliberately includes several critical but often overlooked technical domains, such as:
  • Native Android development (Kotlin / Java)
  • Native iOS development (Swift / Objective-C)
  • High-fidelity scientific simulation, where both precise computation and realistic rendering matter

To reflect the diversity of real-world development, the VIBE dataset is organized into the following subsets by technology stack (a loading sketch follows the list):
  • Web: Frontend applications that demand strong visual design and complex DOM interactions
  • Simulation: Scientific simulation tasks requiring high-fidelity rendering and accurate numerical computation
  • Android: Native Android application development (Kotlin / Java)
  • iOS: Native iOS application development (Swift / Objective-C)
  • Backend: Server-side systems that emphasize API completeness and overall system architecture
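
For readers who want to browse the tasks, the subsets are hosted on Hugging Face (linked at the end of this post). The snippet below is only a minimal loading sketch: the subset and split names are assumptions for illustration, not the dataset's confirmed schema, so consult the dataset card for the actual layout.

```python
# Minimal sketch: loading VIBE subsets with the Hugging Face `datasets`
# library. Subset and split names are assumptions; check the dataset card.
from datasets import load_dataset

for subset in ["web", "simulation", "android", "ios", "backend"]:
    ds = load_dataset("MiniMaxAI/VIBE", subset, split="test")
    print(f"{subset}: {len(ds)} tasks")
```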

Core Approach: Agent-as-a-Verifier (AaaV)

At the core of VIBE is a new verification paradigm we call Agent-as-a-Verifier (AaaV).
Rather than relying on hand-written rules or static tests, VIBE uses a vision-enabled agent as an automated QA tester. The agent interacts with model-generated applications directly, observing both their behavior and their visual output.
Running inside a sandboxed environment (coming soon), the agent performs end-to-end evaluation of each application, from launch and functional interaction to layout, rendering, and overall visual presentation.
By shifting verification from predefined rules to agent-driven interaction, AaaV allows VIBE to evaluate applications in a way that more closely mirrors how real users experience software.
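
As a rough illustration of the AaaV idea, the loop below shows the shape such a verifier can take: launch the app, let a vision-enabled agent observe screenshots and issue UI actions, and stop once the agent can render a verdict. Every name here (Sandbox, agent, Verdict, action.kind) is a hypothetical stand-in; this is a sketch of the paradigm, not VIBE's actual harness.

```python
# Illustrative Agent-as-a-Verifier loop. The sandbox and agent objects
# are hypothetical stand-ins, not VIBE's real interfaces.
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    notes: str

def verify(app_dir: str, task_spec: str, agent, sandbox, max_steps: int = 20) -> Verdict:
    """Run the generated app and let a vision-enabled agent probe it."""
    sandbox.launch(app_dir)                    # build and start the app
    for _ in range(max_steps):
        screenshot = sandbox.screenshot()      # agent sees pixels, not rules
        action = agent.decide(task_spec, screenshot)
        if action.kind == "verdict":           # agent is ready to judge
            return Verdict(action.passed, action.notes)
        sandbox.execute(action)                # click, type, scroll, ...
    return Verdict(False, "no verdict reached within the step budget")
```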

The Three Evaluation Layers of VIBE Bench (1)

Execution Layer

The first question VIBE asks is a very basic one:
Can the application actually survive?
At the execution layer, we check whether the generated app can make it through the most fundamental hurdles (a minimal check of this shape is sketched after the list):
  • Does the project compile successfully?
  • Can the application be built without errors?
  • Does it launch and run without crashing?
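
Concretely, for a Web task this layer can be a plain build-and-probe script. The sketch below assumes an npm project serving on a local port; the commands, port, and timings are illustrative choices, not fixed parts of the benchmark.

```python
# Sketch of an execution-layer check for an npm-based web project.
# Commands, port, and timings are illustrative assumptions.
import subprocess
import time
import urllib.request

def execution_check(project_dir: str, url: str = "http://localhost:3000") -> bool:
    # 1. Does the project build without errors?
    build = subprocess.run(["npm", "run", "build"], cwd=project_dir)
    if build.returncode != 0:
        return False
    # 2. Does it launch and stay alive?
    server = subprocess.Popen(["npm", "start"], cwd=project_dir)
    try:
        time.sleep(10)                     # crude wait for startup
        if server.poll() is not None:      # process already exited
            return False
        # 3. Does it answer a basic request without crashing?
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False
    finally:
        server.terminate()
```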

The Three Evaluation Layers of VIBE Bench (2)

Interaction Layer

The interaction layer validates whether core functions are actually "usable" (a browser-driven sketch follows the list):
  • Whether interactions are responsive
  • Whether business workflows can be completed end-to-end
  • Whether key functions align with user intent
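
For the Web subset, such a check can be expressed as a scripted user journey. The Playwright sketch below drives a hypothetical todo app through one core workflow; the URL, selectors, and workflow are invented for illustration and are not part of the benchmark's specification.

```python
# Sketch of an interaction-layer check with Playwright. The URL,
# selectors, and todo-app workflow are invented for illustration.
from playwright.sync_api import sync_playwright

def interaction_check(url: str = "http://localhost:3000") -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Core workflow: create a todo item, then verify it renders.
        page.fill("#new-todo", "buy milk")
        page.press("#new-todo", "Enter")
        ok = page.locator(".todo-item", has_text="buy milk").count() == 1
        browser.close()
        return ok
```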

The Three Evaluation Layers of VIBE Bench (3)

Visual & Aesthetics Layer

The visual and aesthetics layer validates whether the interface has "production-grade presentation" (a rubric-scoring sketch follows the list):
  • Whether the layout is reasonable and professional
  • Whether the visual hierarchy is clear
  • Whether the color scheme is harmonious
  • Whether the overall style complies with modern UI design standards
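
Aesthetic criteria like these are hard to encode as rules, so a natural implementation is to have a vision-language model grade screenshots against a rubric. The sketch below uses the OpenAI chat API as a stand-in judge; the model name and rubric wording are illustrative assumptions, not VIBE's published configuration.

```python
# Sketch of rubric-based visual scoring with a vision-language model.
# The judge model and rubric text are assumptions, not VIBE's config.
import base64
from openai import OpenAI

RUBRIC = (
    "Rate this app screenshot from 1 to 10 on each axis, with a brief "
    "justification: layout professionalism, visual hierarchy, "
    "color harmony, and adherence to modern UI design conventions."
)

def visual_score(screenshot_path: str) -> str:
    client = OpenAI()
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable judge model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": RUBRIC},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```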

From "Is the code correct?" to "Is the application usable and deliverable?"

VIBE Bench reflects a critical phase transition in the evolution of model capabilities: it provides a unified, scalable standard for evaluating and training full-stack generative models for real-world scenarios, advancing model capability from "technical correctness" to "practical deployment value."
Dataset: https://huggingface.co/datasets/MiniMaxAI/VIBE/blob/main/README.md?code=true
© 2026 MiniMax