Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark or that evaluation metric.

However, these benchmarks often test for general capabilities. For organizations that want to deploy models and large language model-based agents, it is harder to gauge how well the agent or the model actually understands their specific needs.

Model repository Hugging Face launched Yourbench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced Yourbench on X. The feature offers “custom benchmarking and synthetic data generation from ANY of your documents. It’s a big step towards improving how model evaluations work.”

He added that Hugging Face knows “that for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you.”

Creating custom evaluations

Hugging Face said in a paper that Yourbench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings.” 

Organizations need to pre-process their documents before Yourbench can work. This involves three stages:

Document Ingestion to “normalize” file formats.

Semantic Chunking to break down the documents to meet context window limits and focus the model’s attention.

Document Summarization to condense the content so the model keeps global context when generating questions.

Next comes the question-and-answer generation process, which creates questions from information in the documents. This is where users bring in their chosen LLMs to see which one answers those questions best.
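The article does not document Yourbench's actual interfaces, but the workflow it describes — normalize, chunk, summarize, generate questions, then score candidate models — can be illustrated with a rough sketch. Everything below is hypothetical: `call_llm`, the prompts and the "question || answer" format are placeholders for whatever inference API and prompt templates a team actually uses, not Yourbench's real code.

```python
# Minimal sketch of the pipeline the article describes: normalize documents,
# chunk them to fit a context window, summarize them, generate Q&A pairs,
# and score candidate models on those questions. call_llm is a stand-in for
# whatever inference provider a team actually uses; this is NOT Yourbench's API.
from dataclasses import dataclass


def call_llm(model: str, prompt: str) -> str:
    """Placeholder: route `prompt` to the named model and return its reply."""
    raise NotImplementedError("Plug in your inference provider here.")


@dataclass
class QAPair:
    question: str
    reference_answer: str


def ingest(raw_text: str) -> str:
    # "Document ingestion": normalize whitespace from whatever format came in.
    return " ".join(raw_text.split())


def chunk(text: str, max_chars: int = 4000) -> list[str]:
    # "Semantic chunking" stand-in: naive fixed-size splits on sentence boundaries.
    chunks, current = [], ""
    for sentence in text.split(". "):
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence + ". "
    if current:
        chunks.append(current)
    return chunks


def summarize(chunks: list[str], model: str) -> str:
    # "Document summarization": condense each chunk, then join the summaries.
    return "\n".join(call_llm(model, f"Summarize:\n{c}") for c in chunks)


def generate_qa(chunk_text: str, model: str, n: int = 3) -> list[QAPair]:
    # Question-and-answer generation grounded in a single chunk.
    raw = call_llm(
        model,
        f"Write {n} question/answer pairs about the text below, "
        f"one per line, formatted as 'question || answer'.\n{chunk_text}",
    )
    pairs = []
    for line in raw.splitlines():
        if "||" in line:
            q, a = line.split("||", 1)
            pairs.append(QAPair(q.strip(), a.strip()))
    return pairs


def evaluate(candidate_model: str, judge_model: str, qa_pairs: list[QAPair]) -> float:
    # Ask the candidate each question, then have a judge model grade the answer.
    scores = []
    for pair in qa_pairs:
        answer = call_llm(candidate_model, pair.question)
        verdict = call_llm(
            judge_model,
            f"Reference: {pair.reference_answer}\nAnswer: {answer}\n"
            "Reply 1 if the answer matches the reference, else 0.",
        )
        scores.append(1.0 if verdict.strip().startswith("1") else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

In this sketch, a "judge" model grades each candidate's answers against the generated reference answers, which mirrors the general idea of scoring models on document-grounded questions; Yourbench itself automates these stages end to end.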

Hugging Face tested Yourbench with DeepSeek V3 and R1 models, Alibaba’s Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral 3.1 Small, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o-mini, and o3 mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.

Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce tremendous value for very very low costs.”

Compute limitations

However, creating custom LLM benchmarks based on an organization’s documents comes at a cost. Yourbench requires a lot of compute power to work, and Shashidhar said on X that the company is “adding capacity” as fast as it could.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about Yourbench’s compute usage.

Benchmarking is not perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not perfectly capture how the models will behave in day-to-day use.

Some have even voiced skepticism that benchmark tests reveal models’ true limitations, warning they can lead to false conclusions about their safety and performance. A study also cautioned that benchmarking agents could be “misleading.”

However, enterprises cannot avoid evaluating models now that there are so many choices on the market, and technology leaders must justify the rising cost of using AI models. This has led to different methods for testing model performance and reliability.

Google DeepMind introduced FACTS Grounding, which tests a model’s ability to generate factually accurate responses based on information from documents. Researchers from Yale and Tsinghua University developed self-invoking code benchmarks to guide enterprises toward the coding LLMs that work for them.