Here’s Why AI Benchmarks Tell Us Little

Mar 8, 2024

Recently, many companies have launched generative AI tools, and every one of them claims the best performance on the market. Google argued as much at the launch of Gemini, OpenAI makes the same claim for ChatGPT, and so do startups like Anthropic, which recently released its Claude 3 family of models.

Each claims never-before-seen AI performance, each beating the others on one benchmark or another. But have you ever wondered what these AI benchmarks actually are, and what exactly the companies are claiming?

With every new generative AI release, or the announcement of one still to come, companies talk openly about features, applications, performance, and more. But have you ever seen a company explain, in those announcements, what the benchmarks behind its performance claims actually measure?


Companies' AI benchmarks:

All companies, whether startups or tech giants, use benchmarks to measure and tune the performance of their AI. The latest generative tools, especially chatbots like Gemini and ChatGPT, are likewise evaluated against a range of benchmarks.

When companies release their AI tools, they provide detailed information about various benchmarks and how the tools perform on them. "Benchmarks are typically static and narrowly focused on evaluating a single capability, like a model's factuality in a single domain, or its ability to solve mathematical reasoning multiple choice questions," said Dodge.

"Many benchmarks used for evaluation are three-plus years old, from when AI systems were mostly just used for research and didn't have many real users. In addition, people use generative AI in many ways — they're very creative," he added.

One of the most common benchmarks AI companies use is GPQA ("A Graduate-Level Google-Proof Q&A Benchmark"), which Anthropic cited in its release of Claude 3. GPQA consists of Ph.D.-level biology, physics, and chemistry questions, but it captures little of how ordinary people actually use chatbots.
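To make that concrete: scores on a multiple-choice benchmark like GPQA usually boil down to plain accuracy against an answer key. The sketch below shows that calculation in miniature; the sample questions and the ask_model stub are illustrative placeholders, not GPQA's actual data or any vendor's evaluation code.

```python
# Minimal sketch of multiple-choice benchmark scoring: each item has a
# question, answer options, and one correct option; the model's picks are
# compared against the answer key to produce an accuracy score.
# The items and ask_model() below are illustrative, not real GPQA data.

from dataclasses import dataclass

@dataclass
class Item:
    question: str
    options: list[str]  # candidate answers, e.g. four choices
    answer: int         # index of the correct option

def ask_model(item: Item) -> int:
    """Stand-in for a real model call; here it simply guesses the first option."""
    return 0

def accuracy(items: list[Item]) -> float:
    correct = sum(1 for item in items if ask_model(item) == item.answer)
    return correct / len(items)

if __name__ == "__main__":
    items = [
        Item("Which particle mediates the electromagnetic force?",
             ["Photon", "Gluon", "W boson", "Higgs boson"], 0),
        Item("Which organelle carries out oxidative phosphorylation?",
             ["Ribosome", "Mitochondrion", "Golgi apparatus", "Nucleus"], 1),
    ]
    print(f"Accuracy: {accuracy(items):.0%}")
```

A single number like this is easy to report in a launch announcement, which is part of why such benchmarks dominate marketing material.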

Most everyday users turn to these chatbots to draft office emails, applications, cover letters, and other routine writing, tasks that are generally not represented in the benchmarks AI companies rely on.


Not useless either:

That said, it would be wrong to call the benchmarks companies currently use completely useless. They give people and institutions a reliable way to gauge models on complex tasks. But as the AI market grows, these older benchmarks are becoming less and less applicable.

“Older AI systems were often built to solve a particular problem in a context (e.g. medical AI expert systems), making a deeply contextual understanding of what constitutes good performance in that particular context more possible,” said Widder.

"The right path forward, here, is a combination of evaluation benchmarks with human evaluation," Widder said, "prompting a model with a real user query and then hiring a person to rate how good the response is."
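As a rough illustration of the hybrid approach Widder describes, the sketch below runs a model over real user queries and collects a human rating for each response. The generate and collect_rating functions are hypothetical stand-ins for a model API call and a human-rating workflow; they are not part of any named library.

```python
# Rough sketch of benchmark-plus-human evaluation: run the model on real user
# queries, then have a person rate each response. generate() and
# collect_rating() are hypothetical placeholders, not a real API.

def generate(query: str) -> str:
    """Placeholder for a real model call."""
    return f"Draft response to: {query}"

def collect_rating(query: str, response: str) -> int:
    """Placeholder for a human rater's 1-5 score, e.g. gathered via a review UI."""
    print(f"Query: {query}\nResponse: {response}")
    return int(input("Rate this response from 1 (poor) to 5 (excellent): "))

def human_eval(queries: list[str]) -> float:
    ratings = [collect_rating(q, generate(q)) for q in queries]
    return sum(ratings) / len(ratings)

if __name__ == "__main__":
    real_user_queries = [
        "Write a short email asking my manager for Friday off.",
        "Draft a cover letter for an entry-level data analyst role.",
    ]
    print(f"Mean human rating: {human_eval(real_user_queries):.2f}")
```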

Either way, the benchmarks AI companies use could be made far more relevant to everyday users than they are today.
