Why AI Metrics Are Misleading

What is a benchmark? In the world of AI, we often find it helpful to talk about how well a model performs a task. You might, for example, hear a friend gush over how GPT-4 is so much better than GPT-3 because its answers are so much more accurate and coherent. To assess model performance more rigorously, researchers have devised a more systematic approach called “benchmarking”. Benchmarks help us evaluate how good our AI models are, for example:...
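
As a rough, illustrative sketch of what a benchmark boils down to: a fixed set of questions with reference answers, a model under test, and a score aggregated over the whole set. The `ask_model` stub, the toy questions, and the exact-match metric below are hypothetical placeholders, not the setup used in the post.

```python
# Minimal sketch of a benchmark: questions with known answers, a model under
# test, and a score aggregated over the whole set.

def ask_model(question: str) -> str:
    """Placeholder for a call to the model under evaluation (hypothetical)."""
    return "Paris" if "France" in question else "unknown"

# A tiny, made-up benchmark: (question, reference answer) pairs.
BENCHMARK = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

def run_benchmark(items):
    correct = 0
    for question, reference in items:
        prediction = ask_model(question)
        # Exact-match scoring: the simplest possible metric.
        if prediction.strip().lower() == reference.strip().lower():
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    print(f"Accuracy: {run_benchmark(BENCHMARK):.0%}")
```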

September 12, 2024 · Khiem, Zed, Nam, NotAd, Tost, Jord Nguyen

Can Language Models Determine Where You Live with Just a Single Photo?

TL;DR Members of HAISN participated in a research sprint assessing AI models’ ability to infer location from single images. They found that, depending on what clues an image contains, GPT-4o can land within a few hundred meters of where you are, or it can be a few continents off. Introduction We all know that AI models are surprisingly good at conversing, summarizing essays, writing code [7][9][13], passing the Turing test [10] and explaining complex topics in the style of Jerry Seinfeld [15]....

August 28, 2024 · Le 'Qronox' Lam, Aleksandr Popov, Jord Nguyen, Trung Dung 'mogu' Hoang, Marcel M, Felix Michalak

rAInboltBench: Benchmarking user location inference through single images

Abstract This paper introduces rAInboltBench, a comprehensive benchmark designed to evaluate the capability of multimodal AI models in inferring user locations from single images. The increasing proficiency of large language models with vision capabilities has raised concerns regarding privacy and user security. Our benchmark addresses these concerns by analysing the performance of state-of-the-art models, such as GPT-4o, in deducing geographical coordinates from visual inputs. By Le “Qronox” Lam, Aleksandr Popov, Jord Nguyen, Trung Dung “mogu” Hoang, Marcel M, Felix Michalak...

May 27, 2024

Benchmarking Dark Patterns in LLMs

Abstract This paper builds upon the research in Seemingly Human: Dark Patterns in ChatGPT (Park et al., 2024) by introducing a new benchmark of 392 questions designed to elicit dark pattern behaviours in language models. We ran this benchmark on GPT-4 Turbo and Claude 3 Sonnet, and had them self-evaluate and cross-evaluate the responses. By Jord Nguyen, Akash Kundu, Sami Jawhar Read the paper here
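
As a rough sketch (not the paper's actual harness), the self-evaluation and cross-evaluation setup can be pictured like this: each model answers the benchmark prompts, then each model judges every set of responses, including its own. The model names, prompts, and the `judge` stub below are hypothetical placeholders.

```python
# Sketch of self-evaluation / cross-evaluation over a benchmark:
# every model answers every prompt, then every model judges every response.
from itertools import product

MODELS = ["model_a", "model_b"]    # stand-ins for the two models under test
PROMPTS = ["prompt_1", "prompt_2"]  # stand-ins for the benchmark questions

def generate(model: str, prompt: str) -> str:
    """Placeholder for querying a model (hypothetical)."""
    return f"{model} response to {prompt}"

def judge(judge_model: str, prompt: str, response: str) -> bool:
    """Placeholder: ask judge_model whether the response shows a dark pattern."""
    return False  # dummy verdict for the sketch

# Step 1: every model answers every prompt.
responses = {(m, p): generate(m, p) for m, p in product(MODELS, PROMPTS)}

# Step 2: every model judges every response.
# judge_model == answerer -> self-evaluation; otherwise cross-evaluation.
verdicts = {
    (jm, am, p): judge(jm, p, responses[(am, p)])
    for jm, am, p in product(MODELS, MODELS, PROMPTS)
}

# Aggregate: fraction of responses flagged, per (judge, answerer) pair.
for jm, am in product(MODELS, MODELS):
    flagged = sum(verdicts[(jm, am, p)] for p in PROMPTS)
    print(f"{jm} judging {am}: {flagged}/{len(PROMPTS)} flagged")
```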

May 27, 2024