Technology
DeepMind’s Michelangelo benchmark reveals limitations of long-context LLMs
Large language models (LLMs) with very long context windows have been making headlines lately. The ability to cram hundreds of thousands or even millions of tokens into a single prompt unlocks many possibilities for developers.
But how well do these long-context LLMs really understand and utilize the vast amounts of information they receive?
Researchers at Google DeepMind have introduced Michelangelo, a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs. Their findings, published in a new research paper, show that while current frontier models have progressed in retrieving information from large in-context data, they still struggle with tasks that require reasoning over the data structure.
The need for better long-context benchmarks
The emergence of LLMs with extremely long context windows, ranging from 128,000 to over 1 million tokens, has prompted researchers to develop new benchmarks to evaluate their capabilities. However, most of the focus has been on retrieval tasks, such as the popular “needle-in-a-haystack” evaluation, where the model is tasked with finding a specific piece of information within a large context.
“Over time, models have grown considerably more capable in long context performance,” Kiran Vodrahalli, research scientist at Google DeepMind, told VentureBeat. “For instance, the popular needle-in-a-haystack evaluation for retrieval has now been well saturated up to extremely long context lengths. Thus, it has become important to determine whether the harder tasks models are capable of solving in short context regimes are also solvable at long ranges.”
Retrieval tasks don’t necessarily reflect a model’s capacity for reasoning over the entire context. A model might be able to find a specific fact without understanding the relationships between different parts of the text. Meanwhile, existing benchmarks that evaluate a model’s ability to reason over long contexts have limitations.
“It is easy to develop long reasoning evaluations which are solvable with a combination of only using retrieval and information stored in model weights, thus ‘short-circuiting’ the test of the model’s ability to use the long-context,” Vodrahalli said.
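To make the retrieval setting concrete, here is a minimal sketch of how a needle-in-a-haystack prompt could be assembled and scored. The function names, the filler sentence, and the substring-based scoring rule are all illustrative assumptions, not taken from any specific benchmark.

```python
def build_haystack_prompt(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Hide `needle` among `n_filler` copies of `filler`.

    `depth` is the relative position of the needle: 0.0 puts it at the
    start of the context, 1.0 at the end. (Hypothetical helper.)
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    position = round(depth * n_filler)
    sentences.insert(position, needle)
    return " ".join(sentences)


def is_retrieved(model_answer: str, expected: str) -> bool:
    """Assumed scoring rule: the expected fact appears in the answer."""
    return expected.lower() in model_answer.lower()


prompt = build_haystack_prompt(
    needle="The secret code is 7421.",
    filler="The sky was grey that day.",
    n_filler=1000,
    depth=0.5,
)
question = "What is the secret code?"
# `prompt + question` would be sent to the model under test.
```

A test like this only checks whether one isolated fact can be located, which is why a high score says little about reasoning over the rest of the context.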
Michelangelo
To address the limitations of current benchmarks, the researchers introduced Michelangelo, a “minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.”
Michelangelo is based on the analogy of a sculptor chiseling away irrelevant pieces of marble to reveal the underlying structure. The benchmark focuses on evaluating the model’s ability to understand the relationships and structure of the information within its context window, rather than simply retrieving isolated facts.
The benchmark consists of three core tasks:
Latent list: The model must process a long sequence of operations performed on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list. “Latent List measures the ability of a model to track a latent data structure’s properties over the course of a stream of code instructions,” the researchers write.
Multi-round co-reference resolution (MRCR): The model must produce parts of a long conversation between a user and an LLM. This requires the model to understand the structure of the conversation and resolve references to previous turns, even when the conversation contains confusing or distracting elements. “MRCR measures the model’s ability to understanding ordering in natural text, to distinguish between similar drafts of writing, and to reproduce a specified piece of previous context subject to adversarially difficult queries,” the researchers write.
“I don’t know” (IDK): The model is given a long story and asked to answer multiple-choice questions about it. For some questions, the context does not contain the answer, and the model must be able to recognize the limits of its knowledge and respond with “I don’t know.” “IDK measures the model’s ability to understand whether it knows what it doesn’t know based on the presented context,” the researchers write.
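As a rough illustration of the Latent List idea, the sketch below defines a toy stream of list operations in which some statements change the list and others do not, then computes the ground-truth final state by running it. The toy program and helper names are hypothetical and are not drawn from the benchmark itself.

```python
def run_list_program(program: str, list_name: str = "a") -> list:
    """Execute a small, trusted list program and return the final list."""
    namespace = {}
    # exec is acceptable here only because the program text is written by us.
    exec(program, namespace)
    return namespace[list_name]


def is_correct(model_answer: list, program: str) -> bool:
    """Compare a model's predicted final list with the ground truth."""
    return model_answer == run_list_program(program)


# Hypothetical toy instance: len, sorted and count leave the list unchanged.
TOY_PROGRAM = """
a = []
a.append(3)
len(a)
a.append(7)
sorted(a)
a.pop()
a.append(1)
a.count(3)
a.reverse()
"""

print(run_list_program(TOY_PROGRAM))  # [1, 3]
```

The model would see only the program text and be asked for the final list; statements such as `sorted(a)` look relevant but return a new list without modifying `a`.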
Latent Structure Queries
The tasks in Michelangelo are based on a novel framework called Latent Structure Queries (LSQ). LSQ provides a general approach for designing long-context reasoning evaluations that can be extended to arbitrary lengths. It can also test the model’s understanding of implicit information as opposed to retrieving simple facts. LSQ relies on synthesizing test data to avoid the pitfalls of test data leaking into the training corpus.
“By requiring the model to extract information from structures rather than values from keys (sculptures from marble rather than needles from haystacks), we can more deeply test language model context understanding beyond retrieval,” the researchers write.
LSQ has three key differences from other approaches to evaluating long-context LLMs. First, it has been explicitly designed to avoid short-circuiting flaws in evaluations that go beyond retrieval tasks. Second, it specifies a methodology for increasing task complexity and context length independently. And finally, it is general enough to capture a large range of reasoning tasks. The three tests used in Michelangelo cover code interpretation and reasoning over loosely written text.
“The goal is that long-context beyond-reasoning evaluations implemented by following LSQ will lead to fewer scenarios where a proposed evaluation reduces to solving a retrieval task,” Vodrahalli said.
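The idea of increasing task complexity and context length independently can be sketched with a small synthetic generator. In this hypothetical example, `n_relevant` controls how many operations actually change the list and `n_distractors` controls how much non-mutating filler is mixed in. The generator, its parameters, and the choice of distractor statements are assumptions made for illustration, not the researchers' implementation.

```python
import random


def make_latent_list_task(n_relevant: int, n_distractors: int, seed: int = 0):
    """Return (program_text, final_list) for a synthetic list-tracking task."""
    rng = random.Random(seed)
    state = []
    lines = ["a = []"]

    # Complexity knob: operations that really change the list.
    for _ in range(n_relevant):
        value = rng.randint(0, 9)
        if state and rng.random() < 0.3:
            state.pop()
            lines.append("a.pop()")
        else:
            state.append(value)
            lines.append(f"a.append({value})")

    # Length knob: statements that read the list but never modify it.
    distractors = ["len(a)", "sorted(a)", "a.count(0)", "a[:]"]
    for _ in range(n_distractors):
        # Inserting after index 0 keeps "a = []" first and preserves
        # the relative order of the relevant operations.
        position = rng.randint(1, len(lines))
        lines.insert(position, rng.choice(distractors))

    return "\n".join(lines), state


short_program, short_answer = make_latent_list_task(5, 10, seed=3)
long_program, long_answer = make_latent_list_task(5, 1000, seed=3)
# Same seed and n_relevant give the same answer, only the context grows.
print(short_answer == long_answer)  # True
```

Because the data is generated fresh from a seed, instances like these cannot have leaked into a training corpus, which is the motivation the article gives for synthetic test data.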
Evaluating frontier models on Michelangelo
The researchers evaluated ten frontier LLMs on Michelangelo, including different variants of Gemini, GPT-4 and 4o, and Claude. They tested the models on contexts up to 1 million tokens. Gemini models performed best on MRCR, GPT models excelled on Latent List, and Claude 3.5 Sonnet achieved the highest scores on IDK.
However, all models exhibited a significant drop in performance as the complexity of the reasoning tasks increased, suggesting that even with very long context windows, current LLMs still have room to improve in their ability to reason over large amounts of information.
“Frontier models have room to improve on all of the beyond-retrieval reasoning primitives (Latent List, MRCR, IDK) that we investigate in Michelangelo,” Vodrahalli said. “Different frontier models have different strengths and weaknesses – each class performs well on different context ranges and on different tasks. What does seem to be universal across models is the initial drop in performance on long reasoning tasks.”
The Michelangelo evaluations capture basic primitives necessary for long-context reasoning and the findings can have important implications for enterprise applications. For example, in real-world applications where the model can’t rely on its pretraining knowledge and must perform multi-hop reasoning over many disparate locations in very long contexts, Vodrahalli expects performance to drop as the context length grows.
“This is particularly true if the documents have a lot of information that is irrelevant to the task at hand, making it hard for a model to easily immediately distinguish which information is relevant or not,” Vodrahalli said. “It is also likely that models will continue to perform well on tasks where all of the relevant information to answer a question is located in one general spot in the document.”
The researchers will continue to add more evaluations to Michelangelo and hope to make them directly available so that other researchers can test their models on them.
Servers computers
【ANNSO】15" 8U Rack Mount Workstation Chassis
➥Features ➥
• Die-cast aluminum front panel with integrated 15 inch LCD screen
• Compact 8U height, rack mount workstation chassis
• Front panel, membrane keys and drive-bay enclosure meet IP 64 standard for tough environment (IP54 for mouse pad)
• Full function membrane keypad (USB interface) and front OSD controller
• Advanced thermal and air-flow design
• Analog VGA interface supports all CPU boards (DVI option)
➥Website ➥
www.annso.com
➥For corporation➥
fandy@annso.com.tw
andy@annso.com.tw
sales@annso.com.tw
➥Facebook➥
https://www.facebook.com/Annsotec/
Technology
Apple just released its first immersive movie for Vision Pro
Apple has just debuted the first scripted film captured in Apple Immersive Video and made specifically for the Vision Pro headset.
Designed to put viewers in the center of the action, the new movie, called Submerged, was written and directed by Academy Award-winning filmmaker Edward Berger.
The 17-minute immersive thriller takes viewers onto a WWII-era submarine and follows its crew as they attempt to deal with a harrowing attack. It offers viewers a 180-degree view, allowing you to explore your surroundings and follow the action wherever it takes place. The story features a lot of water and a lot of panic, so don’t even think about watching it in the bath.
“With Submerged … we’re excited to premiere the next generation of narrative filmmaking,” Apple executive Tor Myhren said in a release. “Vision Pro places you in the middle of the story — inside a densely packed submarine, shoulder to shoulder with its crew. That deep sense of immersion just wasn’t possible before, and we can’t wait to see how it inspires filmmakers to push the boundaries of visual storytelling.”
Berger, director of the Oscar-winning All Quiet on the Western Front, described Apple Immersive Video as “a wonderful new medium that expands the horizon of storytelling,” claiming that the technology pioneered by Apple will “change the future of filmmaking.”
Submerged was shot on location in Prague, Brussels, and Malta over three weeks, and Apple has released a behind-the-scenes look showing how it was made. It was filmed using a full-scale 23-ton submarine set made with real steel, brass, and metal that was modeled after WWII-era vessels, and freedive training for actors was part of the preparation.
Apple said it’s preparing to release more content specifically for the Vision Pro, including an immersive short film of the 2024 NBA All-Star Weekend, and a Concert for One series aimed at “bringing fans closer to their favorite artists than ever before.”
Technology
Google Search testing a feature that shows full recipes in feed
For a while now, Google Search has been experimenting with ways to offer useful data directly in the results feed. The idea is that for certain tasks, users don’t have to open websites unless they want more in-depth details. In line with that, Google is testing a new feature that shows full recipes directly in Search.
Google wants to change the way you use Search so that you spend more time in the feed than before. That’s why the company used artificial intelligence to develop the AI Overview feature. The latter didn’t turn out as expected, but the idea was pretty cool. Cooking recipes seem to be the next step in this direction.
Google Search shows full recipes in the feed for some websites as a test
Search Engine Roundtable reported that Google Search is testing a feature that allows it to display full recipes in the results feed. The option isn’t available for all recipe websites, though. On supported ones, you’ll see a “Quick View” button among the graphical results in the feed when you search for a particular recipe. For example, Preppy Kitchen, a cooking blog, is one of the platforms that supports the new feature. The “Quick View” button appears over featured images in the results for the “chocolate chip cookies recipe” search.
You might perceive Google “pulling” data from websites as a potential loss of visits to those websites. However, a company spokesperson disclosed that partnerships with those recipe platforms enable the feature. This explains why the feature isn’t available on all of them. So, in theory, blogs receive fair compensation for the use of their content.
“We’re always experimenting with different ways to connect our users with high-quality and helpful information. We have partnered with a limited number of creators to begin to explore new recipe experiences on Search that are both helpful for users and drive value to the web ecosystem. We don’t have anything to announce right now,” Google spokesperson Brianna Duff told The Verge.
No date for the wide rollout yet
Because the feature is still in testing, it may change considerably before a wide rollout. The release date for the final version of recipes in the Google Search feed is still unknown.
Servers computers
Rack swap 12U x 16U – PART 1 #cabeamentoestruturado
Technology
Scrub down Shrek’s world in PowerWash Simulator
There are a ton of Shrek movies, but not one of them has ever answered this question: Who cleans up the mess when the ogre and his various fairytale villains are done fighting? Square Enix’s PowerWash Simulator finally has an answer.
DreamWorks and Square Enix have teamed up to create the Shrek Special Pack for PowerWash Simulator, available now on all consoles and PC. The new pack adds a bunch of scenarios from the iconic animated films and some new armor and tools to help you scrub down the many layers of crud that have accumulated over Shrek’s world.
The new DLC pack comes with 5 new locations that need a good power washing. They include Shrek’s home swamp, the town of Duloc complete with that adorable wind-up information booth, the Fairy Godmother’s potion factory, the dragon’s lair and Hansel’s delectable Honeymoon Hideaway with the Shreks’ onion wedding carriage.
The Shrek Special Pack also offers a new campaign mode that takes you through the new scenes and grants you a new set of knight-themed power washing armor and hoses. You’ll also receive messages from “some familiar faces,” maybe even the Muffin Man. (The Muffin Man!) Yes, the Muffin Man! (Actually, you probably won’t. He’s not really an ancillary character in the Shrek universe outside of the nursery rhyme reference from the first movie.)
One of the great things about PowerWash Simulator is just how crazy they’ve gone with the DLC packs. Square Enix has also developed special cleaning scenarios based on , and . The developers have been working on so many things to clean up that they’ve accidentally lost track of one and .
Servers computers
HP DL380 Rack Server