Technology

DeepMind’s Michelangelo benchmark reveals limitations of long-context LLMs



Large language models (LLMs) with very long context windows have been making headlines lately. The ability to cram hundreds of thousands or even millions of tokens into a single prompt unlocks many possibilities for developers. 

But how well do these long-context LLMs really understand and utilize the vast amounts of information they receive?

Researchers at Google DeepMind have introduced Michelangelo, a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs. Their findings, published in a new research paper, show that while current frontier models have progressed in retrieving information from large in-context data, they still struggle with tasks that require reasoning over the data structure.


The need for better long-context benchmarks

The emergence of LLMs with extremely long context windows, ranging from 128,000 to over 1 million tokens, has prompted researchers to develop new benchmarks to evaluate their capabilities. However, most of the focus has been on retrieval tasks, such as the popular “needle-in-a-haystack” evaluation, where the model is tasked with finding a specific piece of information within a large context.

“Over time, models have grown considerably more capable in long context performance,” Kiran Vodrahalli, research scientist at Google DeepMind, told VentureBeat. “For instance, the popular needle-in-a-haystack evaluation for retrieval has now been well saturated up to extremely long context lengths. Thus, it has become important to determine whether the harder tasks models are capable of solving in short context regimes are also solvable at long ranges.”

Retrieval tasks don’t necessarily reflect a model’s capacity for reasoning over the entire context. A model might be able to find a specific fact without understanding the relationships between different parts of the text. Meanwhile, existing benchmarks that evaluate a model’s ability to reason over long contexts have limitations.

“It is easy to develop long reasoning evaluations which are solvable with a combination of only using retrieval and information stored in model weights, thus ‘short-circuiting’ the test of the model’s ability to use the long-context,” Vodrahalli said.


Michelangelo

To address the limitations of current benchmarks, the researchers introduced Michelangelo, a “minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.” 

Michelangelo is based on the analogy of a sculptor chiseling away irrelevant pieces of marble to reveal the underlying structure. The benchmark focuses on evaluating the model’s ability to understand the relationships and structure of the information within its context window, rather than simply retrieving isolated facts.

The benchmark consists of three core tasks:

Latent List: The model must process a long sequence of operations performed on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list. “Latent List measures the ability of a model to track a latent data structure’s properties over the course of a stream of code instructions,” the researchers write.
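To make the task concrete, here is a toy sketch of what a Latent List instance could look like. This is an illustration only, not the benchmark's actual generator: it interleaves genuine list mutations with no-op distractor statements while preserving the order of the mutations, and records the ground-truth final state.

```python
import random

def make_latent_list_instance(n_ops=8, n_distractors=5, seed=0):
    """Generate a toy Latent List instance: a stream of statements where
    only some mutate the list `l`, plus the ground-truth final state."""
    rng = random.Random(seed)
    l, lines = [], ["l = []"]
    for _ in range(n_ops):
        x = rng.randint(0, 9)
        if rng.random() < 0.7 or not l:
            l.append(x)
            lines.append(f"l.append({x})")
        else:
            l.pop()
            lines.append("l.pop()")
    # Interleave no-op distractors without reordering the mutations.
    for _ in range(n_distractors):
        pos = rng.randint(1, len(lines))  # never before "l = []"
        lines.insert(pos, f"unused = {rng.randint(0, 9)} * 2")
    return "\n".join(lines), l

prompt, answer = make_latent_list_instance()
# Sanity check: executing the stream reproduces the ground truth.
ns = {}
exec(prompt, ns)
assert ns["l"] == answer
```

The model under test would receive only the statement stream and be asked for the final value of `l`; the generator keeps the answer for scoring.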


Multi-round co-reference resolution (MRCR): The model must reproduce parts of a long conversation between a user and an LLM. This requires the model to understand the structure of the conversation and resolve references to previous turns, even when the conversation contains confusing or distracting elements. “MRCR measures the model’s ability to understand ordering in natural text, to distinguish between similar drafts of writing, and to reproduce a specified piece of previous context subject to adversarially difficult queries,” the researchers write.
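A minimal sketch of an MRCR-style instance (a hypothetical format for illustration; the real benchmark uses adversarially similar drafts) builds a conversation of near-identical requests and asks the model to reproduce the reply to one specific turn:

```python
import random

def make_mrcr_instance(topics, n_turns=6, seed=0):
    """Toy MRCR-style instance: a conversation containing several
    near-identical requests; the query asks the model to reproduce
    the assistant reply from one specific earlier turn."""
    rng = random.Random(seed)
    turns = []
    for i in range(n_turns):
        topic = rng.choice(topics)
        turns.append((f"User: write a poem about {topic}",
                      f"Assistant: [poem #{i} about {topic}]"))
    target = rng.randrange(n_turns)
    conversation = "\n".join(f"{u}\n{a}" for u, a in turns)
    query = (f"Reproduce, word for word, the assistant's reply to "
             f"request number {target + 1}.")
    return conversation, query, turns[target][1]  # gold answer
```

Because every turn looks almost the same, retrieval of a single distinctive "needle" is not enough; the model has to track the ordering of the turns to answer correctly.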

“I don’t know” (IDK): The model is given a long story and asked to answer multiple-choice questions about it. For some questions, the context does not contain the answer, and the model must be able to recognize the limits of its knowledge and respond with “I don’t know.” “IDK measures the model’s ability to understand whether it knows what it doesn’t know based on the presented context,” the researchers write.
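One simple way to score this kind of task (a sketch; the paper's exact metric may differ) is to treat abstention as the correct answer whenever the context lacks the information:

```python
def score_idk(predictions, gold):
    """Score IDK-style items. `gold[i]` is the correct option, or None
    when the context does not contain the answer, in which case the
    model should abstain with "I don't know"."""
    correct = 0
    for pred, ans in zip(predictions, gold):
        target = "I don't know" if ans is None else ans
        correct += (pred == target)
    return correct / len(gold)

# A model that guesses "C" instead of abstaining loses credit:
score_idk(["B", "I don't know", "C"], ["B", None, None])  # returns 2/3
```

Scoring abstention explicitly is what separates this from ordinary multiple-choice accuracy: a model that always guesses an option can never get the unanswerable items right.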

Latent Structure Queries

The tasks in Michelangelo are based on a novel framework called Latent Structure Queries (LSQ). LSQ provides a general approach for designing long-context reasoning evaluations that can be extended to arbitrary lengths. It can also test the model’s understanding of implicit information as opposed to retrieving simple facts. LSQ relies on synthesizing test data to avoid the pitfalls of test data leaking into the training corpus.

“By requiring the model to extract information from structures rather than values from keys (sculptures from marble rather than needles from haystacks), we can more deeply test language model context understanding beyond retrieval,” the researchers write.


LSQ has three key differences from other approaches to evaluating long-context LLMs. First, it has been explicitly designed to avoid short-circuiting flaws in evaluations that go beyond retrieval tasks. Second, it specifies a methodology for increasing task complexity and context length independently. And finally, it is general enough to capture a large range of reasoning tasks. The three tests used in Michelangelo cover code interpretation and reasoning over loosely written text.

“The goal is that long-context beyond-reasoning evaluations implemented by following LSQ will lead to fewer scenarios where a proposed evaluation reduces to solving a retrieval task,” Vodrahalli said.

Evaluating frontier models on Michelangelo

The researchers evaluated ten frontier LLMs on Michelangelo, including different variants of Gemini, GPT-4 and 4o, and Claude. They tested the models on contexts up to 1 million tokens. Gemini models performed best on MRCR, GPT models excelled on Latent List, and Claude 3.5 Sonnet achieved the highest scores on IDK.

However, all models exhibited a significant drop in performance as the complexity of the reasoning tasks increased, suggesting that even with very long context windows, current LLMs still have room to improve in their ability to reason over large amounts of information.

Frontier LLMs struggle with reasoning on long-context windows (source: arXiv)

“Frontier models have room to improve on all of the beyond-retrieval reasoning primitives (Latent List, MRCR, IDK) that we investigate in Michelangelo,” Vodrahalli said. “Different frontier models have different strengths and weaknesses – each class performs well on different context ranges and on different tasks. What does seem to be universal across models is the initial drop in performance on long reasoning tasks.”

The Michelangelo evaluations capture basic primitives necessary for long-context reasoning, and the findings can have important implications for enterprise applications. For example, in real-world applications where the model can’t rely on its pretraining knowledge and must perform multi-hop reasoning over many disparate locations in very long contexts, Vodrahalli expects performance to drop as the context length grows.

“This is particularly true if the documents have a lot of information that is irrelevant to the task at hand, making it hard for a model to immediately distinguish which information is relevant or not,” Vodrahalli said. “It is also likely that models will continue to perform well on tasks where all of the relevant information to answer a question is located in one general spot in the document.”

The researchers will continue to add more evaluations to Michelangelo and hope to make them directly available so that other researchers can test their models on them.



Servers computers

【ANNSO】15" 8U Rack Mount Workstation Chassis




➥ Features
• Die-cast aluminum front panel with integrated 15-inch LCD screen
• Compact 8U-height rack-mount workstation chassis
• Front panel, membrane keys, and drive-bay enclosure meet the IP64 standard for tough environments (IP54 for the mouse pad)
• Full-function membrane keypad (USB interface) and front OSD controller
• Advanced thermal and airflow design
• Analog VGA interface supports all CPU boards (DVI optional)

➥ Website
www.annso.com

➥ For corporate inquiries
fandy@annso.com.tw
andy@annso.com.tw
sales@annso.com.tw

➥ Facebook
https://www.facebook.com/Annsotec/


Technology

Apple just released its first immersive movie for Vision Pro


The Making of Submerged | Apple Vision Pro

Apple has just debuted the first scripted film captured in Apple Immersive Video and made specifically for the Vision Pro headset.

Designed to put viewers in the center of the action, the new movie, called Submerged, was written and directed by Academy Award-winning filmmaker Edward Berger.

The 17-minute immersive thriller takes viewers onto a WWII-era submarine and follows its crew as they attempt to deal with a harrowing attack. It offers viewers a 180-degree view, allowing you to explore your surroundings and follow the action wherever it takes place. The story features a lot of water and a lot of panic, so don’t even think about watching it in the bath.

“With Submerged … we’re excited to premiere the next generation of narrative filmmaking,” Apple executive Tor Myhren said in a release. “Vision Pro places you in the middle of the story — inside a densely packed submarine, shoulder to shoulder with its crew. That deep sense of immersion just wasn’t possible before, and we can’t wait to see how it inspires filmmakers to push the boundaries of visual storytelling.”


Berger, director of the Oscar-winning All Quiet on the Western Front, described Apple Immersive Video as “a wonderful new medium that expands the horizon of storytelling,” claiming that the technology pioneered by Apple will “change the future of filmmaking.”

Submerged was shot on location in Prague, Brussels, and Malta over three weeks, and Apple has released a behind-the-scenes look (see the video at the top of the article) showing how it was made. It was filmed using a full-scale 23-ton submarine set made with real steel, brass, and metal that was modeled after WWII-era vessels, and freedive training for actors was part of the preparation.

Apple said it’s preparing to release more content specifically for the Vision Pro, including an immersive short film of the 2024 NBA All-Star Weekend, and a Concert for One series aimed at “bringing fans closer to their favorite artists than ever before.”







Technology

Google Search testing a feature that shows full recipes in feed


For a while now, Google Search has been experimenting with ways to offer useful data directly in the results feed. The idea is that for certain tasks, users don’t have to open websites unless they want more in-depth details. In line with that, Google is testing a new feature that shows full recipes directly in Search.

Google wants to change the way you use Search by keeping you on the results feed longer than before. That’s why the company relied on the power of artificial intelligence to develop the AI Overview feature. The latter didn’t turn out as expected, but the idea was pretty cool. Cooking recipes seem to be the next step in this direction.

Google Search shows full recipes in the feed for some websites as a test

Search Engine Roundtable reported that Google Search is testing a feature that allows it to display full recipes in the results feed. The option isn’t available for all recipe websites, though. On supported ones, you’ll see a “Quick View” button among the graphical results in the feed when you search for a particular recipe. For example, Preppy Kitchen, a cooking blog, is one of the platforms that supports the new feature. The “Quick View” button appears over featured images in the results for the “chocolate chip cookies recipe” search.

You might perceive Google “pulling” data from websites as a potential loss of visits to those websites. However, a company spokesperson disclosed that partnerships with those recipe platforms enable the feature. This explains why the feature isn’t available on all of them. So, in theory, blogs receive fair compensation for the use of their content.


“We’re always experimenting with different ways to connect our users with high-quality and helpful information. We have partnered with a limited number of creators to begin to explore new recipe experiences on Search that are both helpful for users and drive value to the web ecosystem. We don’t have anything to announce right now,” Google spokesperson Brianna Duff told The Verge.

No date for the wide rollout yet

Given that this feature is currently undergoing testing, numerous changes may occur prior to its widespread implementation. The release date for the final version of recipes in the Google Search feed is still unknown.


Servers computers

Rack Swap: 12U to 16U – Part 1 #cabeamentoestruturado


Technology

Scrub down Shrek’s world in PowerWash Simulator


There are a ton of Shrek movies, but not one of them has ever answered this question: Who cleans up the mess when the ogre and his various fairytale villains are done fighting? Square Enix’s PowerWash Simulator finally has an answer.

Dreamworks and Square Enix have teamed up to create the Shrek Special Pack for PowerWash Simulator, available now on all consoles and PC. The new pack adds a bunch of scenarios from the iconic animated films, plus new armor and tools to help you scrub down the many layers of crud that have accumulated over Shrek’s world.

The new DLC pack comes with five new locations that need a good power washing: Shrek’s home swamp, the town of Duloc (complete with that adorable wind-up information booth), the Fairy Godmother’s potion factory, the dragon’s lair, and Hansel’s delectable Honeymoon Hideaway with the Shreks’ onion wedding carriage.

The Shrek Special Pack also offers a new campaign mode that takes you through the new scenes and grants you a new set of knight-themed power-washing armor and hoses. You’ll also receive messages from “some familiar faces,” maybe even the Muffin Man. (The Muffin Man!) Yes, the Muffin Man! (Actually, you probably won’t. He’s not really an ancillary character in the Shrek universe outside of the nursery rhyme reference from the first movie.)

One of the great things about PowerWash Simulator is just how wild the studio has gone with its DLC packs. Square Enix has also developed special cleaning scenarios based on other franchises. The developers have been working on so many things to clean up that they’ve accidentally lost track of one.


Servers computers

HP DL380 Rack Server



Copyright © 2024 WordupNews.com