    A Chinese firm has just launched a constantly changing set of AI benchmarks

    By Decapitalist News | June 23, 2025 | 3 Mins Read


    Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.

    Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model’s aptitude on various subjects. The other is more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.

    Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the right answer but also the reasoning chain that leads to it.

    DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature—questions that can’t just be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. A question in the publicized collection is “How many Chinese cities in the three northwestern provinces border a foreign country?” (It’s 12, and only 33% of models tested got it right, if you are wondering.)

    On the company’s website, the researchers said they want to add more dimensions to the test—for example, aspects like how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is.

    The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.
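
    One way to picture a half-public, half-private question set that is refreshed each quarter is a reproducible split keyed on a release label. The sketch below is purely illustrative and is not Xbench's actual procedure: the function name `split_public_private`, the question IDs, and the `release` label are all hypothetical.

```python
import hashlib


def split_public_private(question_ids, release="2025-Q2"):
    """Split question IDs into (public, private) halves.

    Hypothetical illustration only: IDs are ordered by a SHA-256 hash of
    the release label plus the ID, so the split is reproducible for a
    given release and reshuffles when the label changes.
    """
    ranked = sorted(
        question_ids,
        key=lambda qid: hashlib.sha256(f"{release}:{qid}".encode()).hexdigest(),
    )
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


# Made-up IDs standing in for a 100-question set.
ids = [f"q{i:03d}" for i in range(100)]
public, private = split_public_private(ids)
```

    Keeping one half unpublished makes it harder for a model to score well simply because the questions leaked into its training data, which is presumably the motivation for a design like this.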

    To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers.

    The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.

    ChatGPT-o3 ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.

    “It is really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”


