OpenAI unveils a model that can fact-check itself

Written by
Kyle Wiggers
Published on
Sept. 12, 2024, 5:20 p.m.

ChatGPT maker OpenAI has announced its next major product release: a generative AI model code-named Strawberry, officially called OpenAI o1.

To be more precise, o1 is actually a family of models. Two are available today: o1-preview and a smaller, cheaper model, o1-mini, in both chatbot form and via OpenAI’s API. You’ll need a ChatGPT Plus or Team subscription; Enterprise and Edu users will get access early next week.

Note that the o1 chatbot is fairly barebones at the moment; unlike ChatGPT, o1 can’t browse the web or analyze files (yet). And the models aren’t cheap. In the API, o1-preview costs $15 per 1 million tokens (about 750,000 words) fed into the model and $60 per 1 million tokens generated by the model. For comparison, GPT-4o costs $5 per 1 million input tokens and $15 per 1 million output tokens.
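To put those prices in perspective, here is a rough back-of-the-envelope sketch in Python. The per-million-token prices are the ones quoted above; the function name `request_cost` and the token counts in the example are hypothetical, chosen only for illustration.

```python
# Per-million-token prices in USD, as quoted in this article.
PRICES = {
    "o1-preview": {"input": 15.00, "output": 60.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one API request (hypothetical helper)."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Illustrative request: 2,000 tokens in, 1,000 tokens out (made-up numbers).
o1_cost = request_cost("o1-preview", 2_000, 1_000)
gpt4o_cost = request_cost("gpt-4o", 2_000, 1_000)
print(f"o1-preview: ${o1_cost:.3f}, GPT-4o: ${gpt4o_cost:.3f}")
```

For that made-up request, o1-preview comes to about 9 cents versus 2.5 cents for GPT-4o. The exact ratio depends on the mix of input and output tokens, ranging from 3x on input alone to 4x on output alone.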

OpenAI says it plans to bring o1-mini access to all free users of ChatGPT but hasn’t set a release date yet.

o1 avoids some of the reasoning pitfalls that normally trip up generative AI models, at least according to OpenAI. That’s because o1 can effectively fact-check itself by spending more time considering all parts of a command or question.

OpenAI says that o1, originating from an internal company project known as Q*, is particularly adept at solving math and programming-related challenges. But what makes the text-only o1 “feel” qualitatively different from other generative AI models is its ability to “think” before responding to queries.

When given additional time to “think,” o1 can reason through a task holistically — planning ahead and performing a series of actions over an extended period of time that help it arrive at answers. This makes o1 well-suited for tasks that require synthesizing the results of multiple subtasks, like detecting privileged emails in an attorney’s inbox or brainstorming a product marketing strategy.

“o1 is trained with reinforcement learning to ‘think’ before responding via a private chain of thought,” Noam Brown, a research scientist at OpenAI, said in a series of tweets. “The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling.”

TechCrunch wasn’t offered the opportunity to test o1 before its debut; we aim to get our hands on it as soon as possible. But according to a person who did have access — Pablo Arredondo, VP at Thomson Reuters — o1 is better than OpenAI’s previous models (e.g. GPT-4o) at things like analyzing legal briefs and identifying solutions to problems in LSAT logic games.

“We saw it tackling more substantive, multi-faceted analysis,” Arredondo told TechCrunch. “Our automated testing also showed gains against a wide range of simple tasks.”

In a qualifying exam for the International Mathematical Olympiad, a high school math competition, o1 correctly solved 83% of problems while GPT-4o solved only 13%, OpenAI claims. The company also says that o1 should perform better on problems in science and coding.

Now, there is a downside. o1 can be slower than other models, depending on the query; Arredondo tells us the model can take over ten seconds to answer some questions. (Helpfully, the chatbot version of o1 shows its progress by displaying a label for the current subtask it’s performing.)

Given the unpredictable nature of generative AI models, o1 likely has other flaws and limitations (Brown admitted that o1 also trips up on games of tic-tac-toe, for example). We’ll no doubt learn about these in time — and once we get a chance to test the model ourselves.

We’d be remiss if we didn’t point out that OpenAI is far from the only AI vendor investigating these types of reasoning methods to improve model factuality. Google DeepMind researchers recently published a study showing that, by essentially giving models more compute time and guidance to fulfill requests as they’re made, the performance of those models can be significantly improved without any additional tweaks.

OpenAI might be first out of the gate with o1. But assuming rivals soon follow suit with comparable models, the company’s real test will be making o1 widely available at a reasonable price.
