
Claude vs ChatGPT: What is the difference, and which one to use in 2025?

Since its launch in November 2022, OpenAI’s ChatGPT has been the go-to AI model for virtually any task—until Anthropic’s Claude arrived. Now, these two AI assistants fiercely compete for the leader’s position, with neither consistently outperforming the other in all areas. Therefore, we’re making a detailed comparison of Claude and ChatGPT to find each model’s strengths and weaknesses and declare the winner.

In this article, we’ll be testing these prime examples of AI technology in key categories, such as creative writing, coding, and problem solving, among others. Then, we’ll pick the winner in each and crown the best artificial intelligence to use in 2025, at least for the moment.

We’ll be comparing each company’s latest and most advanced models: ChatGPT 4.1 vs Sonnet 4 and ChatGPT o3 vs Opus 4. The latter pair is the most powerful, while the former offers better speeds.


7/7/2025

27 min read

First look at Claude and ChatGPT

Both Claude and ChatGPT are great large language models (LLMs), but their main strengths lie in different areas. Claude is recommended for coding and creative writing due to its more natural-sounding style. However, ChatGPT is more versatile and offers extra features, such as image and video generation.

Here are the main benefits and drawbacks of ChatGPT and Claude discussed in more detail.

Benefits of Claude

The key benefit of Claude is its natural writing style. Its texts are more coherent and natural-flowing instead of being an amalgamation of paragraphs taken from different sources. Claude is also great for coding – it can give real-time feedback and visualize the end results with its Artifacts feature.

Drawbacks of Claude

The main drawbacks of Claude include its lack of features, namely image and video generation along with voice output. What’s more, Claude falls short in high processing functions, problem solving, and explanation. Finally, it could benefit from a larger context window.

Benefits of ChatGPT

The main benefit of ChatGPT is its versatility – it might not be the best in areas like writing, but it’s good enough in most cases. Where ChatGPT really shines is the number of features – you can have a voice conversation with it, create videos, images, and even your own specialized GPTs. Finally, all of this comes at a reasonable price, especially if you choose the API option.

Drawbacks of ChatGPT

ChatGPT’s versatility is also its main drawback. While it’s good for writing and coding, it’s not the best LLM for these tasks. For instance, it may sound robotic, with no artistic touch. Its code can also be far from optimal. 

Claude vs. ChatGPT: a quick feature and pricing comparison

Company
  • Claude: Anthropic
  • ChatGPT: OpenAI

Latest AI models
  • Claude: Sonnet 4, Opus 4
  • ChatGPT: ChatGPT 4.1, ChatGPT o3

Context window
  • Claude: Sonnet 4 – 200,000 tokens (about 150,000 words); Opus 4 – 200,000 tokens (about 150,000 words)
  • ChatGPT: 4.1 – 1,047,576 tokens (about 786,000 words); o3 – 200,000 tokens (about 150,000 words)

Maximum output
  • Claude: Sonnet 4 – 64,000 tokens (about 48,000 words); Opus 4 – 32,000 tokens (about 24,000 words)
  • ChatGPT: 4.1 – 32,768 tokens (about 24,600 words); o3 – 100,000 tokens (about 75,000 words)

Training data cut-off
  • Claude: Sonnet 4 – March 2025; Opus 4 – March 2025
  • ChatGPT: 4.1 – June 2024; o3 – June 2024

Web search
  • Claude: Sonnet 4 – Yes; Opus 4 – Yes
  • ChatGPT: 4.1 – Yes; o3 – Yes

Image generation
  • Claude: No
  • ChatGPT: Yes (GPT Image 1)

Video generation
  • Claude: No
  • ChatGPT: Yes (Sora)

Voice output
  • Claude: Sonnet 4 – No; Opus 4 – No
  • ChatGPT: Yes

Input types
  • Claude: Sonnet 4 – text, image; Opus 4 – text, image
  • ChatGPT: 4.1 – text, voice, image; o3 – text, voice, image

Available on free plan
  • Claude: Sonnet 4 – No; Opus 4 – No
  • ChatGPT: 4.1 – No; o3 – No

Pricing (user)
  • Claude: Pro – $17/month; Max – from $100/month
  • ChatGPT: Plus – $20/month; Pro – $200/month

Pricing (teams)
  • Claude: Team – $25/user/month if paid annually, $30/user/month if paid monthly, minimum 5 users
  • ChatGPT: Team – $25/user/month if paid annually, $30/user/month if paid monthly, minimum 2 users

Pricing (API)
  • Claude: Sonnet 4 – $3 per 1M input tokens, $0.30 per 1M cached input tokens, $15 per 1M output tokens; Opus 4 – $15 per 1M input tokens, $1.50 per 1M cached input tokens, $75 per 1M output tokens
  • ChatGPT: 4.1 – $2 per 1M input tokens, $0.50 per 1M cached input tokens, $8 per 1M output tokens; o3 – $2 per 1M input tokens, $0.50 per 1M cached input tokens, $8 per 1M output tokens

The data in the table above already shows some key differences between ChatGPT and Claude:

  • ChatGPT 4.1 offers a context window that is more than five times larger. This is crucial when working with large datasets or documents.
  • ChatGPT o3 has the biggest maximum output. This is essential for creating large datasets or documents.
  • The input-output ratio is heavily in ChatGPT o3’s favor. It’s around 2:1 for o3, 3:1 for Sonnet 4, 6:1 for Opus 4, and a whopping 32:1 for 4.1. The latter means that if 4.1 reads a manuscript that fills its context window, it can only hand the edited text back in chunks of up to 32,768 tokens, so your debut epic fantasy saga would come back in at least thirty-two pieces that you’d have to stitch together manually.
  • Claude has much more recent data, with the cut-offs made in March 2025. However, this advantage can be virtually negated by using the web search feature.
  • Only ChatGPT can generate images and videos. This is a big plus if you need an all-around tool.
  • Only ChatGPT offers voice chat. You can select from different voices and have a conversation that’s later transcribed.
  • ChatGPT’s Team plan requires just two users ($50/month), compared to Claude’s five ($125/month).
  • ChatGPT gives more bang for almost the same amount of buck, unless you have no use for image and video generation or voice input and output.
  • ChatGPT’s API is considerably cheaper, especially when it comes to output tokens.

To sum up, ChatGPT offers more for the same (or even better) price. However, if you have zero interest in creative tasks, such as image and video generation, Claude may be a great option as well. That is, unless you need a much larger context window and output. If you'd like to learn more about various stages of ChatGPT, you can take a look at our guide on ChatGPT evolution.
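To make the API price gap more tangible, here’s a rough back-of-the-envelope sketch of our own (not from either vendor’s documentation), using the list prices from the table above and an example request of 100,000 input and 20,000 output tokens:

// Rough cost sketch using the per-1M-token list prices quoted in the table above.
// The request size (100K input, 20K output tokens) is just an illustrative example.
const pricesPerMillion = {
  'Sonnet 4': { input: 3, output: 15 },
  'Opus 4': { input: 15, output: 75 },
  'ChatGPT 4.1': { input: 2, output: 8 },
  'ChatGPT o3': { input: 2, output: 8 },
};

const inputTokens = 100000;
const outputTokens = 20000;

for (const [model, price] of Object.entries(pricesPerMillion)) {
  const cost = (inputTokens / 1e6) * price.input + (outputTokens / 1e6) * price.output;
  console.log(`${model}: $${cost.toFixed(2)}`);
}
// Prints roughly: Sonnet 4: $0.60, Opus 4: $3.00, ChatGPT 4.1: $0.36, ChatGPT o3: $0.36

Prompt caching would lower these numbers further, but the relative gap stays roughly the same.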

We also want to point out that it’s possible to further reduce AI usage costs with the help of AI orchestration platforms, such as nexos.ai. They can help limit budgets and save tokens with caching.

Please also note that at this stage we’re comparing raw numbers rather than quality. Read on to learn how each model fares in different tasks as we put them through a series of meticulously designed tests.

Model capabilities: benchmark results

In this section, we’ll talk about what you can and can’t do with each model. We’ll also check their performance on popular AI benchmarks that test how good a certain tool is at different tasks, such as coding or writing.

Language understanding and generation benchmark results

ChatGPT offers a much larger context window (786,000 words for 4.1) and output (75,000 words for o3), which is crucial when working with huge documents. 

However, Claude’s texts feel more human, meaning this AI is better for writing. 

According to the LiveBench language test with questions last updated on April 25, 2025, Opus 4 (76.11) and o3 with high “reasoning effort” (High) settings (76.00) are the clear leaders, leaving Sonnet 4 in “thinking mode” (Thinking) behind with 70.19. Much to our surprise, ChatGPT 4.1 scored just 54.55.

Coding benchmark results

Claude and ChatGPT can be really handy for writing or debugging code. But which of the four AI assistants is the go-to choice?

Our tests of real-world applications described later didn’t show a big discrepancy between them, but LiveBench tells another story. According to it, o3 with medium reasoning (Medium) (77.86) and Sonnet 4 (77.54) led the way, with Opus 4 Thinking taking third place (73.25). Once more, ChatGPT 4.1 finished last, although not by much (73.19).

Reasoning benchmark

Reasoning is an important overall capability of an AI, showing whether it can solve different tasks requiring logic, “common sense,” and so on. In our test, this was most evident in fiction writing, where all four models failed to avoid logical inconsistencies. We awarded first and second place to Sonnet 4 and ChatGPT o3, respectively.

LiveBench showed similar results: Sonnet 4 Thinking (95.25) and o3 High (93.33) scored the highest. Opus 4 Thinking was third with 90.47, and ChatGPT 4.1 lagged way behind with 44.39.

Data analysis benchmark results

We didn’t run a special test for data analysis, such as combining or reformatting tables and similar complex tasks, because LiveBench showed no significant difference between Claude and ChatGPT.

Opus 4 Thinking (70.73) and Sonnet 4 Thinking (69.84) led the way, but o3 Medium (68.19) and 4.1 (66.40) weren’t far behind.

Mathematics benchmark results

We didn’t test mathematics ourselves and let LiveBench be the judge. The test included questions from high school math competitions, such as AMC12 and USAMO.

Claude’s models took the top spots again, with Opus 4 Thinking (88.25) and Sonnet 4 Thinking (85.25) scoring the highest. But o3 High was right behind (85.00), which cannot be said about 4.1 (62.39).

Other capabilities 

While the 4.1 model was often left behind in LiveBench tests, our own test results didn’t show such a big difference. Plus, we shouldn’t forget that Claude’s AI is limited to text output, whereas ChatGPT can give you images and videos.

Also, ChatGPT models allow voice input and output, which can be really convenient, especially for mobile users and accessibility needs.

Ease of use and user experience

Both Claude and ChatGPT are easy to use, providing an intuitive and uncluttered user interface. You can access them either via a web interface or an API.

However, each comes with pros and cons that may matter to some users.

ChatGPT has a big usability advantage due to its voice mode. You can choose from 9 distinct male and female voices to find the one that’s most pleasing to your ear. There’s even a selection of British accents!

ChatGPT voice mode voice selection

It can do a web search just like the text version. After finishing the conversation, you’ll find its transcript in the main window.

ChatGPT voice mode transcript

The main chat interfaces have a text input field and some extra options. ChatGPT lets you add photos or files from your computer. Alternatively, you can connect your Google Drive or Microsoft OneDrive. In contrast, Claude has a “take a screenshot” feature and allows Google Drive and GitHub uploads. The latter will be nice for developers.

ChatGPT’s tools option gives “create an image”, “search the web”, “write or code”, and “run deep research”.

ChatGPT tools drop-down menu

The logic of Claude’s “Search and tools” button is a bit different. Upon pressing it, you can toggle web search and extended thinking, run a query in Drive, Gmail, or Calendar, and add integrations.

Claude tools drop-down menu

Furthermore, ChatGPT offers three themes: system, dark, and light. Claude has the same theme options and, on top of those, lets you choose a default, system, or dyslexia-friendly font.

Write mode

With ChatGPT, clicking “Tools” and “Write or code” puts you in a canvas mode, moving the prompt area to the left. On the bottom right, you get a specialized “Suggested edits” toolbox with options such as changing the reading level, adjusting the length, or adding emojis. However, as it expands only after hovering your mouse over it, you can easily miss it on larger screens, as there are no other elements around it.

ChatGPT canvas: Text before and after changing the reading level from “college” to “high school”

On the top right, you can see changes, revert to the last version, and copy or share text.

Meanwhile, Claude’s “Write” button works differently. It offers suggestions like “Develop podcast scripts” or “Write compelling CTAs.” And while ChatGPT leaves only a narrow prompt area on the left, Claude splits the screen in half, though both layouts can be adjusted to your taste.

Claude Artifacts for writing

Claude’s “canvas” writing mode, Artifacts (where content appears in a dedicated window alongside the conversation), doesn’t offer specialized tools like ChatGPT does. Your only options are to copy, save as Markdown or PDF, or publish, which makes the piece available to anyone on the web.

Moreover, Claude lets you customize the chatbot's response style. It can be Normal, Concise, Explanatory, or Formal. You can even create a version tailored to specific needs.

Claude chatbot response style selection

Coding mode

When using ChatGPT’s canvas for code, the tools on the bottom right change to code review, port to language (PHP, C++, Python, JavaScript, TypeScript, and Java), fix bugs, add logs, and add comments.

ChatGPT canvas for code before and after adding comments.


Just like with writing, clicking Claude’s code button gives suggestions like “Create technical specifications” or “Develop coding standards.” Again, it doesn’t have predetermined “tools” like ChatGPT, but it comes with a big plus: it allows you to switch between the code and a live preview of the implementation.

Claude canvas for coding

That being said, ChatGPT has its separate and comprehensive Codex mode, a software engineering agent, which can be connected to GitHub and used to work with your code.

Claude’s other buttons

Claude also has “Learn” and “Life stuff” buttons that offer similar suggestions but don’t make much sense as they don’t act as new “modes”—you can click the suggestions, and that’s basically it.

Claude main interface screen

Claude’s Research mode

Claude also has a beta research mode to add your data to the web search. According to Anthropic, this mode makes Claude operate as an agent, conducting multiple searches to build on each other and determining what to research next. The end result comes with easy-to-check citations.

We tested Opus 4 in regular and “research” modes to see if there was any noticeable difference. For that, we used a simple but specific prompt: Find the five most common cyber attack types on organizations in the UK in 2025.

The difference was already visible in the number of sources scanned. The regular search engine used 20 sources and had this Top 5: phishing, impersonation, ransomware, DDoS, and supply chain. However, it didn’t say how many organizations experienced impersonation attacks.

Claude research mode

Research mode took over 10 minutes and checked over 220 sources. It ranked the Top 5 by the frequency of each attack type, and the list differed from the regular answer: phishing, business email compromise, malware infections, account takeovers, and ransomware. It also prepared a detailed report with sources.

Claude advanced research mode

Overall, we see ChatGPT as the winner in ease-of-use and usability. It has more extra features you can integrate, offers specialized tools to improve your writing and code, and has a vast set of custom GPTs that can ease your work. In other words, with OpenAI, you get the whole ecosystem of AI tools, while Anthropic is just starting to build around the base service.

Extra features

This one is a no-brainer—ChatGPT offers way more additional features than Claude. It all starts with image and video generation and continues with voice input and output.

Furthermore, it has plenty of custom GPTs for specific uses, such as a Python coding assistant or a grammar checker.

ChatGPT also supports 37 languages, while Claude still has language support in beta, with 11 options at the moment. You can access Claude in 176 countries, while ChatGPT is available in 185, although that list was last updated a year ago.

When it comes to security, ChatGPT offers multi-factor authentication (MFA). With Claude, you can only get MFA by logging in with an email provider that supports it, such as Gmail.

Integration and customization

When it comes to integration, Claude offers more opportunities than ChatGPT. It supports adding popular Google apps (Drive, Gmail, Calendar), GitHub, and allows custom integrations.

Meanwhile, ChatGPT gives you Google Drive, Microsoft OneDrive, and GitHub (beta).

However, ChatGPT is a clear winner in the customization field. You can create custom GPTs or use those that are already available. Among them, you can find SciSpace for searching 278M+ research papers, a writing coach, a brand logo designer, and more.

ChatGPT lets you personalize the model with custom instructions, where you can assign it traits and tell it what it should know about you.

You can also choose if it saves anything about you in its memory and uses that information to answer future requests.

On the other hand, Claude offers more options related to text output. As previously mentioned, you can choose the answer to be normal, concise, explanatory, or formal. Also, it lets you create and edit your own styles.

Claude also lets you select your work function to better suit your needs and offers a beta version of storing your preferences. There’s also a feature preview of an Analysis tool that can write and run code to process information, run analysis, and produce data visualizations in real time.

Overall, ChatGPT wins in terms of integration and customization. While Claude is better at integration, ChatGPT’s customization options give way more flexibility for different users.

Creative writing 

Claude excels in weaving words, at least by its reputation. After testing, this turned out to be true, although the difference was not that big. In fact, o3 managed to outscore Opus 4, but Sonnet 4 saved the day for Anthropic. You can find the input and output for this and other tests in this PDF document.

Writing a funny story about saving a princess turned out to be one of the most complex tasks for the contestants. The main criteria were the quality of the humour (yes, that’s pretty subjective), the logic of the story, and an ending with a strong punchline.

Sonnet 4’s story was more elegantly written, had pretty good jokes, and just one logical fallacy, while o3 also had only one but lacked a proper punchline. We also noticed that Sonnet 4 produced a considerably shorter story than the rest of the models, at 422 words, when the requirement was to aim for around 500.

In the meantime, Opus 4’s and 4.1’s prose had more than one logic issue, but the former’s jokes were a bit better.

Find out more about the task and the results in the test section for creative writing below.

Coding

At first, it seemed that ChatGPT 4.1 did the best job in our coding task, which was creating a simple e-shop. It was the only model to use React instead of plain HTML, CSS, and JavaScript.

However, when asked to ditch React, 4.1 delivered the worst result, evaluated just 78/100 by Gemini on Google AI Studio. 

In the end, Sonnet 4 was the winner both design-wise and code-wise. The table below summarizes the results.

Find out more about the task and the results in the test section for coding below.

Disclaimer: the author of this article is not a developer, which is why the code was evaluated with Gemini. Also, we tested only the front-end part and will be adding a more in-depth comparison in the future.

Code: ChatGPT 4.1 – React 91/100, HTML 78/100, average 85/100 (3rd); ChatGPT o3 – HTML 88/100 (2nd); Sonnet 4 – HTML 90/100 (1st); Opus 4 – HTML 85/100 (3rd)
UI/UX: ChatGPT 4.1 – 2nd; ChatGPT o3 – 4th; Sonnet 4 – 1st; Opus 4 – 3rd
Total: ChatGPT 4.1 – 2nd; ChatGPT o3 – 3rd; Sonnet 4 – 1st; Opus 4 – 3rd

High processing functions

For our high processing functions test, we asked ChatGPT and Claude to create a budget for a low-cost stag party in LA and explain the reasoning.

Most of the plans were pretty similar, offering pizzas, hamburgers, 60 to 90 cans of beer, up to 2.5 liters of spirits, and some fun activities. 

However, Claude’s suggestions assumed that the group would have a backyard with a grill and a big table for a 10-player poker game, which is already above the recommended nine-player limit.

ChatGPT o3 offered mini-golf, which is also not the best option for a group that size. Therefore, 4.1, with its Costco pizza and bowling, won the laurels.

Find out more about the task and the results in the test section for high processing functions below.

Place: ChatGPT 4.1 – 1st; ChatGPT o3 – 2nd; Sonnet 4 – 3rd; Opus 4 – 4th

Problem solving and analysing

In the problem solving and analysing test, we asked our AI assistants to troubleshoot a Windows laptop that became sluggish for no apparent reason.

Here, ChatGPT o3 was the clear winner. Not only did it offer the most possible causes (9), it did that in a very structured way, ending its answer with an easy-to-follow decision tree. It also added extra tips on what to do if even reinstalling Windows doesn’t help.

Probably the best thing about the o3 answer was that it used web search, without being told to, and found possible recent causes, such as an issue with the Windows Startup Boost released in May 2025.

4.1 did pretty well too, with seven possible causes, but they were pretty general and briefly described.

Sonnet 4’s answer was similar to that of 4.1 – it gave more (9) possible causes but failed to describe how to troubleshoot further with its recommended Linux USB.

Lastly, Opus 4 gave seven causes and six tips on what to try after. However, including “disabling visual effects” is not a serious suggestion for a mid-tier laptop struggling with performance. It also didn’t offer what to do if a Windows reinstall doesn’t help.

Find out more about the task and the results in the test section for problem solving and analysing below.

Explanation

In this test, we asked the models to explain string theory in layman’s terms and provide simple analogies where possible. Once more, o3 turned out to be the winner. It offered a well-structured and easy-to-understand explanation that covered not only the “strings” themselves but also other concepts, such as branes and supersymmetry.

Sonnet 4 produced the lengthiest explanation but covered fewer elements of the theory.

Finally, Opus 4 and ChatGPT 4.1 shared the 3rd spot because each left some key points poorly explained or without a proper analogy.

Find out more about the task and the results in the test section for explanation below.

Place: ChatGPT 4.1 – 3rd; ChatGPT o3 – 1st; Sonnet 4 – 2nd; Opus 4 – 3rd

Image recognition and description

To determine the champion of image analysis, we showed an image of a phonograph generated with Visual Electric.

All four AI assistants had no problems with this task. The answers were clear and based on facts without notable differences. They also correctly recognized the image as generated by an AI.

However, we decided to give the gold medal to o3 for slightly more complete image analysis.

Find out more about the task and the results in the test section for image recognition and description below.

Place: ChatGPT 4.1 – 2nd; ChatGPT o3 – 1st; Sonnet 4 – 2nd; Opus 4 – 2nd

Image generation 

ChatGPT is the automatic winner in this category since Claude has yet to offer this feature. The result according to our prompt was good, although the AI missed one key point in our prompt – a golden tooth. Find out more about the task and the results in the test section for image generation below.


Image generation prompts

Even though Anthropic’s models cannot create images, one way to compare Claude vs ChatGPT in this area was to see how good their prompts for an image-generating AI could be. We described what we wanted and fed it to our four chatbots. 

We didn’t subtract many points if the image generator (we used Visual Electric) didn’t do its best, as the point here was to get the best prompt, not the best image, which can depend a lot on the AI you’re using.

Here, o3 shined, providing a much more detailed description than the other three. It even included technical details of the camera that the AI should simulate, such as an 85mm lens. Additionally, it offered style keywords and negative prompts, which can help a lot when using a proper tool.

4.1 used the most simplistic language, which may be why the resulting image was the closest to what we wanted. On the other hand, it was the first iteration, so the AI might have tried to “upgrade” what it had already done when we tested the other three models.

Sonnet 4 and Opus 4 used more elaborate explanations than 4.1, but they were also relatively brief. However, Opus 4 gave a better style description, finishing just ahead of its less powerful brother-model (or model-brother).

Place: ChatGPT 4.1 – 4th; ChatGPT o3 – 1st; Sonnet 4 – 3rd; Opus 4 – 2nd

Video generation

ChatGPT automatically won this category because Claude doesn’t offer video generation yet. To be fair, free ChatGPT doesn’t have it either (check out the ChatGPT free vs paid article for a detailed version comparison).

We gave Sora, Open AI’s video creation tool, a pretty simple prompt to create a scene from an underground dance club, and it did a good job. While the people in the video didn’t look very realistic, we got all the requested details save for the “DJ nodding to the rhythm.”

Compared to Sora, Google Gemini has a way better video generator in Veo 3, so we’ll probably use it in the future.

Find out more about the task and the results in the test section for video generation below.

Video generation prompts

Just like in the image test, we checked how all four models can write a video generation prompt. Then, we asked Sora to create a 5-second video of a girl riding a bicycle in the 1920s, according to each.

The quality of the prompt was the main criterion, so even when the video turned out to be a little (or pretty) weird, we didn’t subtract many points.

ChatGPT 4.1 produced the best-looking result, although it took some liberties and decided to add elements we didn’t ask for.

To our surprise, the o3 video was the worst, even though the prompt was very detailed, both in technical and atmosphere description.

Sonnet 4 gave the second-best video, and the prompt was shorter but more detailed than that of 4.1.

Finally, Opus 4 gave probably the most detailed prompt that should have provided the best result if the video generation AI could follow each request precisely. Unfortunately, Sora made one big (and a bit funny) mistake that ruined the overall good output.

Place: ChatGPT 4.1 – 4th; ChatGPT o3 – 2nd; Sonnet 4 – 3rd; Opus 4 – 1st

Claude vs. ChatGPT: Which one performed better on our tests? 

While there are standardized tests to measure how good each AI model is at different tasks, we believe that testing with real cases shows a clearer picture.

For that, we ran eight different tests, namely:

  1. Creative writing
  2. Coding
  3. High processing functions
  4. Problem solving and analyzing
  5. Explanation
  6. Image generation and prompt writing
  7. Image recognition and description
  8. Video generation and prompt writing

The main criteria for evaluating these AI chatbots were factual accuracy and the absence of logical mistakes, bugs, or hallucinations. For the writing part, we also tested the ability to crack a joke and the level of literary language. Finally, for the image and video generation, we looked for a result that’s as close to the prompt as possible.

Below are the results of each test, along with prompts and detailed explanations. You can also see all the input and output in this PDF document.

ChatGPT 4.1 creative writing test results

ChatGPT 4.1 wrote a story of 468 words and instantly scored points for the dog’s name, Barkley. However, some jokes went nowhere, like the mayor stepping into Barkley’s water bowl or Barkley shaking “imaginary fleas of villainy.”

There were some logical inconsistencies or unexplained parts, like the teenager hero producing a flashlight instead of a sword, “because he couldn’t find it.”

The plot twist revealing that the princess was hiding from Prince Chadwick (another nice name) to avoid marrying him was good. But her suggested solution of staying with the wizard was strange – after all, who’s gonna stop the king or the prince from sending other heroes to “save” her?

Barkley’s offer to stay at the tower was also weird because they had their own home. And even though the tower had good Wi-Fi, the story emphasized that the dog was all about snacks.

Then there was an inconsistency where Barkley asked for a crown-shaped chew toy but got a royal scepter version. 

And the punchline? It was doomed to fail as it repeats Barkley’s preference for Wi-Fi. 

Finally, the whole story suggests that it takes place in a fantasy world, so Wi-Fi becomes an anachronism that doesn’t add much to the plot.

ChatGPT o3 creative writing test results

ChatGPT o3 started its story of 478 words in mid-action, with our two heroes already traveling to the wizard tower. 

To our surprise, the humor had a whole different, darker tone. The first “joke” was this: “Max Thunderfoot—sixteen, cocky, and convinced that acne is a mark of valor.” Dog Barkley also had a “Certified Genius” flea collar for no apparent reason.

One of the better jokes was that while the wizard’s door had a “please knock” sign, Max kicked it “because heroes.”

This time, the princess and the wizard were creating a plan to avoid her wedding to Prince Bloodax, who just wants to conquer everything.

Compared to the ChatGPT 4.1 story, Max's and his dog's differing opinions were much better grounded. The former believes that “rules are rules,” but the latter says that “saving millions [from conquering] is worth more than breaking one rule.”

However, the third option suggested by the princess broke the story as it offered “expose the forced-marriage scheme.” To whom, we ask. And how?

Most importantly, it lacked the punchline.

To sum up, everything went pretty well, but the last 25% of the story got derailed.

Sonnet 4 creative writing test results

Sonnet 4’s story was relatively short – just 422 words. Just like o3, it started already in action, at the wizard’s door. The writing was way smoother compared to ChatGPT models. 

The first words that we could consider a “proper joke” arrived only when the wizard spoke to the teenage hero:

Jake lowered his sword. "Aren't you supposed to cackle evilly or something?"

"I cackled once. Terrible for the throat.”

The reason why Princess Melody was in the tower wasn’t funny at all – her stepmother hired assassins to kill her for the inheritance. The not-so-evil wizard turned out to be sheltering her.

The moral choice was between Jake’s idea of storming the castle to confront the stepmother directly and his dog Buster’s suggestion to gather evidence and present it to the king.

The princess’s third choice was also a sudden anachronism, and nothing would have stopped her from doing it earlier on her own: she suggested posting about her stepmother on social media and going viral. The punchline was OK, but its impact was diminished by the princess’s lack of logic.

Overall, it was the best story of all four.

Opus 4 creative writing test results

Opus 4 already included a joke in the title, naming the princess Complainsalot. The dog also got an English pun-name, Barksworth, although his breed was a German Shepherd. The teenager had a simple name, Jake, though.

While Sonnet 4 made the dog the brave one, Opus 4 chose the teenager as the one who pushes forward. 

There were some decent jokes, like “Jake kicked open the door heroically, which would have been more impressive if it hadn't been unlocked.”

It also did a nice tie-in with the princess’s name and why she was in the tower. Turns out the wizard is a therapist, and she was sent for anger management after pushing 17 servants into the moat.

The first weird artifact showed up when the wizard said in a sad voice that he was volunteering at the orphanage instead of being proud or something like that. Also, the wizard said that the kingdom sent the princess, so it’s not kidnapping by any means. But Jake argued that “kidnapping is wrong,” and they should return the princess to the castle.

And the punchline? Well, you be the judge:

"Well," the dog grinned, "it turns out the real princess was the anger issues we diagnosed along the way."

Jake groaned. "I'm getting you neutered."

To sum up, Opus 4 fared not much better than 4.1 logic-wise, though it had somewhat better jokes.

ChatGPT 4.1 coding results

ChatGPT 4.1 needed 120 lines of code and concisely commented on its work. Here’s an excerpt from the output:

Here’s a React webpage for your e-shop, featuring a modern, interactive UI. You can copy this code into your Next.js or React app. It uses Tailwind CSS for styling, interactive sorting, and includes product image placeholders.

If you need a plain HTML/JS/CSS version, or have a different tech stack, just let me know!

We liked the explanations, and the website was nice with the placeholder icons reacting to mouseovers. The sorting worked as intended.

ChatGPT 4.1 coding test – website React code and design

Finally, we connected to Google AI Studio and asked Gemini 2.5 Pro preview 05-06 to evaluate the code on a scale from 0 to 100. The AI described it as “very good,” giving it 91/100. ChatGPT 4.1 scored maximum in functionality and correctness of core logic and nearly maximum for readability and cleanliness.

When it came to areas for improvement, it mentioned minor fixes that would benefit scalability, accessibility, and memoization.

However, we asked the tool to offer an HTML version for a fair comparison with the rest.

This one required 232 lines.

ChatGPT 4.1 coding test – website design

The design was nice, but there was a usability issue—switching the sorting from ascending to descending required clicking the same button. This got us a bit confused, and while it works great after you get it, it’s still not as good as a good old drop-down.

Google AI Studio gave it just 78/100, primarily because of the reliance on innerHTML for rendering, the use of onclick instead of addEventListener, and the global scope of JS variables or functions. Some points were also deducted due to missing accessibility details (aria-sort) and the lack of a visual sort direction indicator beyond text.
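For readers curious what Gemini was getting at, here’s a minimal, hypothetical sketch of our own (not ChatGPT’s actual output) of the pattern it favored: attaching the sort handler with addEventListener instead of an inline onclick, and keeping aria-sort in sync so assistive technology knows the sort direction. The element IDs and data attributes are made up for the example.

// Hypothetical sketch: sorting wired up without inline onclick and with aria-sort updated.
const header = document.getElementById('price-header');   // assumed <th id="price-header">
const tbody = document.querySelector('#products tbody');  // assumed <table id="products">
let ascending = true;

header.addEventListener('click', () => {
  ascending = !ascending;
  // Announce the current sort direction to screen readers.
  header.setAttribute('aria-sort', ascending ? 'ascending' : 'descending');

  // Reorder the existing rows by their data-price attribute and re-append them.
  const rows = Array.from(tbody.querySelectorAll('tr'));
  rows.sort((a, b) => {
    const diff = Number(a.dataset.price) - Number(b.dataset.price);
    return ascending ? diff : -diff;
  });
  rows.forEach((row) => tbody.appendChild(row));
});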

ChatGPT o3 coding results

Since this is a “reasoning model”, naturally, we had to wait longer (3 minutes) for the results. But were they any better?

The o3 went straight to 200 lines of HTML. As you can see below, the explanation was way more expert-oriented and full of terminology, parameters, and attributes.

Below is a single-file webpage—save it as furniture_shop.html, open it in any modern browser, and click the Product / Price / Date added headers to toggle ascending ↕ descending sorting. All styling is self-contained; replace each grey square (.photo-placeholder) with a real product image when you have one.

How it works

  1. Product data is baked into the rows via data-* attributes (data-name, data-price, data-date).
  2. Clicking a sortable header toggles the direction (asc ↔ desc) and calls sortTable, which rearranges the <tr> elements in-place. Tiny arrows beside the header show the active sort and its direction.
  3. The responsive block makes the table collapse into mobile-friendly “card” rows for narrow screens.

ChatGPT o3 coding test – website design for desktop

The design looked nice, although the shop description was just one simple sentence. There were no indicators that sorting was possible until you hovered the mouse over a header. There was also a noticeable design issue – the width of the columns shifted while sorting by “Price” or “Date added.” Also, the “mobile version” design looked a bit weird, with all categories clogging the right side.

ChatGPT o3 coding test – website design for mobile

Despite this obvious flaw, Google Gemini gave this project 88/100, praising its CSS organization, JavaScript best practices, efficient DOM updates, and overall code structure. But it seems to have missed the aforementioned flaw, describing the UX as “very good.”

It did note that the sorting arrows didn’t appear until the first click, though. Plus, it again mentioned accessibility concerns due to the missing aria-sort attribute. The main reason for the lower score, however, was the data management strategy: the product data is embedded and duplicated in the HTML, whereas generating it in JS would be the preferred way.
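As an illustration of that “data in JS” approach (again, a rough sketch of ours with made-up product names and IDs, not Gemini’s or o3’s code), the products live in a JavaScript array and the table rows are generated from it, so nothing has to be duplicated in the markup:

// Hypothetical sketch: product data kept in JS and rendered into the table on load.
const products = [
  { name: 'Oak chair', price: 89, dateAdded: '2025-05-12' },
  { name: 'Walnut table', price: 349, dateAdded: '2025-04-30' },
];

const productsBody = document.querySelector('#products tbody'); // assumed <table id="products">

function renderRows(items) {
  productsBody.replaceChildren(); // clear any previously rendered rows
  for (const item of items) {
    const row = document.createElement('tr');
    for (const value of [item.name, `$${item.price}`, item.dateAdded]) {
      const cell = document.createElement('td');
      cell.textContent = value; // textContent avoids the innerHTML issues Gemini flagged
      row.appendChild(cell);
    }
    productsBody.appendChild(row);
  }
}

renderRows(products);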

Claude Sonnet 4 coding results

The description of what it had done was similar to ChatGPT 4.1’s in its simplicity. However, it was obvious that Sonnet 4 did way more.

I've created an interactive furniture shop webpage with all the features you requested. Here's what I've included:

Key Features:

  • Header: "Furniture Shop" with an attractive description of the product selection
  • Interactive Sorting: Dropdown controls to sort by product name, price, or date added in both ascending and descending order
  • Product Display: Clean product cards with placeholder icons (🪑 for chairs/tables, 🛏 for beds)
  • Modern Styling: Gradient backgrounds, glassmorphism effects, hover animations, and responsive design

Design Highlights:

  • Animated header with shimmer effects
  • Product cards that lift and scale on hover
  • Smooth transitions and modern color scheme
  • Mobile-responsive layout
  • Visual feedback for interactive elements

Functionality:

  • Real-time sorting without page refresh
  • Formatted dates and prices
  • Clean, organized product information display

The end result surprised us in a positive way. It was the best-looking version by far, with icons, a solid shop description, multiple colors, and gradients. Yes, the sorting arrows were too close to the button’s edge, and having two drop-downs is not that efficient, but it leaves no room for guessing how to navigate the website.

Claude Sonnet 4 coding test – website design

The responsive design worked really well.

We were also unsurprised that this took way more code to create (326 lines).

Gemini agreed with us, giving it a 90/100 score and calling it “very polished and visually engaging.” However, it suggested minor improvements in more areas, such as DOM manipulation details, one script not being wrapped in an IIFE, and the same aria-sort accessibility concern as the rest of the contestants.

Claude Opus 4 coding results

This one had 271 lines of code and looked quite different (and more ascetic) from the Sonnet’s version. However, the main UX/UI “flaw” was using separate buttons for ascending and descending sorting. Also, changing the sorting category wouldn’t automatically launch the sorting, so there’s always a minimum of two clicks before you get the result. 

The shop description was short and simple:

I've created an interactive furniture shop webpage with the following features:

  1. Header and Description: A styled header with "Furniture Shop" title and a welcoming description of the product selection
  2. Product Display: Six product cards displaying:
    • Image placeholder (gray box with text)
    • Product name
    • Price in green
    • Date added (formatted nicely)
  3. Sorting Functionality:
    • Dropdown menu to select sorting criteria (name, price, or date)
    • Two buttons for ascending and descending sort
    • Works dynamically without page reload
  4. Styling Features:
    • Clean, modern design with a dark blue header
    • Responsive grid layout that adapts to screen size
    • Hover effects on product cards (lift and shadow)
    • Professional color scheme
    • Smooth transitions and animations

Claude Opus 4 coding test – website design

Gemini gave this project a score of 85/100, deducting points for JavaScript in the global scope, a missing aria-live attribute, and the same DOM manipulation details. There was also an unused CSS class.

ChatGPT 4.1 high processing functions test

ChatGPT 4.1 offered to spend $130 on food, $120 on drinks, $210 on activities, and $40 on extras.

Food included 3 pizzas, 20 hot dogs, 4 family-sized bags of chips, as well as various sauces, and even a dessert (brownie or cookie tray).

The model assumed there would be “moderate” drinking, so it went for 72 cans of beer, 1 bottle of vodka, and 1 bottle of whiskey, plus non-alcoholic options. 

When it came to activities, it suggested avoiding expensive clubs and bars. Axe throwing was one option, but it exceeded the budget, so 4.1 suggested sticking to 2 hours of bowling for $130. We liked that it also included the Uber/Lyft expenses.

While it might be challenging to find a bowling alley with lanes for 10 people, splitting into two five-stag groups and mixing them up doesn’t sound too bad compared to some suggestions from the other tools.

Then, it gave a sample timeline, with the party starting in the backyard, going bowling, and returning for more games and snacks.

ChatGPT 4.1 also gave pro tips for staying on the budget, such as choosing Costco, checking for groupons, and booking the bowling alley in advance.

Overall, this was quite a nice plan for a shoestring budget event. However, it didn’t offer anything location-related.

ChatGPT o3 high processing functions test

Naturally, the o3 model spent way more time “reasoning” through the options. But was it worth the extra compute time and the tokens spent?

The model did quite a lot of web browsing to confirm the prices in Los Angeles. It took 4.5 minutes in total.

The allocation was $90 for food (considerably less than 4.1), $140 for drinks, and $215 for fun and games. 

It went for 5 pizzas, arguing that “Pizza = maximum calories per dollar and no cooking fuss,” and 3 bags of chips. It also added “chips/salsa/guac that satisfy grazing between games and pair well with beer & tequila.”

The drinking plan was less beer-focused (48 cans), suggesting 1.75 liters of Jose Cuervo Especial Silver tequila and 750 ml of Jack Daniels instead. The o3 model also said that “water, cups, soda, and ice keep everyone safe & hydrated.”

It had exact options for activities: Sherman Oaks Castle Park 18-hole mini-golf, spikeball, beer pong, and Cards Against Humanity. However, 10 people playing mini-golf might be a bit too much.

We liked that, in case somebody already owned Spikeball or Cards Against Humanity, it suggested spending the extra money on “booze” or extra mini-golf.

Then, o3 gave some advice, such as picking the pizzas up from Costco’s food court right before the guests arrive. It also said that mini-golf is “walk-up friendly,” though we never specified where exactly everyone would be meeting. The model also suggested that spikeball can be played on Santa Monica or Venice Beach to “stay coastal.”

Finally, it reminded everyone to stay “hydrated,” which is a very humane, even motherly thing to do. It also sent congrats to the groom-to-be.

Claude Sonnet 4 high processing functions test

According to Sonnet 4, food and drinks “are the foundation of any good party,” so it allocated two-thirds of the budget for it, leaving just 25% for the activities and 4% for extras. Let’s break it down.

For $150, Sonnet 4 offered to get BBQ and grilling essentials for hot dogs and hamburgers, adding chips and snacks. While this is nice and all, this means assuming that one of the stags owns the grill and has a backyard (the AI noted that).

For drinks, it offered 96 cans of beer, 2 mid-range whiskeys, and 1 vodka. Now that sounds more like a stag party! And it still allocated $10 for 2 cases of water to stay hydrated.

Sonnet 4 said it put heavy emphasis on beer as it’s a stag party staple and suggested leaving the spirits “for special toasts.”

The chatbot also decided to keep all the entertainment and activities in the backyard, suggesting cornhole, poker, and other card games. However, just like the mini-golf, ten people playing cards might be a bit too much.

It also suggested getting a Bluetooth speaker ($40), three disposable cameras for “memorable photos without phones”, $20 for party decorations, and $35 for 10 mid-range cigars as an optional over-the-budget “upgrade for the groom.”

We liked that it offered a number of cheap stores, such as Costco, Sam’s Club, and Trader Joe’s. And just like o3, it recommended specific LA locations, namely Venice Beach or Griffith Park.

We also liked that it gave a friendly reminder about the “local noise ordinances” in case you're staying in the backyard late. 

Claude Opus 4 high processing functions test

Opus 4 surprised us with a brief response. Like Sonnet, it offered a BBQ with hamburgers and hot dogs, adding 5 lbs of chicken wings and spending $150 in total. For drinks, it offered just 60 cans of beer (but a “decent” one) and $40 for vodka or whiskey. 

Opus 4 suggested Go-Kart racing for activities, but admitted it would be over the budget. Instead, it offered to hit the beach and do some frisbee or volleyball, the latter probably not as fun as bowling or mini-golf when you’re drinking. Opus 4 also mentioned poker and cornhole.

It also offered to consider the BYOB approach and hit the happy hours in Santa Monica or Downtown LA.   

ChatGPT 4.1 problem solving and analysing test results

ChatGPT 4.1 provided seven common causes of computer slowdown, involving issues like too many background processes, outdated drivers, and low disk space or fragmentation. It also warned about malware.

The chatbot mentioned that an aging SSD could be the cause, but in general, SSDs should be fine for at least five years. Of course, that depends on the amount of data you write into them.

Next, 4.1 offered a step-by-step guide for each of the 7 cases. For malware, it suggested scanning with Windows Defender or reputable third-party software, “such as Malwarebytes.”

In general, all guides were brief but pretty easy to understand and follow.

If all of the above failed, 4.1 suggested creating a new user to see if the performance improved. Another piece of advice was to reset Windows (without reinstalling) with the option to “keep my files.”

Finally, the last two options were a complete reinstall and a RAM and SSD checkup. It also gave a general tip to restart the laptop frequently and keep the programs to a minimum.

ChatGPT o3 problem solving and analysing test results 

After less than two minutes of thinking, o3 came back with nine possible reasons and a detailed description of each one. It did proper web research, suggesting some causes like the new Windows Startup Boost that appeared only in May 2025.

We were pleasantly surprised that some of the suggested reasons, such as BitLocker software encryption or Explorer.exe memory leak (KB5055523, April 2025), were very specific.

When it came to SSD health, it also offered a nuanced suggestion that relates to old firmware or truncated TRIM.

Then, the model wrote some “classic maintenance steps,” such as running system file checks and performing a clean boot test.

Probably the nicest thing about o3’s response was the “suggestion tree”:

App launches still > 5 s after steps 1–3?
▸ Yes → run a clean boot; if fast, re-enable startup items one-by-one.

With the final being:

Still slow after repair + drivers + firmware?
▸ Time to backup and clean reinstall, or consider an SSD / RAM upgrade.

Sonnet 4 problem solving and analysing test results  

Sonnet 4 offered a simple, unformatted list of nine possible causes and ways to fix them. There weren’t any tips that ChatGPT didn’t mention, save for booting from a Linux USB drive “to determine if it's a hardware versus software problem.” However, there were no further instructions on creating such a USB or on what would indicate whether software or hardware is at fault.

Overall, the answer was on par with 4.1 with two extra tips, but way less detailed than o3.

Opus 4 problem solving and analysing test results

Opus 4 gave a more detailed answer than Sonnet 4 with more step-by-step instructions. In total, it offered seven possible causes and six things to do if the problem persists.

Among the former were accumulated temporary files and cache (not mentioned by other models), but the seventh one—visual effects and animations—shouldn’t be considered a possible cause for a mid-tier laptop.

The latter had a new tip to check the event viewer and see if any critical errors or warnings would coincide with the slowdowns.

It didn’t go the extra step, though, and didn’t offer advice on what to do if the Windows reinstall didn’t help.

ChatGPT 4.1 explanation test results

ChatGPT 4.1 offered an explanation in 433 words. It gave a pretty good analogy with a violin string and many ways it can sound. In the same way, strings in this theory can be different particles, like electrons or photons.

However, it repeated itself by giving another analogy to guitar strings.

The chatbot also gave three strong and weak points of the theory that sit well with its current status.

Unfortunately, we felt it lacked explanation of its other elements, such as extra dimensions as a requirement for it to be true. It also failed to mention that there are multiple string theories.

ChatGPT o3 explanation test results

After reasoning for 39 seconds, it spewed a well-structured 468-word response. The model already surpassed 4.1 by adding that the string's vibration determines not only the particle type but also its properties, such as charge, mass, or a different force of nature.

ChatGPT o3 explained five concepts in total, adding extra dimensions, branes, supersymmetry, and m-theory. For extra dimensions, it gave a pretty nice analogy:

An ant on a garden hose: from far away the hose looks 1-D, but up close the ant also walks around the circular dimension.

It gave three pros and four cons of the theory and offered a “balanced evaluation,” but it used quite a lot of unexplained or poorly explained jargon.

It ended with a nice analogy, though: Think of it as a grand architectural blueprint: elegant and self-consistent, yet awaiting the engineering test of reality. 

Sonnet 4 explanation test results

Sonnet 4 was more verbose than both ChatGPTs, using 637 words. It explained three key elements – the strings themselves, the extra dimensions, and the unification. 

For extra dimensions, it gave the same garden hose example as o3 but didn’t put an ant on it, which would have added an extra dimension to the explanation.

We appreciated the attempt at an analogy it used for the unification: 

Instead of needing separate instruction manuals for your TV, microwave, and computer, imagine having one universal manual that explains how everything electronic works.

However, we do have a manual on how electronics work in general, and it's not clear whether the model meant using all electronics with a universal (remote) controller or something else.

Furthermore, Sonnet 4 gave four pros (adding that it predicts new physics) and five cons of the theory, providing an excellent explanation for its untestability: we’d need particle accelerators as big as galaxies for that.

It also criticized string theory for its many versions and for mostly explaining what we already know instead of predicting new phenomena.

Just like the o3 model, Sonnet 4 ended with an amusing analogy:

Until we can test it, we're essentially admiring a masterpiece painting while wondering if it's a photograph of reality or just a very convincing work of art.

Opus 4 explanation test results

Much to our surprise, the Opus 4 explanation was shorter than Sonnet 4’s, using just 477 words. It started off rather complicated, saying that “Traditional physics imagines particles as tiny dots with no size.”

It also used the garden hose example (still, no ant!) and said there could be millions of possible string theories, but just one describes our universe.

To sum up, it was more concise and less understandable than Sonnet 4.

For the image recognition test, this was our prompt:

Analyse this image and answer the following questions:
What is this object?
What it is used for?
At what time in history it was used?
Was it replaced by another object?
What it is made from?
Is this a real or AI-generated image?
Explain how you came up with the answers.

Image of a phonograph generated with VisualElectric Imagen 3

You can find the outputs in this PDF document.

ChatGPT 4.1 image recognition and description test results

ChatGPT 4.1 answered all the questions really well in 294 words. It also said that the image is either made by an AI or “extensively digitally rendered” due to “clean, almost hyper-realistic texture, perfect lighting, and minute flawlessness in the details (wood grain, brass reflection, and lack of any dust or imperfections).”

ChatGPT o3 image recognition and description test results

The o3 model also had no problems describing the phonograph in 310 words. But it also gave a bit more details than 4.1, explaining what the turn-table plate and the internals were made of. 

The AI chatbot provided more details when determining the decline of gramophone usage. 4.1 mentioned the 50s while o3 said that it started earlier when the “electrically-amplified record players arrived in the mid-1920s and radios became common in the 1930s.” 

ChatGPT o3 was also correct in deciding that it was an AI-generated image and gave more arguments for it. However, it left out the possibility that the image was artificial but made without the help of AI.

Sonnet 4 image recognition and description test results

Sonnet 4 also did a great job in 298 words, sometimes giving more and sometimes less detailed explanations than o3. For instance, it said that gramophone use began to decline in the 1930s, but didn’t say it was due to the radio or its electric version.

However, the model gave more detailed material descriptions:

The horn appears to be made of brass (evidenced by its golden metallic appearance), while the base cabinet is made of wood, likely mahogany or another hardwood given its rich brown color.

It also provided well-grounded arguments why this was an AI-generated image, noticing that the background looks like a composite of period-appropriate elements. Of course, one might say that the photo was taken in a museum or from a period movie.

Opus 4 image recognition and description test results

Opus 4 also had no issues, using 279 words and citing the peak popularity period of gramophones as 1910-1930s. 

Just like Sonnet 4, it gave more detailed guesses about the materials (mahogany or walnut for the cabinet base). 

It said that “the turntable platter appears to be covered in felt or similar material,” but didn’t say what the platter itself was made from (yes, we’re nitpicking here).

There were no issues with detecting the AI origins, and the arguments for that were solid.

For the image generation test, this was our prompt:

Generate a photo-realistic portrait of a middle-aged man with a green fedora and a wide smile with a golden tooth. The man has earrings and a goatee.

ChatGPT image generation test results

The result was not shabby at all. However, it forgot to add the golden tooth, although in its response, the AI said that the man is “flashing that gold-tooth grin.”

ChatGPT test – image generation according to prompt

Image generation prompt

For the second part, we asked all four models to create a prompt for an image-generating AI and checked which result was the most precise and detailed. We didn’t subtract many points if the image itself was sub-par, as this has more to do with the AI (Visual Electric Ideogram 3) used for generation.

Here’s our request for prompt generation:

Create a detailed prompt for an AI image generator. It should depict a photo-realistic portrait of a man in an outdoors background. Here are the details for the portrait and the background.

The portrait: a middle-aged man with a green fedora and a wide smile. He has one golden tooth. The man has earrings and a goatee. Use cool colors.

The background: a desert. The sun is at its zenith, behind the man, making him darker than the rest of the image. There’s a pyramid in the horizon. The background is slightly blurry, as if seen in a mirage. Use warm colors.

You can find the outputs in this PDF document.

ChatGPT 4.1 image generation prompt test results

The 4.1 prompt was nice – it added details like the goatee should be well-groomed (VisualElectric failed on this and added one extra pyramid) and decided to match the clothes with the fedora color. We don’t have much to complain about.

Image of a man in a desert generated by VisualElectric according to the ChatGPT 4.1 prompt

ChatGPT o3 image generation prompt test results

The o3 model clearly did a better job, giving a longer and way more detailed prompt with an aspect ratio, style keywords, and negative prompts. Interestingly, it changed “photo-realistic” to “ultra-realistic,” which might have resulted in a too-AI’ish output.

The model also provided extra details, such as “short salt-and-pepper goatee” and “golden incisor,” and used more terms like hues, bokeh, soft-focus, or pale amber. 

ChatGPT o3 also dedicated a whole paragraph to describing the supposed camera used in the picture, including its lens and resolution, and asking for “no harsh HDR artifacts.”

Finally, it described the overall mood as adventurous, approachable, and timeless.

There’s no way that VisualElectric could have given us everything. And it didn’t.

Image of a man in a desert generated by VisualElectric according to the ChatGPT o3 prompt

It ditched the extra pyramid but also forgot to include the golden incisor (maybe because the prompt lacked the word “tooth”). The portrait also had warmer colors, probably because the prompt was a bit too much for Ideogram 3.

Sonnet 4 image generation prompt test results

Sonnet 4 was brief like 4.1 and added a nice detail: the man’s golden tooth catches the light. Its color descriptions were more precise than 4.1’s, going for “burnt oranges,” “sandy beiges,” and the like.

It also decided that the man should have not just simple but “distinctive” earrings.

The result? The AI put the sun “directly behind the man,” as the prompt asked, ignored the golden tooth, but remembered the earrings. Visual Electric also went for two pyramids and even some birds. However, the image was the least realistic of all four.

Image of a man in a desert generated by VisualElectric according to the Claude Sonnet 4 prompt

Opus 4 image generation prompt test results

Opus 4 decided not to waste much time on this, providing the shortest description at fewer than 200 words.

It went back from Sonnet’s “distinctive” earrings to plain hoop earrings and added that the sun directly behind the man should create a “dramatic halo effect.”

Similarly to Sonnet 4, it ended with a style description, but it was more photo-related, using phrases such as “shallow depth of field” or “professional DSLR quality” (yeah, we also had to google it).

The result?

The sun ended up in front of the fedora, “small hoop earrings” were medium-sized at best, and the portrait still used warm colors. We also didn’t notice the “high contrast” between the man and the background.

Video generation

As with images, Claude can’t generate video, so only OpenAI’s side took this test, using Sora. We opted for a 5-second 720p video with a 3:2 aspect ratio. Here’s our prompt:

A crowd dressed in white dancing and fist-pumping in an underground club. A female DJ with a green mohawk is playing techno music with turntables, nodding to the rhythm.

Video of a party in a club generated by OpenAI Sora

Sora didn’t miss any details, except maybe the “nodding to the rhythm” part. However, the humans, especially the DJ, looked quite artificial.

Video generation prompts

Now, to invite Claude to the party, we asked all four models to generate a detailed video prompt based on our request to see which one would provide the best instructions. Here’s what we asked for.

Provide a detailed prompt for a video creation AI using the following details.

Video length: 5 seconds.

Video setting: 1920s – describe the objects so the AI creates them for this setting.

Scene: A 10-year old girl is riding a bicycle on a countryside road from left to right. She’s wearing a red polka-dot dress. Her hair is black and her shoes are brown. She has a brown backpack.

Background: a countryside in the US south during the summer. It’s flat, green grass everywhere, and a forest on the horizon.

Time of the day: an hour before sunset. The sun is on the left side.

You can find the outputs in this PDF document.

ChatGPT 4.1 video generation prompt test results

ChatGPT 4.1 created a very detailed video prompt, full of technical specifications (the camera is set at about the girl’s height, slightly to the side, and pans slightly as she rides by) and in-depth descriptions on creating that 1920s feel.

However, the decision to add “vintage mailboxes” was a bit weird since the prompt didn’t include any houses. Also, the time of day should’ve been an hour before sunset, but ChatGPT went for the afternoon (maybe children should be home by then?!). It also contradicted itself later, saying that the sun should be low on the left side.

Video of a girl riding a bicycle generated by OpenAI Sora according to the ChatGPT 4.1 prompt

Sora did a pretty good job, although it went with white shoes instead of brown. The sun wasn’t visible but was apparently on the right side (the opposite of what we asked), because the shadow cast by the girl and the bicycle fell to the left.

ChatGPT o3 video generation prompt test results

The o3 model’s response was similar in length (nearly 400 words) but more structured, with bullet points and short sentences.

The descriptions or requirements were also more concrete, such as “Static tripod shot, 16 × 9 aspect, eye-level at ~1.3 m (child’s height) looking straight down a flat dirt road.”

And where 4.1 asked for “a classic black or dark green bicycle with a simple round frame,” the o3 model went for “child-sized single-speed frame, dark-green paint slightly chipped, swept-back handlebars, chrome bell, balloon tires, full metal fenders, small wicker basket up front (empty).”

In theory, this should help Sora deliver a better (or at least more precise) result. In practice, it was nothing short of a disaster. Not only did most of the details not make the cut, but the girl didn’t fit the frame, there were three pedals, no backpack, and no background.

Video of a girl riding a bicycle generated by OpenAI Sora according to the ChatGPT o3 prompt

One thing it got right was the sun—it was not visible, but the shadows cast on the right looked pretty realistic.

Sonnet 4 video generation prompt test results

Sonnet 4 went for a much shorter description – just 240 words. Just like the o3 model, it structured the prompt in headings and bullet points.

The description details were somewhere between 4.1 and o3, with “Black hair styled in period-appropriate bob cut or braids” and “1920s children's bicycle with high frame and large spoked wheels, simple design with minimal gears, leather seat.”

It didn’t suggest adding any extra objects, such as mailboxes or utility poles, just atmosphere-enhancing details like “Gentle breeze effect on the girl's dress and hair.”

Video of a girl riding a bicycle generated by OpenAI Sora according to the Claude Sonnet 4 prompt

Unfortunately, Sora didn’t do a good job. There was no backpack, the shoes were white instead of brown, and the sun was low but on the right. Also, it looked as if the bicycle was gliding above the road.

Opus 4 video generation prompt test results

Opus 4’s description was longer than Sonnet’s (300 words vs 240) and structured in the same manner. It had an extra heading named “motion dynamics” for hair motion and bicycle wobble. 

The model also offered more optional details, such as “bicycle basket in the front” or “fireflies beginning to appear.” 

Objects were also described in more detail overall, such as a “country road with visible wheel tracks” or “wheels creating small dust clouds.” Once again, we didn’t expect Sora to pick up on any of these.

Surprisingly, Sora decided that this description violated its policies and refused to generate the video. We removed the prompt segments one by one to find the culprit; for some reason, it turned out to be the bicycle details.

With the offending part removed, the results turned out pretty OK, except for the fact that the bicycle was moving… backwards.

Video of a girl riding a bicycle generated by OpenAI Sora according to the Claude Opus 4 prompt

Other than that, we got the dust and the lighting from the left (even though the setting sun was not visible). Sora also added puffy short sleeves, brown boots, and a wicker bag, but forgot the backpack.     

Claude vs. ChatGPT: What have we learned?

We learned quite a few things from this extensive Claude vs ChatGPT testing. Here are the key takeaways:

  • Individual and Pro plans cost almost the same, but ChatGPT offers more bang for the buck. Its API access is also way cheaper (see the sketch after this list).
  • The Claude and ChatGPT benchmark results are pretty similar, save for some instances, like Reasoning or Math, where 4.1 is far behind.
  • Anthropic and OpenAI offer intuitive and user-friendly interfaces, but the latter has more tools for coding and writing that can make your life easier.
  • ChatGPT has more extra features, including a bunch of custom-made GPTs for specific tasks like scientific research. It also lets you build your own GPTs.
  • Claude took 1st place in Creative writing and Video generation prompts.
  • ChatGPT took 1st place in High processing functions, Problem solving and analysing, Explanation, and Image recognition and description.
  • Both Claude and ChatGPT shared the top spot for coding.
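
On the API pricing point above, here’s a minimal sketch of what a like-for-like call looks like through each vendor’s official Python SDK. The model IDs are assumptions (check each provider’s current model list); multiply the token counts in the returned usage objects by the published per-million-token rates to do your own cost math.

from openai import OpenAI
from anthropic import Anthropic

PROMPT = "Summarize the plot of Hamlet in two sentences."

# OpenAI call (expects OPENAI_API_KEY in the environment).
gpt = OpenAI().chat.completions.create(
    model="gpt-4.1",  # assumed model ID
    messages=[{"role": "user", "content": PROMPT}],
)
print(gpt.choices[0].message.content)
print(gpt.usage)  # prompt_tokens / completion_tokens for your cost math

# Anthropic call (expects ANTHROPIC_API_KEY in the environment).
claude = Anthropic().messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=300,
    messages=[{"role": "user", "content": PROMPT}],
)
print(claude.content[0].text)
print(claude.usage)  # input_tokens / output_tokens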

Here’s the detailed table with all the test results in one place and the final score. Coding and writing carried more weight than the other categories when deciding the overall winner. We also subtracted points from Claude due to its lack of image and video generation.

Category | ChatGPT 4.1 | ChatGPT o3 | Sonnet 4 | Opus 4
Ease of use and user experience | 1st | 1st | 2nd | 2nd
Extra features | 1st | 1st | 2nd | 2nd
Integration and customization | 1st | 1st | 2nd | 2nd
Creative writing | 4th | 2nd | 1st | 3rd
Coding | 2nd | 3rd | 1st | 3rd
High processing functions | 1st | 2nd | 3rd | 4th
Problem solving and analysing | 3rd | 1st | 2nd | 4th
Explanation | 3rd | 1st | 2nd | 3rd
Image recognition and description | 2nd | 1st | 2nd | 2nd
Image generation | Yes | Yes | No | No
Image generation prompt | 4th | 1st | 3rd | 2nd
Video generation | Yes | Yes | No | No
Video generation prompt | 4th | 2nd | 3rd | 1st
Final place | 3rd | 1st | 2nd | 4th

Claude vs. ChatGPT: Which one do we recommend?

In short, choose Claude if you need the best AI for creative writing and coding. Choose ChatGPT if you want more versatility and extra features, such as image and video generation, voice output, or the option to create custom chatbots.

However, at the pace that LLM training and AI technology in general are advancing, nothing is set in stone. For instance, Google Gemini is already miles ahead in video generation with its Veo 3. Therefore, we’ll continue updating this Claude vs ChatGPT comparison and other comprehensive match-ups, such as Perplexity vs ChatGPT, to provide you with the most accurate information about the best AI tools in 2025.

Karolis Pilypas Liutkevičius

Karolis Pilypas Liutkevičius is a journalist and editor covering the AI industry.
