First look at Claude and ChatGPT
Both Claude and ChatGPT are great large language models (LLMs), but their main strengths lie in different areas. Claude is the recommended pick for coding and creative writing, thanks to its more natural-sounding style. ChatGPT, however, is more versatile and offers extra features, such as image and video generation.
Here are the main benefits and drawbacks of ChatGPT and Claude discussed in more detail.
Benefits of Claude
The key benefit of Claude is its natural writing style. Its texts are more coherent and natural-flowing instead of being an amalgamation of paragraphs taken from different sources. Claude is also great for coding – it can give real-time feedback and visualize the end results with its Artifacts feature.
Drawbacks of Claude
The main drawbacks of Claude include its lack of features, namely image and video generation along with voice output. What’s more, Claude falls short in high processing functions, problem solving, and explanation. Finally, it could benefit from a larger context window.
Benefits of ChatGPT
The main benefit of ChatGPT is its versatility – it might not be the best in areas like writing, but it’s good enough in most cases. Where ChatGPT really shines is the number of features – you can have a voice conversation with it, create videos, images, and even your own specialized GPTs. Finally, all of this comes at a reasonable price, especially if you choose the API option.
Drawbacks of ChatGPT
ChatGPT’s versatility is also its main drawback. While it’s good for writing and coding, it’s not the best LLM for these tasks. For instance, it may sound robotic, with no artistic touch. Its code can also be far from optimal.
Claude vs. ChatGPT: a quick feature and pricing comparison
| | Claude | ChatGPT |
|---|---|---|
| Company | Anthropic | OpenAI |
| Latest AI models | Sonnet 4, Opus 4 | ChatGPT 4.1, ChatGPT o3 |
| Context window | Sonnet 4: 200,000 tokens (about 150,000 words); Opus 4: 200,000 tokens (about 150,000 words) | 4.1: 1,047,576 tokens (about 786,000 words); o3: 200,000 tokens (about 150,000 words) |
| Maximum output | Sonnet 4: 64,000 tokens (about 48,000 words); Opus 4: 32,000 tokens (about 24,000 words) | 4.1: 32,768 tokens (about 24,600 words); o3: 100,000 tokens (about 75,000 words) |
| Training data cut-off | Sonnet 4: March 2025; Opus 4: March 2025 | 4.1: June 2024; o3: June 2024 |
| Web search | Sonnet 4: Yes; Opus 4: Yes | 4.1: Yes; o3: Yes |
| Image generation | No | Yes (GPT Image 1) |
| Video generation | No | Yes (Sora) |
| Voice output | Sonnet 4: No; Opus 4: No | Yes |
| Input types | Sonnet 4: Text, image; Opus 4: Text, image | 4.1: Text, voice, image; o3: Text, voice, image |
| Available on free plan | Sonnet 4: No; Opus 4: No | 4.1: No; o3: No |
| Pricing (user) | Pro – $17/month; Max – from $100/month | Plus – $20/month; Pro – $200/month |
| Pricing (teams) | Team – $25/user/month if paid annually, $30/user/month if paid monthly, minimum 5 users | Team – $25/user/month if paid annually, $30/user/month if paid monthly, minimum 2 users |
| Pricing (API) | Sonnet 4: $3 per 1M input tokens, $0.30 per 1M caching input, $15 per 1M output tokens; Opus 4: $15 per 1M input tokens, $1.50 per 1M caching input, $75 per 1M output tokens | 4.1: $2 per 1M input tokens, $0.50 per 1M caching input, $8 per 1M output tokens; o3: $2 per 1M input tokens, $0.50 per 1M caching input, $8 per 1M output tokens |
The data in the table above already shows some key differences between ChatGPT and Claude:
- ChatGPT 4.1 offers a context window that is more than five times larger. This is crucial when working with large datasets or documents.
- ChatGPT o3 has the biggest maximum output. This is essential for creating large datasets or documents.
- The input-output ratio is heavily in ChatGPT o3's favor: roughly 2:1 for o3, 3:1 for Sonnet 4, 6:1 for Opus 4, and a whopping 32:1 for 4.1. In practice, that last figure means that if you want your debut epic fantasy saga manuscript edited, you'll have to run it through 4.1 in roughly thirty-two chunks and stitch the results together manually (see the back-of-the-envelope sketch after this list).
- Claude has much more recent data, with the cut-offs made in March 2025. However, this advantage can be virtually negated by using the web search feature.
- Only ChatGPT can generate images and videos. This is a big plus if you need an all-around tool.
- Only ChatGPT offers voice chat. You can select from different voices and have a conversation that’s later transcribed.
- ChatGPT’s Team plan requires just two users ($50/month), compared to Claude’s five ($125/month).
- ChatGPT gives more bang for almost the same amount of buck, unless you have no use for image and video generation or voice input and output.
- ChatGPT’s API is considerably cheaper, especially when it comes to output tokens.
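To make the input-output ratio point more concrete, here's a minimal back-of-the-envelope sketch in Python. The context window and maximum output figures are taken from the comparison table above; the assumption that your manuscript fills the whole context window is ours, purely for illustration.

```python
import math

# Context window and maximum output in tokens, taken from the comparison table above.
MODELS = {
    "ChatGPT 4.1":     (1_047_576, 32_768),
    "ChatGPT o3":      (200_000, 100_000),
    "Claude Sonnet 4": (200_000, 64_000),
    "Claude Opus 4":   (200_000, 32_000),
}

for model, (context, max_output) in MODELS.items():
    ratio = context / max_output
    # If your manuscript fills the whole context window, the number of passes
    # needed to rewrite it is bounded by the output limit per request.
    passes = math.ceil(context / max_output)
    print(f"{model}: input/output ratio ~{ratio:.1f}:1, "
          f"~{passes} passes to rewrite a context-filling manuscript")
```

Real token counts depend on the tokenizer, but the order of magnitude matches the figures in the list above.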
To sum up, ChatGPT offers more for the same (or even better) price. However, if you have zero interest in creative tasks, such as image and video generation, Claude may be a great option as well – that is, unless you need a much larger context window and output. If you'd like to learn more about how ChatGPT got to where it is today, take a look at our guide on ChatGPT evolution.
We also want to point out that it's possible to further reduce AI usage costs with the help of AI orchestration platforms, such as nexos.ai, which can help you enforce budgets and save tokens with caching – the rough sketch below shows why caching matters.
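As a rough, hypothetical illustration, the sketch below prices a long prompt (for example, a shared system prompt or reference document) that gets re-sent on every request, using the per-million-token figures from the pricing table. Real bills also include cache-write fees and output tokens, which we leave out here.

```python
# Fresh vs cached input price per 1M tokens, taken from the pricing table above.
PRICES = {
    "Claude Sonnet 4": (3.00, 0.30),
    "Claude Opus 4":   (15.00, 1.50),
    "ChatGPT 4.1":     (2.00, 0.50),
    "ChatGPT o3":      (2.00, 0.50),
}

PROMPT_TOKENS = 50_000  # a long shared prompt or document (hypothetical size)
REQUESTS = 100          # how many times it gets re-sent

for model, (fresh, cached) in PRICES.items():
    without_cache = REQUESTS * PROMPT_TOKENS * fresh / 1_000_000
    with_cache = REQUESTS * PROMPT_TOKENS * cached / 1_000_000
    print(f"{model}: ${without_cache:.2f} without caching "
          f"vs ~${with_cache:.2f} with cache hits")
```

The larger the repeated portion of your prompts, the bigger the gap between the two numbers.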
Please also note that at this stage we're comparing quantity rather than quality. Read on to learn how each model fares in different tasks as we put them through a series of meticulously designed tests.
Model capabilities: benchmark results
In this section, we'll talk about what you can and can't do with each model. We'll also check their performance on popular AI benchmarks that test how good a given tool is at different tasks, such as coding or writing.
Language understanding and generation benchmark results
ChatGPT offers a much larger context window (786,000 words for 4.1) and output (75,000 words for o3), which is crucial when working with huge documents.
However, Claude’s texts feel more human, meaning this AI is better for writing.
According to the LiveBench language test with questions last updated on April 25, 2025, Opus 4 (76.11) and o3 with high “reasoning effort” (High) settings (76.00) are the clear leaders, leaving Sonnet 4 in “thinking mode” (Thinking) behind with 70.19. Much to our surprise, ChatGPT 4.1 scored just 54.55.
Coding benchmark results
Claude and ChatGPT can be really handy for writing or debugging code. But which of the four AI assistants is the go-to choice?
Our tests of real-world applications, described later, didn't show a big discrepancy between them, but LiveBench tells another story. According to it, o3 with medium reasoning (Medium) (77.86) and Sonnet 4 (77.54) led the way, with Opus 4 Thinking taking third place (73.25). Once more, ChatGPT 4.1 came last, although not by much (73.19).
Reasoning benchmark results
Reasoning is an important overall capability of an AI, showing whether it can solve different tasks requiring logic, “common sense,” and so on. In our test, this was most evident in fiction writing, where all four models failed to avoid logical inconsistencies. We placed Sonnet 4 first and ChatGPT o3 second.
LiveBench showed similar results: Sonnet 4 Thinking (95.25) and o3 High (93.33) scored the highest. Opus 4 Thinking was third with 90.47, and ChatGPT 4.1 lagged way behind with 44.39.
Data analysis benchmark results
We didn't run a dedicated test for data analysis – tasks such as combining or reformatting tables – because LiveBench showed no significant difference between Claude and ChatGPT.
Opus 4 Thinking (70.73) and Sonnet 4 Thinking (69.84) led the way, but o3 Medium (68.19) and 4.1 (66.40) weren't far behind.
Mathematics benchmark results
We didn’t test mathematics ourselves and let LiveBench be the judge. The test included questions from high school math competitions, such as AMC12 and USAMO.
Claude's models took the lead again, with Opus 4 Thinking (88.25) and Sonnet 4 Thinking (85.25) scoring the highest. o3 High was right behind (85.00), which cannot be said of 4.1 (62.39).
Other capabilities
While the 4.1 model was often left behind in LiveBench tests, our own test results didn’t show such a big difference. Plus, we shouldn’t forget that Claude’s AI is limited to text output, whereas ChatGPT can give you images and videos.
Also, ChatGPT allows voice input and output, which can be really convenient, especially for mobile users and accessibility needs.
Ease of use and user experience
Both Claude and ChatGPT are easy to use, providing an intuitive and uncluttered user interface. You can access them either via the web interface or through their APIs (a minimal example of both is sketched below).
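For those who prefer the API route, here's a minimal sketch of what a single request looks like with each vendor's official Python SDK (the `openai` and `anthropic` packages). The model identifier strings are our assumptions based on the models compared above – check each provider's documentation for the exact names available on your account.

```python
from openai import OpenAI        # pip install openai
import anthropic                 # pip install anthropic

prompt = "Summarize the main differences between your latest models in two sentences."

# ChatGPT via the OpenAI API (reads OPENAI_API_KEY from the environment).
openai_client = OpenAI()
gpt_reply = openai_client.chat.completions.create(
    model="gpt-4.1",  # assumed model name; "o3" is another option
    messages=[{"role": "user", "content": prompt}],
)
print(gpt_reply.choices[0].message.content)

# Claude via the Anthropic API (reads ANTHROPIC_API_KEY from the environment).
anthropic_client = anthropic.Anthropic()
claude_reply = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name for Sonnet 4
    max_tokens=1024,                   # Anthropic requires an explicit output cap
    messages=[{"role": "user", "content": prompt}],
)
print(claude_reply.content[0].text)
```

That's the programmatic side in a nutshell; the rest of this section focuses on the web apps, which is where most users will spend their time.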
However, each comes with pros and cons that may matter to some users.
ChatGPT has a big usability advantage due to its voice mode. You can choose from 9 distinct male and female voices to find the one that’s most pleasing to your ear. There’s even a selection of British accents!
It can do a web search just like the text version. After finishing the conversation, you’ll find its transcript in the main window.
The main chat interfaces have a text input field and some extra options. ChatGPT lets you add photos or files from your computer. Alternatively, you can connect your Google Drive or Microsoft OneDrive. In contrast, Claude has a “take a screenshot” feature and allows Google Drive and GitHub uploads. The latter is handy for developers.
ChatGPT’s tools menu offers “create an image”, “search the web”, “write or code”, and “run deep research”.
The logic of Claude’s “Search and tools” button is a bit different. Upon pressing it, you can toggle web search and extended thinking, run a query in Drive, Gmail, or Calendar, and add integrations.
Furthermore, ChatGPT offers three themes: system, dark, and light. Claude has the same themes and additionally lets you choose a default, system, or dyslexia-friendly font.
Write mode
With ChatGPT, clicking “Tools” and “Write or code” puts you in canvas mode, moving the prompt area to the left. On the bottom right, you get a specialized “Suggested edits” toolbox with options such as changing the reading level, adjusting the length, or adding emojis. However, since it only expands when you hover your mouse over it, it's easy to miss on larger screens, where it sits alone with no other elements around it.
On the top right, you can see changes, revert to the last version, and copy or share text.
Meanwhile, Claude’s “Write” button works differently. It offers suggestions like “Develop podcast scripts” or “Write compelling CTAs.” And where ChatGPT leaves a narrow prompt area on the left, Claude splits the screen in half – though both layouts can be adjusted to your taste.
Claude’s “canvas” writing mode, Artifacts (where content appears in a dedicated window alongside the conversation), doesn’t offer specialized tools like ChatGPT does. Your only options are to copy the text, save it as Markdown or PDF, or publish it, which makes the piece available to anyone on the web.
Moreover, Claude lets you customize the chatbot's response style. It can be Normal, Concise, Explanatory, or Formal. You can even create a version tailored to specific needs.
Coding mode
When using ChatGPT’s canvas for code, the tools on the bottom right change to code review, port to language (PHP, C++, Python, JavaScript, TypeScript, and Java), fix bugs, add logs, and add comments.
ChatGPT canvas for code before and after adding comments.
Just like with writing, clicking Claude’s code button gives suggestions like “Create technical specifications” or “Develop coding standards.” Again, it doesn’t have predetermined “tools” like ChatGPT, but it comes with a big plus – it lets you switch between the code and a preview of its implementation.
That being said, ChatGPT has its separate and comprehensive Codex mode, a software engineering agent, which can be connected to GitHub and used to work with your code.
Claude’s other buttons
Claude also has “Learn” and “Life stuff” buttons that offer similar suggestions but don’t make much sense as they don’t act as new “modes”—you can click the suggestions, and that’s basically it.
Claude’s Research mode
Claude also has a beta Research mode that adds your own data to its web searches. According to Anthropic, this mode makes Claude operate as an agent, conducting multiple searches that build on each other and deciding what to research next. The end result comes with easy-to-check citations.
We tested Opus 4 in regular and “research” modes to see if there was any noticeable difference. For that, we used a simple but specific prompt: Find the five most common cyber attack types on organizations in the UK in 2025.
The difference was already visible in the number of sources scanned. The regular search used 20 sources and produced this top five: phishing, impersonation, ransomware, DDoS, and supply chain attacks. However, it didn’t say how many organizations experienced impersonation attacks.
Research mode took over 10 minutes and checked over 220 sources. It ranked the top five by the frequency of each attack type, and the list differed from the regular answer: phishing, business email compromise, malware infections, account takeovers, and ransomware. It also prepared a detailed report with sources.
Overall, we see ChatGPT as the winner in ease of use. It has more extra features you can integrate, offers specialized tools to improve your writing and code, and has a vast set of custom GPTs that can ease your work. In other words, with OpenAI you get a whole ecosystem of AI tools, while Anthropic is just starting to build around the base service.
Extra features
This one is a no-brainer—ChatGPT offers way more additional features than Claude. It all starts with image and video generation and continues with voice input and output.
Furthermore, it has plenty of custom GPTs for specific uses, such as Python coding or grammar checking.
ChatGPT also supports 37 languages, while Claude’s language support is still in beta, with 11 options at the moment. You can access Claude in 176 countries, while ChatGPT is available in 185, although the latter list was last updated a year ago.
When it comes to security, ChatGPT offers multi-factor authentication (MFA). With Claude, you get it only indirectly, by logging in with an account that supports MFA, such as Gmail.
Integration and customization
When it comes to integration, Claude offers more opportunities than ChatGPT. It supports adding popular Google apps (Drive, Gmail, Calendar), GitHub, and allows custom integrations.
Meanwhile, ChatGPT gives you Google Drive, Microsoft OneDrive, and GitHub (beta).
However, ChatGPT is a clear winner in the customization field. You can create custom GPTs or use those that are already available. Among them, you can find SciSpace for searching 278M+ research papers, a writing coach, a brand logo designer, and more.
ChatGPT lets you personalize the model with custom instructions, where you can give it traits and tell it what it should know about you.
You can also choose if it saves anything about you in its memory and uses that information to answer future requests.
On the other hand, Claude offers more options related to text output. As previously mentioned, you can choose the answer to be normal, concise, explanatory, or formal. Also, it lets you create and edit your own styles.
Claude also lets you select your work function to better suit your needs and offers a beta feature for storing your preferences. There’s also a feature preview of an Analysis tool that can write and run code to process information, run analyses, and produce data visualizations in real time.
Overall, ChatGPT wins in terms of integration and customization. While Claude is better at integration, ChatGPT’s customization options give way more flexibility for different users.
Creative writing
Claude excels in weaving words, at least by its reputation. After testing, this turned out to be true, although the difference was not that big. In fact, o3 managed to outscore Opus 4, but Sonnet 4 saved the day for Anthropic. You can find the input and output for this and other tests in this PDF document.
Writing a funny story about saving a princess turned out to be one of the most complex tasks for the contestants. The main criteria were the quality of the humour (yes, that’s pretty subjective), the logic of the story, and an ending with a strong punchline.
Sonnet 4’s story was more elegantly written, had pretty good jokes, and contained just one logical inconsistency, while o3 also had only one but lacked a proper punchline. We also noticed that Sonnet 4 delivered a considerably shorter story than the rest of the models – 422 words when the requirement was to aim for around 500.
Meanwhile, Opus 4’s and 4.1’s prose had more than one logic issue, though the former’s jokes were a bit better.
Find out more about the task and the results in the test section for creative writing below.
Coding
At first, it seemed that ChatGPT 4.1 did the best job in our coding task, which was to create a simple e-shop. It was the only model to use React instead of plain HTML, CSS, and JavaScript.
However, when asked to ditch React, 4.1 delivered the worst result, rated just 78/100 by Gemini on Google AI Studio.
In the end, Sonnet 4 was the winner both design-wise and code-wise. The table below summarizes the results.
Find out more about the task and the results in the test section for coding below.
Disclaimer: the author of this article is not a developer, which is why the code was evaluated with Gemini. Also, we tested only the front-end part and will be adding a more in-depth comparison in the future.
| | ChatGPT 4.1 | ChatGPT o3 | Sonnet 4 | Opus 4 |
|---|---|---|---|---|
| Code | React: 91/100; HTML: 78/100; average: 85/100 (3rd) | HTML: 88/100 (2nd) | HTML: 90/100 (1st) | HTML: 85/100 (3rd) |
| UI/UX | 2nd | 4th | 1st | 3rd |
| Total | 2nd | 3rd | 1st | 3rd |
High processing functions
For our high processing functions test, we asked ChatGPT and Claude to create a budget for a stag party in LA on a tight budget and explain the reasoning.
Most of the plans were pretty similar, offering pizzas, hamburgers, 60 to 90 cans of beer, up to 2.5 liters of spirits, and some fun activities.
However, Claude’s suggestions assumed that the group would have a backyard with a grill and a table big enough for a 10-player poker game, which is already above the recommended nine-player limit.
ChatGPT o3 offered mini-golf, which is also not the best option for a large group. Therefore, 4.1, with its Costco pizza and bowling, won the laurels.
Find out more about the task and the results in the test section for high processing functions below.
| | ChatGPT 4.1 | ChatGPT o3 | Sonnet 4 | Opus 4 |
|---|---|---|---|---|
| Place | 1st | 2nd | 3rd | 4th |
Problem solving and analysing
In the problem solving and analysing test, we asked our AI assistants to troubleshoot a Windows laptop that became sluggish for no apparent reason.
Here, ChatGPT o3 was the clear winner. Not only did it offer the most possible causes (9), it did that in a very structured way, ending its answer with an easy-to-follow decision tree. It also added extra tips on what to do if even reinstalling Windows doesn’t help.
Probably the best thing about the o3 answer was that it used web search, without being told to, and found possible recent causes, such as an issue with the Windows Startup Boost released in May 2025.
4.1 did pretty well too, with seven possible causes, but they were pretty general and briefly described.
Sonnet 4’s answer was similar to 4.1’s – it gave more possible causes (nine) but failed to describe how to troubleshoot further with the Linux USB it recommended.
Lastly, Opus 4 gave seven causes and six tips on what to try next. However, “disabling visual effects” is not a serious suggestion for a mid-tier laptop struggling with performance. It also didn’t say what to do if a Windows reinstall doesn’t help.
Find out more about the task and the results in the test section for problem solving and analysing below.
Explanation
In this test, we asked the models to explain string theory in layman’s terms and provide simple analogies where possible. Once more, o3 turned out to be the winner. It offered a well-structured and easy-to-understand explanation that covered not only “strings” but also other concepts, such as branes and supersymmetry.
Sonnet 4 gave the most in-depth explanation but covered fewer elements of the theory.
Finally, Opus 4 and ChatGPT 4.1 shared the 3rd spot because each left some key points poorly explained or without a proper analogy.
Find out more about the task and the results in the test section for explanation below.
| | ChatGPT 4.1 | ChatGPT o3 | Sonnet 4 | Opus 4 |
|---|---|---|---|---|
| Place | 3rd | 1st | 2nd | 3rd |
Image recognition and description
To determine the champion of image analysis, we showed the models an image of a phonograph generated with Visual Electric.
All four AI assistants had no problems with this task. The answers were clear and based on facts without notable differences. They also correctly recognized the image as generated by an AI.
However, we decided to give the gold medal to o3 for a slightly more complete image analysis.
Find out more about the task and the results in the test section for image recognition and description below.
| | ChatGPT 4.1 | ChatGPT o3 | Sonnet 4 | Opus 4 |
|---|---|---|---|---|
| Place | 2nd | 1st | 2nd | 2nd |
Image generation
ChatGPT is the automatic winner in this category since Claude has yet to offer this feature. The result was good, although the AI missed one key detail in our prompt – a golden tooth. Find out more about the task and the results in the test section for image generation below.
Image generation prompts
Even though Anthropic’s models cannot create images, one way to compare Claude vs ChatGPT in this area was to see how good their prompts for an image-generating AI could be. We described what we wanted and fed it to our four chatbots.
We didn’t subtract many points if the image generator (we used Visual Electric) didn’t do its best, as the point here was to get the best prompt, not the best image, which can depend a lot on the AI you’re using.
Here, o3 shined, providing a much more detailed description than the other three. It even included technical details of the camera that the AI should simulate, such as an 85mm lens. Additionally, it offered style keywords and negative prompts, which can help a lot when using a proper tool.
4.1 used the most simplistic language, which may be why the resulting image was the closest to what we wanted. On the other hand, it was the first iteration, so the AI might have tried to “upgrade” what it had already done when we tested the other three models.
Sonnet 4 and Opus 4 used more elaborate explanations than 4.1, but they were also relatively brief. However, Opus 4 gave a better style description, finishing just ahead of its less powerful brother-model (or model-brother).
| | ChatGPT 4.1 | ChatGPT o3 | Sonnet 4 | Opus 4 |
|---|---|---|---|---|
| Place | 4th | 1st | 3rd | 2nd |
Video generation
ChatGPT automatically won this because Claude doesn’t have a video generation capability yet. To be fair, free ChatGPT doesn’t have it either (check out the ChatGPT free vs paid article for a detailed version comparison).
We gave Sora, OpenAI’s video creation tool, a pretty simple prompt to create a scene from an underground dance club, and it did a good job. While the people in the video didn’t look very realistic, we got all the requested details save for the “DJ nodding to the rhythm.”
Compared to Sora, Google Gemini has a way better video generator in Veo 3, so we’ll probably use it in the future.
Find out more about the task and the results in the test section for video generation below.
Video generation prompts
Just like in the image test, we checked how well all four models can write a video generation prompt. Then, we asked Sora to create a 5-second video of a girl riding a bicycle in the 1920s based on each model’s prompt.
The quality of the prompt was the main criterion, so even when the video turned out to be a little (or pretty) weird, we didn’t subtract many points.
ChatGPT 4.1 produced the best-looking result, although it took some liberties and decided to add elements we didn’t ask for.
To our surprise, the o3 video was the worst, even though the prompt was very detailed, both in technical and atmosphere description.
Sonnet 4 gave the second-best video, and the prompt was shorter but more detailed than that of 4.1.
Finally, Opus 4 gave probably the most detailed prompt that should have provided the best result if the video generation AI could follow each request precisely. Unfortunately, Sora made one big (and a bit funny) mistake that ruined the overall good output.
| | ChatGPT 4.1 | ChatGPT o3 | Sonnet 4 | Opus 4 |
|---|---|---|---|---|
| Place | 4th | 2nd | 3rd | 1st |
Claude vs. ChatGPT: Which one performed better on our tests?
While there are standardized tests to measure how good each AI model is at different tasks, we believe that testing with real cases shows a clearer picture.
For that, we ran eight different tests, namely:
1. Creative writing
2. Coding
3. High processing functions
4. Problem solving and analysing
5. Explanation
6. Image generation and prompt writing
7. Image recognition and description
8. Video generation and prompt writing
The main criteria for evaluating these AI chatbots were factual accuracy and the absence of logical mistakes, bugs, or hallucinations. For the writing part, we also tested the ability to crack a joke and the level of literary language. Finally, for the image and video generation, we looked for a result that’s as close to the prompt as possible.
Below are the results of each test, along with prompts and detailed explanations. You can also see all the input and output in this PDF document.