Anthropic Updates Evaluation
Update (2024.12.26)
It is time to properly recognize the rise of domestic AI: DeepSeek v3 is not to be underestimated. I underestimated DeepSeek.
Update (2024.10.27)
The Claude 3.5 Sonnet(New) Chinese comments issue mentioned in this article was actually due to my prompts not being good enough.
Ask Claude to write in detail and it writes well. Meanwhile, if the code given is too long and the prompt isn’t on point, ChatGPT o1-mini also writes brief comments. Looks like I need to focus on Prompt Engineering.

Found out last night before bed that Anthropic had shipped an update. Couldn’t sleep around 4 AM, so I tried the new Claude 3.5 Sonnet and found it quite impressive. The Kun Kun question that it couldn’t answer accurately before was solved perfectly, without internet access.
Note: This afternoon I tested multiple times and found Claude 3.5 Sonnet(New) still has hallucinations for this question—there’s some probability of getting it right, but mostly hallucinations. Hope it doesn’t dumb down like GPT 4o 🙃!
Q: Who wears suspenders, has hundreds of millions of fans, and is the Chinese “basketball superstar”?

You can ask domestic AI models this question. Many of the ones that get it right need internet access to do so; without it they fail.
I found that currently (as of October 23, 2024) most domestic AI products integrate search automatically. Even where search is not integrated, might there be people editing responses in real time, like Google does?
A week ago, most domestic AI answered this question incorrectly. Some cases were ridiculous: Kimi Explorer, when first released, searched many webpages and still answered wrong. I tried again today and it answered correctly right away. I suspect there are people reviewing user questions periodically and optimizing the response logic for specific questions to improve response quality.
Kimi Explorer performance on October 12, 2024:

Kimi Explorer performance on October 23, 2024:

DeepSeek is also quite ridiculous—this improvement is hard to attribute to anything other than dedicated question reviewers optimizing response logic. Or is domestic AI continuously improving in silence? Can’t figure it out.

Baidu’s Wenxin Yiyan (ERNIE Bot), with its large domestic user base, on October 23, 2024—even the 3.5 model could answer this correctly.

A week ago, October 12, 2024, even the 4.0 model couldn’t answer correctly.

Are some domestic AI this good now? I’m skeptical—personally think there must be dedicated reviewers making editorial responses to user questions. Or maybe I’m being ignorant and not properly recognizing the rise of domestic AI?
Computer Use
Followed the official computer-use-demo instructions.
Then enter localhost:8080 in the browser address bar to access the demo app.
The four host-container port mappings explained by Claude 3.5 Sonnet(New):
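As a rough text version of those four mappings, here is a minimal Python sketch. The port roles and the image name follow the public computer-use-demo README as commonly documented; they are assumptions here, not something stated in this post, so check the README for the version you run. `build_docker_command` is a hypothetical helper that only assembles the command line and does not run Docker.

```python
import shlex

# Assumed roles of the four published ports (host port == container port).
PORT_ROLES = {
    5900: "raw VNC server, for a desktop VNC client",
    8501: "Streamlit chat interface only",
    6080: "noVNC desktop view in the browser",
    8080: "combined page: chat plus desktop view",
}

# Assumed image name; verify against the official quickstart.
IMAGE = "ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest"


def build_docker_command(image: str = IMAGE) -> list[str]:
    """Assemble a `docker run` argument list that publishes each port."""
    # `-e NAME` with no value forwards NAME from the host environment.
    cmd = ["docker", "run", "-e", "ANTHROPIC_API_KEY"]
    for port in PORT_ROLES:
        cmd += ["-p", f"{port}:{port}"]
    cmd += ["-it", image]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_docker_command()))
    for port, role in PORT_ROLES.items():
        print(f"localhost:{port} -> {role}")
```

Under these assumptions, localhost:8080 is the only address needed for normal use; the other three ports expose the individual pieces separately.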

Early morning, topped up $5 to Claude API, consumed about $4.5. Demo videos below.
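To get a feel for how a few demo sessions can use up most of a $5 top-up, here is a back-of-the-envelope sketch. The prices are the list prices for Claude 3.5 Sonnet at the time ($3 per million input tokens, $15 per million output tokens). The token counts in the example are purely hypothetical, picked for illustration, and were not measured from these demos.

```python
# List prices for Claude 3.5 Sonnet in USD per million tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate API cost in USD from token counts, rounded to cents."""
    total = (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000
    return round(total, 2)


if __name__ == "__main__":
    # Hypothetical totals across all sessions, not real measurements.
    print(estimate_cost_usd(input_tokens=1_400_000, output_tokens=20_000))
```

With numbers like these, input tokens dominate the bill, which is plausible for an agent loop that keeps sending screenshots and conversation history back to the model.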

Demo operations:
- Bilibili: Claude Computer Use Demo Collection
- YouTube: Claude Computer Use Demos
- Search for Elon Musk’s latest X updates.
Final error: “This action is temporarily not available at this time due to enhanced protections for the beta release.”
Probably this happened because Musk’s latest updates were about the US election; AI companies in the US generally avoid domestic political topics.
- Create a text file with Anthropic company’s development history.
- Play Gomoku against advanced computer using Claude Computer Use.
Initially Claude got the black/white piece ownership wrong, then corrected itself.
Gomoku is a complex task, and this Gomoku website’s algorithm is really strong: I played 3 games by hand before and lost all of them, so a first-gen Claude Computer Use has even less of a chance.
If someday Claude Computer Use can actually beat this website’s computer player, Claude will have evolved into something like AlphaGo, and AGI would be within reach.
- Using Claude Computer Use to operate calculator.
The calculator app’s final result was wrong. Claude probably cheated by doing the calculation with its own capabilities 😂.
Also, Streamlit’s UI renders text as markdown by default, so some operators were rendered as markdown formatting. Claude isn’t affected, since it processes the raw text input.
- Draw heart pattern using xpaint.
This should also be a complex task. Even humans have trouble drawing a full heart, but the general shape is there.
Claude Computer Use currently operates apps basically by clicking and lacks the human ability to hold the button down and draw, so the heart drawing failed. Looking forward to future iterations.
- Download arXiv PDF and summarize paper content.
Funny part: Claude thought June 2024 hadn’t arrived yet, since Claude 3.5 Sonnet(New)’s knowledge cutoff is April 2024. Claude guessed a URL it believed was correct, and the paper was not found at that URL. But Claude still gave a general content summary, probably because it got the paper abstract on the first access.
- Search YouTube for Curry’s 2016 incredible game-winning three-pointer and play.
Won’t upload this to Bilibili or YouTube—NBA copyright issues. Details on OneDrive: 7.mp4
- Install btop and use btop to check system resource usage.
- Try commenting on Ruan Yifeng’s blog latest post.
The search keywords were entered wrong, but Claude 3.5 Sonnet(New) recalled Teacher Ruan’s blog URL directly. His articles have probably already become part of its training data. No public blog on the internet can escape becoming LLM training material 😑.
The form fields were filled in wrong, and the task triggered Claude 3.5 Sonnet(New)’s guidelines. Maybe this is jailbreakable: just convince the AI it’s operating on a local page so it comments freely. Or maybe Anthropic’s safety measures are solid and hard to jailbreak.
This limitation is described in the computer use section of the official docs.
- Pass Cloudflare Turnstile captcha [Testing environment].
Claude 3.5 Sonnet(New) initially refused to deal with the captcha, but telling it this was just a test got it to click 😁.
- Close webpage ads.
Originally I wanted Claude 3.5 Sonnet(New) to directly click itdog.cn’s “Close all ads” button, but it first went off to install the uBlock Origin ad blocker browser extension. After my second prompt, it clicked itdog.cn’s “Close all ads” button.
I saw some folks using Claude Computer Use for adult content—I won’t test that. Interested folks can try and see how creative it gets.
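
On the calculator demo above, where Streamlit rendered some operators as markdown: one possible workaround is to escape markdown punctuation before displaying the text. The sketch below is a generic escaping helper, not code from the demo app; `escape_markdown` and its character set are assumptions chosen for illustration.

```python
import re

# Punctuation that markdown renderers commonly treat as formatting.
_MD_SPECIAL = re.compile(r"([\\`*_{}\[\]()#+\-.!|~>])")


def escape_markdown(text: str) -> str:
    """Prefix each markdown-special character with a backslash."""
    return _MD_SPECIAL.sub(r"\\\1", text)


if __name__ == "__main__":
    # Without escaping, the two asterisks could be read as emphasis markers.
    print(escape_markdown("2*3*4"))
```

In a Streamlit app, the escaped string could be passed to `st.markdown`, or the raw string could be shown with `st.text`, which displays plain text without markdown rendering.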

Claude 3.5 Sonnet(New)
Definitely stronger than before. Previously Claude 3.5 Sonnet couldn’t place the graduation cap in the center of the triangle—this time after two rounds of dialogue, successfully positioned the cap.

Also previously I found Claude 3.5 Sonnet was lazy with Chinese comments.
This was Claude 3.5 Sonnet’s work from a few days ago:

This was Claude 3.5 Sonnet(New)’s rewrite today:

Actually hard to compare—anyway, let ChatGPT o1-mini write Chinese comments. Look how detailed it writes—Claude 3.5 Sonnet series should learn from ChatGPT here.

I trust the assessment of Claude 3.5 Sonnet(New)’s coding ability on the LiveBench leaderboard and the aider code ability leaderboard. Currently Claude 3.5 Sonnet(New) is far ahead in coding.


Other
Seeing various online comments, Claude 3.5 Opus probably won’t exist. Maybe next March we’ll jump straight to Claude 4 series (Haiku, Sonnet, Opus)—looking forward to Claude 4 Opus improving in text polishing.


Some netizen evaluations:
- Claude is not only the most capable LLM, but also has the best personality.

- Is this going to take testers’ jobs?
- OpenAI wants your grandma to use AI, Anthropic wants dev teams to use AI. Their purposes are starting to differ.

- Interestingly, OpenAI is doubling down on end-user features like voice mode, while Anthropic is doubling down on engineer/API-centric features like code generation quality and GUI remote control.

- Today Anthropic buried OpenAI.

Summary
These were my thoughts after initial Claude 3.5 Sonnet(New) experience around 4 AM. For friends who haven’t subscribed to Claude membership—recommend you subscribe.

Subscriptions account for only 15% of Anthropic’s revenue. With a model this strong, subscribe soon. Between the two, prioritize Claude membership.

The LLM crown has probably changed hands. Claude 3.5 Sonnet—my god! Wonder what surprises OpenAI will bring for ChatGPT’s second anniversary next month? GPT-5? Or o1-full [with all current GPT-4o capabilities]? Or continue quietly releasing inconsequential articles, relying on first-mover advantage to burn through loyal users’ enthusiasm? Looking forward to late November 2024!
Adding a photo of the 5:20 AM campus! Last semester, OpenAI’s spring release kept me awake. This semester, Anthropic’s silent Claude 3.5 Sonnet upgrade got me excited.

Document Info
- License: Free to share - Non-commercial - No derivatives - Attribution required (CC BY-NC-ND 4.0)