How well do AI tools write code? Over the past year or so, I’ve been putting large language models through a series of tests to see how well they handle some fairly basic programming challenges.
The idea is simple: if they can’t handle these basic challenges, it’s probably not worth asking them to do anything more complex. On the other hand, if they can handle these basic challenges, they might become helpful assistants to programmers looking to save some time.
To set this benchmark, I’ve been using three tests (and just added a fourth). They are:
- Writing a WordPress plugin: This tests basic web development using the PHP programming language inside WordPress. It also requires a bit of user-interface building. If an AI chatbot passes this test, it can help create rudimentary code as an assistant to web developers. I originally documented this test in “I asked ChatGPT to write a WordPress plugin I needed. It did it in less than 5 minutes.”
- Rewriting a string function: This test evaluates how an AI chatbot updates a utility function for better functionality. If an AI chatbot passes this test, it might be able to help create tools for programmers. If it fails, first-year programming students can probably do a better job. I originally documented this test in “OK, so ChatGPT just debugged my code. For real.”
- Finding an annoying bug: This test requires intimate knowledge of how WordPress works because the obvious answer is wrong. If an AI chatbot can answer this correctly, then its knowledge base is pretty complete, even with frameworks like WordPress. I originally documented this test in “OK, so ChatGPT just debugged my code. For real.”
- Writing a script: This test asks an AI chatbot to program using two fairly specialized programming tools not known to many users. It essentially tests the AI chatbot’s knowledge beyond the big languages. I originally documented this test in “Google unveils Gemini Code Assist and I’m cautiously optimistic it will help programmers.”
I’m going to take you through each test and compare the results to those of the other AI chatbots that I’ve tested. That way, you’ll be better able to gauge how AI chatbots differ when it comes to coding performance.
This time, I’m putting Meta’s new Meta AI to the test. Let’s get started.
1. Writing a WordPress plugin
Here’s the Meta AI-generated interface on the left, compared to the ChatGPT-generated interface on the right:
Both AI chatbots generated the fields required, but ChatGPT’s presentation was cleaner, and it included headings for each of the fields. ChatGPT also placed the Randomize button in a more appropriate location given the functionality.
In terms of operation, ChatGPT took in a set of names and produced randomized results, as expected. Unfortunately, Meta AI took in a set of names, flashed something, and then presented a white screen. This is commonly described in the WordPress world as “The White Screen of Death.”
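For reference, the core behavior the plugin was asked to implement — take a list of names, one per line, and return them in randomized order — is only a few lines of logic. This is an illustrative Python sketch of that logic, not the actual PHP either chatbot generated:

```python
import random

def randomize_names(raw_input: str) -> list[str]:
    """Split a newline-separated block of names, drop blank lines,
    and return the names in random order."""
    names = [line.strip() for line in raw_input.splitlines() if line.strip()]
    shuffled = names[:]        # copy, so the caller's order is preserved
    random.shuffle(shuffled)   # in-place shuffle of the copy
    return shuffled

print(randomize_names("Alice\nBob\nCarol\n"))  # order varies run to run
```

The point of the test isn't the shuffle itself — it's whether the chatbot can wire this trivial logic into a working WordPress plugin with a usable interface.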
Here are the aggregate results of this and previous tests:
- Meta AI: Interface: adequate, functionality: fail
- Meta Code Llama: Complete failure
- Google Gemini Advanced: Interface: good, functionality: fail
- ChatGPT: Interface: good, functionality: good
2. Rewriting a string function
This test evaluates dollars-and-cents conversions. Meta AI’s attempt had four main problems: it changed values that were already correct, didn’t properly check for numbers with multiple decimal points, failed completely on amounts with fewer than two decimal places (in other words, it would choke on $5 or $5.2 as inputs), and rejected correct numbers after processing because it had formatted them incorrectly.
This is a fairly simple assignment and one that most first-year computer science students should be able to complete. It’s disappointing that Meta AI failed, especially since Meta’s Code Llama succeeded with the same test.
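To make those failure modes concrete, here is a hedged sketch of the kind of validation such a routine needs. It's written in Python rather than the original utility's language, and the exact behavior is my assumption based on the problems described above: accept $5 and $5.2, reject multiple decimal points, and leave already-correct values alone.

```python
def normalize_dollars(text: str):
    """Normalize a dollar string to exactly two decimal places.
    Returns the normalized string, or None if the input is invalid."""
    value = text.strip().lstrip("$")
    if not value:
        return None
    parts = value.split(".")
    if len(parts) > 2:                 # e.g. "5.2.3": multiple decimal points
        return None
    dollars = parts[0] or "0"
    cents = parts[1] if len(parts) == 2 else ""
    if not dollars.isdigit() or (cents and not cents.isdigit()):
        return None
    cents = (cents + "00")[:2]         # pad "$5" -> "00", "$5.2" -> "20"
    return f"${dollars}.{cents}"
```

With this approach, `normalize_dollars("$5")` yields `"$5.00"` and `"$5.2"` yields `"$5.20"`, while `"5.2.3"` is rejected — exactly the cases Meta AI's version got wrong. (This sketch truncates rather than rounds extra decimal digits; a production version would have to decide which it wants.)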
Here are the aggregate results of this and previous tests:
- Meta AI: Failed
- Meta Code Llama: Succeeded
- Google Gemini Advanced: Failed
- ChatGPT: Succeeded
3. Finding an annoying bug
This isn’t a programming assignment. This test takes in some pre-existing chunks of code, along with error data and a problem description. It then asks the AI chatbot to figure out what’s wrong with the code and recommend a fix.
The challenge here is that there is an obvious answer, which is wrong. The problem requires some deep knowledge of how the WordPress API works, as well as an understanding of the interplay between the various components of the program being written.
Meta AI passed this one with flying colors. Not only did it identify the error correctly, it even made a suggestion that, while not necessary, improved the efficiency of the code.
After failing so miserably on rewriting a simple string function, I did not expect Meta AI to succeed on a substantially more challenging problem. This goes to show that AI chatbots are not necessarily consistent in their responses.
Here are the aggregate results of this and previous tests:
- Meta AI: Succeeded
- Meta Code Llama: Failed
- Google Gemini Advanced: Failed
- ChatGPT: Succeeded
4. Writing a script
This test requires coding knowledge of the macOS scripting tool Keyboard Maestro, Apple’s scripting language AppleScript, and Chrome scripting behavior.
Keyboard Maestro is an amazingly powerful tool (it’s one of the reasons I use Macs as my primary work machines), but it’s also a fairly obscure product written by a lone programmer in Australia. If an AI chatbot can code using this tool, chances are it has decent coding knowledge across languages. AppleScript, too, is fairly obscure.
Both Meta AI and Meta’s Code Llama failed in exactly the same way: they did not retrieve data from Keyboard Maestro as instructed. Neither seemed to know about the tool at all. By contrast, both Gemini and ChatGPT knew it was a separate tool, and retrieved the data correctly.
Here are the aggregate results of this and previous tests:
- Meta AI: Failed
- Meta Code Llama: Failed
- Google Gemini Advanced: Succeeded
- ChatGPT: Succeeded
Overall results
Here are the overall results of the four tests:
I have used ChatGPT to help with coding projects for about six months now. Nothing in these results has convinced me to switch to a different AI chatbot. In fact, if I relied on any of the others, I’d be concerned that I might spend more time checking for and fixing errors than getting the work done.
I’m disappointed with the other large language models. My tests show that ChatGPT is still the undisputed coding champion, at least for now.
Have you tried coding with Meta AI, Gemini, or ChatGPT? What has your experience been? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.