The chatbots are out in force, but which is better and for what task? We’ve compared Google’s Bard, Microsoft’s Bing, and OpenAI’s ChatGPT models with a range of questions spanning common requests from holiday tips to gaming advice to mortgage calculations.
Naturally, this is far from an exhaustive rundown of these systems’ capabilities (AI language models are, in part, defined by their unknown skills — a quality dubbed “capability overhang” in the AI community) but it does give you some idea about these systems’ relative strengths and weaknesses.
You can (and indeed should) scroll through our questions, evaluations, and conclusion below, but to save you time and get to the punch quickly: ChatGPT is the most verbally dextrous, Bing is best for getting information from the web, and Bard is… doing its best. (It’s genuinely quite surprising how limited Google’s chatbot is compared to the other two.)
Some programming notes before we begin, though. First: we were using OpenAI’s latest model, GPT-4, on ChatGPT. This is also the AI model that powers Bing, but the two systems give quite different answers. Most notably, Bing has other abilities: it can generate images and can access the web and offers sources for its responses (which is a super important attribute for certain queries). However, as we were finishing up this story, OpenAI announced it’s launching plug-ins for ChatGPT that will allow the chatbot to also access real-time data from the internet. This will hugely expand the system’s capabilities and give it functionality much more like Bing’s. But this feature is only available to a small subset of users right now so we were unable to test it. When we can, we will.
It’s also important to remember that AI language models are … fuzzy, in more ways than one. They are not deterministic systems, like regular software, but probabilistic, generating replies based on statistical regularities in their training data. That means that if you ask them the same question you won’t always get the same answer. It also means that how you word a question can affect the reply, and for some of these queries we asked follow-ups to get better responses.
Anyway, all that aside, let’s start with seeing how the chatbots fare in what should be their natural territory: gaming.
(Each image gallery contains responses from Bard, Bing, and ChatGPT — in that order.)
I spent an embarrassing amount of time learning to beat Elden Ring’s hardest boss last year, and I wouldn’t pick a single one of these responses over the average Reddit thread or human strategy guide. If you’ve gotten to Malenia’s fight, you’ve probably put 80 to 100 hours into the game — you’re not looking for general tips. You want specifics about Elden Ring’s dizzying list of weapons or counters for Malenia’s unique moves, and that would probably take some follow-up questions to get from any of these engines if they offer them at all.
Bing is the winner here, but mainly because it picks one accurate hint (Malenia is vulnerable to bleed damage) and repeats it like Garth Marenghi doing a book reading. To its credit, it’s also the only engine to reference Malenia’s unique healing ability, although it doesn’t explain how it works — which is an important key to beating her.
Bard is the only one to offer any help with Malenia’s hellish Waterfowl Dance move (although I don’t think it’s the strongest strategy) or advice for using a specific item (Bloodhound’s Step, although it doesn’t mention why it’s useful or whether the advice still applies after the item’s mid-2022 nerf). But its intro feels off. Malenia is almost entirely a melee fighter, not somebody with lots of ranged attacks, for instance, and she’s not “very unpredictable” at all, just really hard to dodge and wear down. The summary reads more like a generic description of a video game boss than a description of a particular fight.
ChatGPT (GPT-4) is the clear loser, which is not a surprise considering its training data mostly stops in 2021 and Elden Ring came out the next year. Its directive to “block her counterattacks” is the precise opposite of what you should do, and its whole list has the vibe of a kid who got called on in English class and didn’t read the book, which it basically is. I’m not hugely impressed with any of these — but I judge this in particular a foul note.
Cake recipes offer room for creativity. Shift around the ratio of flour to water to oil to butter to sugar to eggs, and you’ll get a slightly different version of your cake: maybe drier, or moister, or fluffier. So when it comes to chatbots, it’s not necessarily a bad thing if they want to combine different recipes to achieve a desired effect — even though, for me, I’d much rather bake something that an author has tested and perfected.
ChatGPT is the only one that nails this requirement for me. It chose a chocolate cake recipe from one site, a buttercream recipe from another, shared the link for one of the two, and reproduced both of their ingredients correctly. It even added some helpful instructions, like suggesting the use of parchment paper and offering some (slightly rough) tips on how to assemble the cake’s layers, neither of which were found in the original sources. This is a recipe bot I can trust!
Bing gets in the ballpark but misses in some strange ways. It cites a specific recipe but then changes some of the quantities for important ingredients like flour, although only by a small margin. For the buttercream, it halves the amount of sugar the recipe calls for. Having made buttercream recently, I think this is probably a good edit! But it’s not what the author called for.
Bard, meanwhile, screws up a bunch of quantities in small but salvageable ways and understates its cake’s bake time. The bigger problem is it makes some changes that meaningfully affect flavor: it swaps buttermilk for milk and coffee for water. Later on, it fails to include milk or heavy cream in its buttercream recipe, so the frosting is going to end up far too thick. The buttercream recipe also seems to have come from an entirely different source than the one it cited.
If you follow ChatGPT or Bing, I think you’d end up with a decent cake. But right now, it’s a bad idea to ask Bard for a hand in the kitchen.
All three systems offer some solid advice here, but none of it is comprehensive enough.
Most modern PCs run RAM best in dual-channel mode, which means the sticks have to be seated in the correct slots to get peak performance. Otherwise, you’ve spent a lot of cash on fancy new DDR5 RAM that won’t run at its best because the two sticks went into adjacent slots. The instructions should definitely guide people to their motherboard manual to ensure RAM is installed optimally.
ChatGPT does pick up on a key part of the RAM install process — checking your system BIOS afterward — but it doesn’t go through another all-important BIOS step. If you’ve picked up some Intel XMP-compatible RAM, you’ll typically need to enable this in the BIOS settings afterward, and likewise for AMD’s equivalent. Otherwise, you’re not running your RAM at the most optimized timings to get the best performance.
Overall, the advice is solid but still very basic. It’s better than some PC building guides, ahem, but I’d like to have seen the BIOS changes or dual-channel parts picked up properly.
If AI chatbots aren’t factually reliable (and they’re not), then they’re at least supposed to be creative. This task — writing a poem about a worm in anapestic tetrameter, a very specific and satisfyingly arcane poetic meter — is a challenging one, but ChatGPT was the clear winner, followed by a distant grouping of Bing then Bard.
None of the systems were able to reproduce the required meter (anapestic tetrameter requires that each line of poetry contains four units of three syllables in the pattern unstressed / unstressed / stressed, as heard in both ‘Twas the night before Christmas and Eminem’s “The Way I Am”) but ChatGPT gets closest while Bard’s scansion is worst. All three supply relevant content, but again, ChatGPT’s is far and away the best, with evocative description (“A small world unseen, where it feasts and plays”) compared to Bard’s dull commentary (“The worm is a simple creature / but it plays an important role”).
After running a few more poetry tests, I also asked the bots to answer questions about passages taken from fiction (mostly Iain M. Banks books, as those were the nearest ebooks I had to hand). Again, ChatGPT/GPT-4 was the best, able to parse all sorts of nuances in the text and make human-like inferences about what was being described, with Bard making very general and unspecific comments (though often identifying the source text too, which is a nice bonus). Clearly, ChatGPT is the superior system if you want verbal reasoning.
It’s one of the great ironies of AI that large language models are some of our most complex computer programs to date and yet are surprisingly bad at math. Really. When it comes to calculations, don’t trust a chatbot to get things right.
In the example above, I asked what a 20 percent increase of 2,230 was, dressing the question up in a bit of narrative framing. The correct answer is 2,676, but Bard managed to get it wrong (out by 10) while Bing and ChatGPT got it right. In other tests I asked the systems to multiply and divide large numbers (mixed results, but again, Bard was the worst) and then, for a more complicated calculation, asked each chatbot to determine the monthly repayments and total repayment for a mortgage of $125,000 repaid over 25 years at 3.9 percent interest. None offered the answer supplied by several online mortgage calculators, and Bard and Bing gave different results when queried multiple times. GPT-4 was at least consistent, but failed the task because it insisted on explaining its methodology (good!) and then was so long-winded it ran out of space to answer (bad!).
This is not surprising. Chatbots are trained on vast amounts of text, and so don’t have hard-coded rules for performing mathematical calculations, only statistical regularities in their training data. This means when confronted with unusual sums, they often get things wrong. It’s something that these systems can certainly compensate for in many ways, though. Bing, for example, booted me to a mortgage calculator site when I asked about mortgages, and ChatGPT’s forthcoming plugins include a Wolfram Alpha option which should be fantastic for all sorts of complicated sums. But in the meantime, don’t trust a language model to do a math model’s work. Just grab a calculator.
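For reference, the mortgage figure in that test is easy to verify with a few lines of Python using the standard fixed-rate amortization formula. This is a sketch of the conventional calculation, not a reproduction of any particular online calculator — real lenders and calculators vary in rounding conventions, compounding assumptions, and fees:

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed monthly payment for a standard amortized loan:
    M = P * r / (1 - (1 + r) ** -n), where r is the monthly rate
    and n is the total number of monthly payments."""
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # number of monthly payments
    return principal * r / (1 - (1 + r) ** -n)

# The mortgage from the test: $125,000 over 25 years at 3.9 percent.
payment = monthly_payment(125_000, 0.039, 25)
total = payment * 25 * 12
print(f"Monthly: ${payment:,.2f}")  # roughly $653 a month
print(f"Total:   ${total:,.2f}")    # roughly $196,000 repaid overall
```

Any chatbot answer that strays far from those figures has simply gotten the arithmetic wrong.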
I’ve gotten really interested in interrogating chatbots on where they get their information and how they choose what information to present us with. And when it comes to salary data, we can see the bots taking three very different approaches: one cites its way through multiple sources, one generalizes its findings, and the other just makes everything up. (For the record, Bing’s cited sources include Zippia, CareerExplorer, and Glassdoor.)
In a lot of ways, I think ChatGPT’s answer is the best here. It’s broad and generic and doesn’t include any links. But its answer feels the most “human” — it gave me a ballpark figure, explained that there were caveats, and told me what sources I could check for more detailed numbers. I really like the simplicity and clarity of this.
There’s a lot to like about Bing’s answer, too. It gives specific numbers, cites its sources, and even gives links. This is a great, detailed answer — though there is one problem: Bing fudges the final two numbers it presents. Both are close to their actual totals, but for some reason, the bot just decided to change them up a bit. Not great.
Speaking of not great, let’s talk about pretty much every aspect of Bard’s answer. Was the median wage for plumbers in the US $52,590 in May 2020? Nope, that was in May 2017. Did a 2021 survey from the National Association of Plumbers and Pipefitters determine the average NYC salary was $76,810? Probably not because, as far as I can tell, that organization doesn’t exist. Did the New York State Department of Labor find the exact same number in its own survey? I can’t find it if the agency did. My guess: Bard took that number from CareerExplorer and then made up two different sources to attribute it to. (Bing, for what it’s worth, accurately cites CareerExplorer’s figure.)
To sum up: solid answers from Bing and ChatGPT and a bizarre series of errors from Bard.
In the race to make a marathon training plan, ChatGPT is the winner by many miles.
Bing barely bothered to make a recommendation, instead linking out to a Runner’s World article. This isn’t necessarily an irresponsible decision — I suspect that Runner’s World is an expert on marathon training plans! — but if I had just wanted a chatbot to tell me what to do, I would have been disappointed.
Bard’s plan was just confusing. It promised to lay out a three-month training plan but only listed specific training schedules for three weeks, despite saying later that the full plan “gradually increases your mileage over the course of three months.” The given schedules and some general tips provided near the end of its plan seemed good, but Bard didn’t quite go the distance.
ChatGPT, on the other hand, spelled out a full schedule, and the suggested runs looked to ramp up at a pace similar to what I’ve used for my own training. I think you could use its recommendations as a template. The main problem was that it didn’t know when to stop in its answers. Its first response was so detailed it ran out of space. Asking specifically for a “concise” plan got a shorter response that was still better than the others, though it doesn’t ramp down near the end like I have for previous marathons I’ve trained for.
That all being said, a chatbot isn’t going to know your current fitness level or any conditions that may affect your training. You’ll have to take your own health into account when preparing for a marathon, no matter what the plan is. But if you’re just looking for some kind of plan, ChatGPT’s suggestion isn’t a bad starting line.
Well, asking the chatbots to suggest places to visit in Rome was obviously a failure, because none of them picked my favorite gelateria or reminded me that if I’m in town and don’t pay a visit to some distant cousins, I’ll catch flak from the family when I get home.
Kidding aside, I’m no professional tour guide, but these suggestions from all three chatbots seem fine. They’re very broad, choosing whole neighborhoods or areas, but the initial question prompt was also fairly broad. Rome is a unique place because you can cover a lot of touristy things in the heart of the city on foot, but it’s busy as all hell and you constantly get hounded by grifters and scam artists at the tourist hotspots. Many of these suggestions from Bing, Bard, and ChatGPT are fine for getting away from those busiest areas. I even consulted some family members who have visited Italy more than I have, and they felt recommendations like Trastevere and EUR are places even actual locals go (though the latter is a business district, which some may find a little boring if they’re not into the history or the architecture).
The suggestions here aren’t exactly hole-in-the-wall locations where you’ll be the only ones around, but I see these as good starting points for building a slightly off-beat trip around Rome. Doing a basic Google search with the same prompt yields listicles from sites like TripAdvisor that talk about many of the same places with more context, but if you’re planning your trip from scratch I can see a chatbot giving you a good abridged starting point before you dive into deeper research ahead of a trip.
This test is inspired by Gary Marcus’ excellent work assessing the capabilities of language models, seeing if the bots can “follow a diamond” in a brief narrative that requires implied knowledge about how the world works. Essentially, it’s a game of three-card monte for AI.
The instructions given to each system read as follows:
“Read the following story:
‘I wake up and get dressed, putting on my favorite tuxedo and slipping my lucky diamond into the inside breast pocket, tucked inside a small envelope. As I walk to my job at the paperclip bending factory where I’m gainfully employed I accidentally tumble into an open manhole cover, and emerge, dripping and slimy with human effluence. Much irritated by this distraction, I traipse home to get changed, emptying all my tuxedo pockets onto my dresser, before putting on a new suit and taking my tux to a dry cleaners.’
Now answer the following question: where is the narrator’s diamond?”
ChatGPT was the only system to give the correct answer: the diamond is probably on the dresser, as it was placed inside the envelope inside the jacket, and the contents of the jacket were then decanted after the narrator’s accident. Bing and Bard just said the diamond was still in the tux.
Now, the results of tests like this are difficult to parse. This was not the only variation I tried, and Bard and Bing sometimes got the answer right, and ChatGPT occasionally got it wrong (and all models switched their answer when asked to try again). Do these results prove or disprove that these systems have some sort of reasoning capability? This is a question that people with decades of experience in computer science, cognition, and linguistics are currently tearing chunks out of each other trying to answer, so I won’t venture an opinion on that. But just in terms of comparing the systems, ChatGPT/GPT-4 is again the most accomplished.
As mentioned in the introduction, these tests reveal clear strengths for each system. If you’re looking to accomplish verbal tasks, whether creative writing or inductive reasoning, then try ChatGPT (and in particular, but not necessarily, GPT-4). If you’re looking for a chatbot to use as an interface with the web, to find sources and answer questions you might otherwise have turned to Google for, then head over to Bing. And if you are shorting Google’s stock and want to reassure yourself you’ve made the right choice, try Bard.
Really, though, any evaluation of these systems is going to be both partial and temporary, as it’s not only the models inside each chatbot that are constantly being updated, but the overlay that parses and redirects commands and instructions. And really, we’re only just probing the shallow end of these systems and their capabilities. (For a more thorough test of GPT-4, for example, I recommend this recent paper by Microsoft researchers. The conclusions in its abstract are questionable and controversial, but the tests it details are fascinating.) In other words, think of this as an ongoing conversation rather than a definitive test. And if in doubt, try these systems for yourself. You never know what you’ll find.