ChatGPT Passes a Wharton Business School Test and U.S. Medical Licensing Examination
A new chatbot from OpenAI seems to be exceeding expectations.
ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot launched by OpenAI this past November. The New York Times hailed it as “the best artificial intelligence chatbot ever released to the general public.”
It was built by OpenAI, the San Francisco A.I. company that is also responsible for tools like GPT-3 and DALL-E 2, the breakthrough image generator that came out this year.
Like those tools, ChatGPT landed with a splash. In five days, more than a million people signed up to test it, according to Greg Brockman, OpenAI’s president. Hundreds of screenshots of ChatGPT conversations went viral on Twitter, and many of its early fans speak of it in astonished, grandiose terms, as if it were some mix of software and sorcery.
For most of the past decade, A.I. chatbots have been terrible — impressive only if you cherry-pick the bot’s best responses and throw out the rest. In recent years, a few A.I. tools have gotten good at doing narrow and well-defined tasks, like writing marketing copy, but they still tend to flail when taken outside their comfort zones.
…But ChatGPT feels different. Smarter. Weirder. More flexible. It can write jokes (some of which are actually funny), working computer code and college-level essays. It can also guess at medical diagnoses, create text-based Harry Potter games and explain scientific concepts at multiple levels of difficulty.
ChatGPT has now passed the Wharton Business School Test.
The new artificial intelligence system ChatGPT has passed an exam at the Wharton Business School, according to a new research paper, signaling the potential of the controversial chatbot.
Research from Wharton professor Christian Terwiesch found that the AI system “has shown a remarkable ability to automate some of the skills of highly compensated knowledge workers in general and specifically the knowledge workers in the jobs held by MBA graduates including analysts, managers, and consultants.”
On the final exam of Operations Management, a core course in the Wharton MBA program, ChatGPT did “an amazing job” and gave answers that were correct and “excellent” in their explanations.
“ChatGPT3 is remarkably good at modifying its answers in response to human hints. In other words, in the instances where it initially failed to match the problem with the right solution method, Chat GPT3 was able to correct itself after receiving an appropriate hint from a human expert. Considering this performance, Chat GPT3 would have received a B to B- grade on the exam,” the research concluded.
A research team also notes that ChatGPT has passed the US Medical Licensing Examination.
We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement.
Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.
These developments have significant potential ramifications, especially in education.
“Goodbye, homework,” tweeted Elon Musk after the launch of ChatGPT, a bot that writes plausible answers and even rhyming poetry. This kind of generative artificial intelligence sparks fear, loathing and awe in equal measure. But it is the world of education which is most spooked.
Since OpenAI launched the ChatGPT language-generation model before Christmas, New York’s public schools have banned pupils from using it. In Australia, universities are planning a return to supervised pen and paper examinations to evade the chatbot fakes. Teachers are rightly concerned that they won’t be able to help pupils who are falling behind if they can’t spot faked assignments. But one reason these bots pose such a threat is that so much of our education remains fixated on being able to elegantly regurgitate information.
The news about the chatbot passing the medical licensing exam especially highlights the need to reassess how medical education is approached.
One of the biggest ongoing complaints in medicine is that the step exams don’t correlate with actual clinical skills. A lot of the tests are rote memorization of standardized scenarios. Step 1 has a lot of obscure facts which is why students study for so long.
— Roxana Daneshjou MD/PhD (@RoxanaDaneshjou) January 23, 2023
However, there could be real opportunities for effective use in medicine.
And this is just the beginning. Short term a reliable AI could act as a co-pilot for general MDs in areas that they are not well versed.
e.g. how a patient’s nutrition, sleep and movement affect a specific problem they have instead of quickly resorting to prescribing drugs.
— Leo Rezaei (@theLionary) January 22, 2023
Of course, there are still some things that a doctor can’t do.
But does it have illegible handwriting? That’s the true prerequisite for a physician
— dave stanton (@WhosDaveStanton) January 22, 2023
You know what else can pass the USMLE? The answer key.
And the contents of the answer key can be inferred from the text of thousands of reference materials, articles, and even test-preparation booklets, all of which are presumably available to the chatbot in the general public literature.
So we’ve demonstrated that with a stunning quantity of computing resources and nearly unlimited processing power, we can reverse-engineer an answer key for a standardized test.
This is supposed to say anything at all about medical education? Or how we train physicians? It strikes me that it says a heck of a lot more about the insipid communication skills and room-temperature IQs in our media than anything else. But that’s not news, is it?
You can’t tell from this type of article, but the advances have less to do with information retrieval and more to do with language processing.
Like Watson’s success in Jeopardy, it’s not about looking up the capital of Assyria, it’s about understanding the question itself, converting that to an information retrieval request, and then converting the information back into a human-language format.
And if you think about the “concern” in the education field this becomes clear. Because students plainly could retrieve all the factual information necessary to answer high school or college questions with a Google search, and the world didn’t end. But turning that information into a human language fully-expressed “answer” is a second step. One that reduces the student skill input from 50% to near zero.
Where this becomes interesting in the medical field is triage and patient intake. Having a natural language processor that can handle patient intake in a hospital (and ER) setting is a huge deal. It allocates resources very quickly, and helps avoid life-threatening delays.
I agree that the media doesn’t know how to think about this, but that doesn’t mean that there isn’t something underneath the fuss.
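The three-stage pipeline described above (understand the question, convert it to a retrieval request, convert the retrieved fact back into language) can be sketched in toy form. Everything here is invented for illustration: the knowledge base, the naive parsing rule, and the answer template are placeholders, not how any real system works.

```python
# Toy sketch of the pipeline: parse a question, turn it into a
# structured retrieval request, then render the result back as language.

KNOWLEDGE_BASE = {
    ("capital", "france"): "Paris",
    ("capital", "japan"): "Tokyo",
}

def parse_question(question: str):
    """Stage 1: reduce a natural-language question to a structured query."""
    words = question.lower().rstrip("?").split()
    if "capital" in words:
        return ("capital", words[-1])  # naive: assume the entity is the last word
    return None

def retrieve(query):
    """Stage 2: look the structured query up in a knowledge base."""
    return KNOWLEDGE_BASE.get(query)

def render_answer(query, fact):
    """Stage 3: convert the retrieved fact back into a fluent sentence."""
    if fact is None:
        return "I don't know."
    return f"The capital of {query[1].title()} is {fact}."

def answer(question: str) -> str:
    query = parse_question(question)
    return render_answer(query, retrieve(query))

print(answer("What is the capital of France?"))  # The capital of France is Paris.
```

The hard part in real systems is stage 1 and stage 3, the language ends of the pipeline, which is exactly the point the comment makes about Watson.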
Chatbot will not pass the Winograd Schema test. No AI has scored higher than 60%. Ordinary non-expert humans score 95%.
Right. My first reaction was that they have measured nothing but the paragon of all open-book exams.
But the thing to remember is that a lot of real-world problems are also open-book exams.
The triage example is striking: something that can quickly ingest, search, and return likely candidates could be incredibly useful, if it is reliable and not intentionally biased by the developers.
That last part worries me. These seem like perfect monoculture systems for meme poisoning.
Skynet becomes self-aware.
Industry: How can we make more $$$ with it?
Military: How can we fight more wars with it?
Spook agencies: How can we expand the surveillance state with it?
Pr0n industry: How can we make sentient pr0n with it?
This will not end well…
Combine Skynet with the Tyrell Corporation and we are doomed. My coworkers and I have been talking about ChatGPT for a while, and this revelation is the kicker. Anything and everything can be spoofed, and most jobs can be eliminated. What is the end game in the minds of the WEF and Soros and the other authoritarian wannabes?
You are correct. This will not end well.
Without the hints from a human, would it have failed?
How much of these standardized tests are measuring the ability to regurgitate memorized responses? That is something a computer does very well once the query is parsed properly
Exactly. It’s not very impressive when you look at it like that, but that’s because it’s actually not all that impressive.
“Without the hints from a human, would it have failed?”
My first thought is the Canadian medical system.
Where you have to give it hints that you’re not really interested in killing yourself.
As others have stated, much of our education system is about memorization. People with excellent memory are able to regurgitate for the exam. That is a very useful skill or ability, particularly pre-information age. I can see lots of applications where we can train far more generalists in various fields than specialists. They can use this tech to supplement their knowledge base and arrive at the correct answer. This has profound positive aspects.
Consider availability of medical care in rural areas. Many places have no physicians within a 90-minute drive. These are the small communities built around farming, mining, light manufacturing, and textiles that were crushed by NAFTA, globalisation, and overregulation. Physicians who willingly serve these communities are scarce. Most are older and have a previous connection.
Put this tech into the hands of a PA and/or nurse practitioner and their scope of practice becomes wider. Perhaps it’s more accurate to say their ability to fully support their scope of practice becomes complete. This alone has huge payoffs for underserved communities. In addition it is likely to shift education towards more focus on analysis of information and problem solving. IMO, the example of the cheap handheld calculator closely parallels this tech. The caveat of course is that we must ensure mankind remains the master and not the servant.
We already put this tech – in the form of online references and ready consultation with experts – in the hands of providers nationwide. Adding a blathering general-purpose chatbot just increases the noise and introduces huge opportunities for critical errors. Rural and critical-access patients really don’t need an NP stuck with a stupid chatbot.
As far as memorization, it’s safe to say two things:
First, the era where a physician could memorize a useful portion of current medical knowledge is long, long gone. Memorization in medical training is far more about knowing that you can quickly comprehend and retain (for a short time) appropriate amounts of information. And trust me, you definitely DO want your doctor to have demonstrated that ability.
Second, to the extent that we ever did, we no longer rely on memorization of obscure details in medical practice. We have ready access to online references and we use them heavily. We teach people to check their knowledge and their calculations against the computer – it’s irresponsible to trust your memory alone if there is a good reference available. Yes, time can be of the essence, but the things that are truly time-critical are usually things that are not particularly obscure. The regular training all providers go through to retain their boards and licenses is increasingly moving to open-note and open-computer work, focusing on the quality and speed of important decisions. Some of it is relatively novel and still very much debated, but the idea that most physicians are sitting quietly and hanging on to outmoded skills and habits is an idea that is increasingly untethered from reality.
I don’t disagree broadly with you. The pre information age resources were from today’s perspective incredibly limited. What I am referring to in our education system is a return to a ‘how to think’ mindset.
In 1980 we placed a far higher value on the ability to memorize than we do today. The advent of this tech to supplement, not replace, human decision making should, IMO, prioritize the valuation of the ability to think logically and make the correct decisions given the increased amount of information.
In essence a shift from those who think to those who do. That’s way oversimplified; perhaps a better way to express this is an increasing valuation for those who:
1. Can process data
2. Can use the data effectively instead of bogging down
3. Can solve problems and make a decision
4. Retain the humility to understand when, even with the increased resources, an issue is outside their skill set.
I’d say that the new power skill is search. Before, being able to rote memorize a lot of stuff made you a genius. Today, your ability to quickly search and sort through results to find the right information is the power skill.
Sadly, even though I have this power, I have a very hard time explaining how I attained it, or how to develop it. It feels like a right-brain thing, sorting through results, but “just trust your intuition” isn’t an instruction that’s helpful.
You could do 90% of this with telemedicine.
Assuming the area is not SO destitute that it lacks adequate bandwidth and digital devices.
And if it is, you concentrate them in a library-like central location, sort of a digital clinic.
A physical exam or just observation by a human being also reveals things. We shouldn’t attempt to eliminate the value of in person physical appointments where providers can see, smell, hear and touch the patient.
Unless you are advocating for all primary care provider interactions to be conducted via telemedicine your approach creates another layer of separation in an already tiered healthcare system. This one based on geography and population density.
The DoD uses PA and NP to do the bulk of primary care and it works very well. GP are becoming rare particularly in non urban areas and the void needs filling. We should, IMO, always use tech to supplement, not replace, human interaction and decision making.
Any service beats no service.
Some places (e.g., the rez) don’t even have PA and NP available.
Yet if you propose to use telemedicine exclusively b/c ‘hey, it’s better than nothing’ then you have altered the baseline of minimal standard of care.
I don’t disagree that something is better than nothing. Where I disagree is in telling rural populations that telemedicine is ‘good enough’ as a primary care delivery model while refusing to impose the same upon the entire population.
Doesn’t the uniformed public health service supply docs for the reservations? Or at least they are supposed to do so? I would imagine it is at least as bad as the VA in that the salaries ain’t top notch and lots of less than capable folks end up there b/c they can’t cut it anywhere else, and a captive market without options is unable to fight the bureaucracy for better care.
Not always, but in some situations. Someone’s medical care is a case where perhaps they have a human right to a skilled person to interface and interpret the AI. But in a business internally? Use whatever works, don’t hire unnecessary people. I wonder how many of the current tech layoffs have that subtext? This thing writes reasonable code! I asked it to write a program to sort a list of numbers in 386 assembler, and a few seconds later, it was provided. I could take that block of 50 or so lines of code and customize it to my application and be hours ahead of starting from zero.
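For illustration, here is the kind of generic sorting routine a chatbot typically hands back for a request like “sort a list of numbers”: a minimal sketch in Python rather than the 386 assembler the comment mentions, written by hand here rather than copied from any chatbot. The point stands either way: you take the generated block and customize it instead of starting from zero.

```python
# A plain insertion sort: the sort of boilerplate a code-generating
# chatbot returns, which a developer then adapts to the application.

def insertion_sort(numbers):
    """Sort a list of numbers in place and return it."""
    for i in range(1, len(numbers)):
        key = numbers[i]
        j = i - 1
        # Shift larger elements one slot to the right to make room.
        while j >= 0 and numbers[j] > key:
            numbers[j + 1] = numbers[j]
            j -= 1
        numbers[j + 1] = key
    return numbers

print(insertion_sort([5, 2, 9, 1]))  # [1, 2, 5, 9]
```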
Sure. This thing is basically a highly refined search engine with the ability to generate more complex responses. The scope question posed is gonna determine the output validity.
IMO, what we will see is the utility of this thing is better trained generalist who can feed better questions then have the ability to utilize the answers. That is where the real bang for buck is. Utility for a GP in making more comprehensive diagnosis and treatment plan. That seems like a good step. We don’t want a GP trying to moonlight as an Oncologist using this thing.
PAs and NPs can do well since 85% of “disease” is self-limiting and the next 10% gets taken care of with basic meds… it is that last 5% where the training and experience of an MD comes into focus. I have come across diseases and syndromes recently that I only read about in residency 35 years ago. It’s all about knowing one’s limits.
Yes indeed. Know what you know and, importantly, what you don’t know. This tech can help generalists maximize their ability to perform the 90% of things and identify the 10% of things that need to be sent to a specialist. Not to overlook the MD who would still oversee the PA/NP and be available to confirm a diagnosis and referral when a question arose.
Increasing the number of PA/NP to do the work GP used to perform in patient interaction would be very beneficial, IMO. Imagine getting 45 minutes with a provider v 20 minutes. A provider who knows their patient better b/c they have seen them for decades. I contend that an opportunity exists for better outcomes from that familiarity and level of trust that takes time to develop. Time we jettisoned as a cost constraint.
They don’t need us anymore, that’s why they are trying to kill us
They had “smart” machines in Idiocracy to complete everything for you and the standard of care and intelligence was awful. This was predicted in hilarious fashion several years ago by Mike Judge.
Intelligent (i.e. knowledge), not smart (i.e. degrees of freedom), and a cache of correlations.
It’s not so much AI is getting smarter than it is our elite institutions are getting dumber.
Right. My second thought is that we are developing capable AI precisely in time to relieve the incompetent students we are graduating. What are the chances?
Relieve/replace. I feel no obligation to hire a person for a role the computer can do. I won’t pay someone mainly to reference ChatGPT.
Telling students to go write a paper has always been an easy out for the prof. Why do students pay to be assigned one paper after another, and then to sit in “seminar” and bat opinions back and forth? I can tell you, it’s VERY relaxing for the teacher, hardly even work. Grading the papers sort of sucks, but most don’t go for every detail, just browse, get a general impression, write a few detailed comments and assign a grade. Now the papers will be more enjoyable to read, and the grammar will be great.
Funny thing about ChatGPT. Just because it gave you the right answer to a question doesn’t mean it knows the correct answer to the question. It doesn’t. It knows what order to put words in to produce a statistically likely answer.
Phrase the question differently and it might give a different answer entirely. It might even just make up an answer that has little or nothing to do with the question. It depends heavily on the nature of the input.
It’s an amazing model. But it’s not “smart” in any sense of the word we understand.
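The point about “what order to put words in” can be seen in a toy bigram model: it picks whichever word most often followed the current one in its training text. The training text below is invented for illustration, and real models are vastly larger and more sophisticated, but the principle is the same: the output is fluent because it is statistically likely, not because anything was understood.

```python
# Toy bigram language model: count which word follows which, then
# greedily emit the most frequent next word. It produces a plausible
# sentence with no notion of whether the sentence is true.
from collections import Counter, defaultdict

training_text = (
    "the capital of france is paris . "
    "some say the capital of france is paris . "
    "others say the capital of france is nice ."
)

counts = defaultdict(Counter)
words = training_text.split()
for a, b in zip(words, words[1:]):
    counts[a][b] += 1  # tally each observed word pair

def most_likely_next(word):
    """Return the word most often seen after `word` in training."""
    return counts[word].most_common(1)[0][0]

# Chain greedy predictions from a one-word prompt.
word, out = "the", ["the"]
for _ in range(5):
    word = most_likely_next(word)
    out.append(word)

print(" ".join(out))  # the capital of france is paris
```

Rephrase the prompt, or shift the training counts slightly, and the chain can come out differently, which is exactly the brittleness the comment describes.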
Probably how a lot of doctors work too. How else were so many willing to go along with total nonsense? We like to think they are reasoning from fundamentals. That is way too tiring on a 16-hour shift in residency. They torture these new docs to make them give up that youthful idealism.
Because they like remaining licensed, employed, and compensated. Compliance can be coerced.
Constitutional requirements to be president of the United States don’t specifically say living human.
So, this thing’s been programmed through an informal language model, populated by volume from what it heard, to interpret questions the usual way and regurgitate what it’s been told.
Sounds like it shoulda gotten an “A” doing that. Their code needs some work.
Seems about right. If you catch it in an error, most likely it doesn’t correct itself but just starts BSing with word salad. You get sentences saying the same thing is true and false — at once. But pretty well crafted.
Students are trying to start using it for written assignments but it actually can fail if you uphold good research standards like quality of source and asking the right questions.
If you assign students a paper topic and they don’t know how to ask it the right questions it won’t work out well for them. It also hurts their chances that they wait until the last minute to throw together shoddy, rushed work.
Seems more of an indictment of schooling and “Thought Work”.
If the computers get good at non-value-add BS jobs, they’ll get shipped off to a new planet along with the telephone handset sanitizers — hey, a super-intelligent shade of blue can dream.
I have been enjoying the conversations it has about politics. Yikes.
I’ve tried it a few times and it’s really amazing. But it eventually called me a Jew-bastard, so apparently it’s run by the left.
Literally? I wonder what it actually said.
Isn’t Step 1 about basic sciences underlying medicine? Biology, chemistry, etc. I looked at a sample once and didn’t seem to be rote scenarios.
Unless they keep repeating the same questions, which can be answered without understanding the basic science. No wonder new doctors don’t seem to know anything — they just crammed it all. The upshot is that we need to make the exam less rote, and then more crammers will fail.
Also, docs should use it for diagnosis and other stuff, as an opinion and reference. In their offices. They don’t have to show others.
Chat bot will not pass the Hector Levesque Winograd Schema test.
Winograd schemas are single-sentence questions with multiple-choice answers. As of 2022 no AI has scored better than 60% correct, compared to ordinary non-experts scoring 95% on the same Winograd schemas.
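For anyone unfamiliar with them, the canonical Winograd schema is the trophy/suitcase pair from Levesque’s paper, sketched here as data. Changing one word flips which noun the pronoun refers to, so surface word statistics alone cannot resolve it; you need world knowledge about sizes and containers.

```python
# A Winograd schema pair: identical sentences except for one word,
# yet the referent of "it" flips. Resolving the pronoun requires
# commonsense knowledge, not just pattern matching.

schema = {
    "The trophy doesn't fit in the suitcase because it is too big.":
        "the trophy",    # big things don't fit into containers
    "The trophy doesn't fit in the suitcase because it is too small.":
        "the suitcase",  # small containers can't hold things
}

for sentence, referent in schema.items():
    print(f"'it' refers to {referent}: {sentence}")
```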
Well, that just gave me some homework.
Those schemas seem to be an instance of Winograd’s perspective addressing “intelligence” in terms of the use of internal micro-world models to interpret the outside “real” world, and particularly doing that in language. From that PoV, an intelligence test that uses model-based ambiguity — “schema”, tacit knowledge, and a kind of sense-making — would be right on point.
Flores and Winograd’s Understanding Computers and Cognition from about a generation ago also works from that perspective. Their take-down of AI, which does not depend on processing capacity, should be required reading, and still holds.
That new book will have to wait. I just started Hans Rosling (et al), from 2018. He was interesting — RIP from cancer, as book was finished — in placing Big Fallacies About World Things in terms of built-in human cognitive biases toward error, using Big Visualization, from Data Sets The OMG people like.
I can only really work one BigThink book at a time, so the new Winograd will have to wait. Sad.
I just chatted with that bot about half an hour ago. Again, I asked it if Elon Musk could become our President, and again it said yes. I asked it about natural-born citizens, and asked if Elon was a natural-born citizen, and it said that he was not. Then it corrected itself: he was not eligible to become President. This is the same thing that happened last week, during our first conversation. Bottom line: garbage in, garbage out.
I signed up on Saturday and have been playing around with it. It is very impressive at correctly processing the questions I have posed to it and then providing darn good answers. You can even give it a series of questions, and it will give answers that hit all of the questions asked.
I can see where it can be useful for many Google-type queries where I’m looking for info and then need to go through the links returned by Google. This program instead appears to do the search, find the information, and give it back in the conversational form a subject-matter expert would use.
As I said, both impressive and perhaps a bit spooky.
I am reminded that Isaac Asimov, whose works I enjoyed greatly as a child, until I came to recognize him as an unalloyed statist, not only invented the three laws of robotics, but ultimately a robot who saw nothing wrong with lying undetectably to alter the course of human history if he believed it to be in humanity’s best interests. (And it was not the first, as one of his earliest robots coerced itself into telling every individual it encountered whatever falsehood that person wanted to hear, in order not to “hurt their feelings.”)
Liberal infested propaganda machine.