The benefits of generative artificial intelligence tools like ChatGPT have been chronicled at length by the media, technology vendors and conference presenters. But how do these tools really stack up when HR professionals put them to the test in their daily tasks and projects?
Two recent workplace experiments pitted humans against machines, and the results were telling about where these tools can have real benefit for human resource practitioners, and where their use can actually decrease performance and the quality of service provided to HR’s clients.
Experiment 1: Testing ChatGPT’s Responses to Compliance and HR Questions
Mineral is an organization that handles a wide range of compliance and HR issues for more than 1 million clients. When ChatGPT became widely available early in 2023, leaders at the Portland, Ore.-based company began studying whether the technology could help Mineral’s HR experts address compliance questions more efficiently or effectively.
Those discussions soon included Susan Anderson, chief services officer for Mineral, who had mixed reactions when she first witnessed ChatGPT in action. “I saw it was capable of having conversational interactions and providing what seemed like credible answers to many questions,” Anderson said. “I thought it either was going to put us out of business or it would be a boon to help my service team leverage technology in new ways.”
To test that hypothesis, Mineral conducted an experiment: Could ChatGPT outperform Mineral’s human experts and improve the results it delivered to clients?
The company created a six-week test to see how three different versions of ChatGPT (3.0, 3.5 and 4.0) would respond to questions and tasks in four content areas: the Fair Labor Standards Act (FLSA), the Family and Medical Leave Act (FMLA), the Americans with Disabilities Act (ADA), and immigration. Questions and tasks ranged in complexity and covered topics like termination issues, employee leave and salary transparency.
Mineral’s seasoned HR experts scored ChatGPT’s answers in six categories: accuracy, context relevancy, consistency, brevity, bias level and practical applicability.
“We also wanted to pose some questions that weren’t straightforward and would really test whether ChatGPT was meeting our standards for quality and accuracy,” Anderson said.
The Results Are In: ChatGPT’s Report Card
How did ChatGPT perform? Not surprisingly, later versions of the technology, GPT-3.5 and especially GPT-4.0, performed much better than the original version.
“The vast majority of questions posed to GPT-3 failed by our scoring standards,” Anderson said. Why? GPT-3’s responses often were missing necessary, nuanced details.
“For example, you might ask, ‘What are the steps necessary to terminate an employee?,’ which feels like a straightforward question,” Anderson said. “But we often find when you have a conversation with an employer about this there are extenuating circumstances and more considerations and history attached to that question than originally thought. Such complexity requires a human in the loop, or for the technology to be used collaboratively with our HR experts, not simply relied upon on its own.”
Even though GPT-3.5 and GPT-4 performed better, humans are still needed to manage the process.
Another scenario requiring more nuanced responses than ChatGPT provided involved employee leave, Anderson said. “Many businesses have to take into account not just federal and state laws for leave management but local regulations as well,” she said. “We found GPT didn’t always factor in local laws or other extenuating issues around employee leave.”
GPT-4 outperformed earlier versions of the technology in all six categories measured on the 10 questions it was given. Anderson said her staff also experimented with using Mineral’s own extensive data to enhance the results of out-of-the-box GPT, efforts that produced both successes and lessons learned.
“Our goal was to ‘fail fast’ and quickly apply any lessons learned to improve how we might use ChatGPT to both drive new efficiencies on our team and improve service to our clients,” Anderson said. It’s also important for HR teams to identify best practices for creating prompts for generative AI, she explained, since the quality of responses is dependent on using well-constructed, detailed prompts.
Anderson believes generative AI’s capabilities will continue to improve as updates are released. GPT-4 Turbo, for example, has access to data and events up to April 2023, whereas previous iterations were limited to 2021 and earlier. Even so, she believes trained HR pros will still be needed in the loop, both for their content expertise and the emotional intelligence they bring to client interactions.
“There are straightforward, binary questions we receive from customers that GPT can often handle on its own, such as what the minimum wage might be in a given state,” Anderson said. “But when it comes to compliance or HR questions with more situational complexities, that’s where we found it’s far better to have a collaboration between the technology and human expertise.”
Experiment 2: Testing Generative AI’s Value for Knowledge-Intensive Tasks
Another recent study set out to understand whether GPT-4 could have value for highly educated knowledge workers. The Harvard Business School collaborated with the Boston Consulting Group (BCG), a global management consulting firm, to study the performance implications of generative AI on real-world, knowledge-intensive tasks.
“Technologies such as generative AI often have been designed to perform repetitive, lower-level tasks, but we wanted to study its value more specifically for creative and knowledge work,” said Fabrizio Dell’Acqua, a postdoctoral research fellow at the Harvard Business School.
In the study, 758 consultants at BCG were randomly assigned to three groups: those who would work on assigned tasks with no access to GPT-4; those with access to GPT-4; and those with access to GPT-4 that also included a prompt engineering overview.
Some tasks the consultants were asked to complete were believed to be more easily handled with the aid of generative AI, Dell’Acqua said. One example was to develop a new footwear product. Consultants were asked to propose a product design, write memos to company executives and develop marketing plans, among other duties. “For these types of tasks GPT-4 proved to be very helpful,” Dell’Acqua said.
But a major finding of the study was that GPT-4 can decrease performance on certain knowledge-intensive tasks. “It turns out that for some of these tasks, giving the consultants GPT-4 to use actually impaired their performance,” Dell’Acqua said. “They performed almost 20 percent worse than consultants who were operating without GPT-4.”
One task where GPT-4 proved to be a detriment asked BCG consultants to analyze business cases and provide recommendations to a company’s top executives. The consultants were given access to relevant data as well as interviews with company employees to form their recommendations.
“In our control group 85 percent of the consultants completed the business case successfully on their own, but if the consultants were just to copy and paste the answers they received from GPT-4 without additional validation and review, the answers were incorrect,” Dell’Acqua said.
Analyzing the business cases required complex thinking beyond the capabilities of GPT-4, he said. “It was about combining different pieces of information relevant for the complex process in this exercise, much like putting together a puzzle,” he said. “It’s something that GPT-4 by itself wasn’t enough for and highly skilled humans proved much better at.”
The study’s findings highlighted the importance of organizations being selective in how they employ GPT-4—as well as future iterations of the technology—to improve business operations.
“The study identified a contrast where there are big benefits generative AI can provide but at the same time areas where knowledge workers want to be careful in using it,” Dell’Acqua said. “The technology frontier is constantly shifting, and there’ll be new versions of generative AI to address it. But there will likely remain some capabilities outside of the frontier where AI may not be as effective, and it’s important for organizations to understand that.”
Dave Zielinski is principal of Skiwood Communications, a business writing and editing company in Minneapolis.