In the rapidly evolving landscape of tax, innovation is key to staying ahead. One of the latest tools in our arsenal as tax specialists is Generative AI (GenAI), specifically models like GPT-4, which have shown promise in various professional domains. However, the effectiveness of these models often hinges on a crucial aspect: how users engineer their prompts. But what exactly is prompt engineering, and why does it matter in the world of tax?
Understanding Prompt Engineering in Tax
Prompt engineering is the process of crafting specific, nuanced instructions (or prompts) to elicit the best possible responses from GenAI models like GPT-4. This is particularly relevant in the tax profession, where accuracy, precision, and traceability are paramount. By refining how we communicate tasks to AI, we can significantly enhance the quality of the outputs, making them more useful for tax professionals across various domains.
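To make this concrete, here is a minimal, hypothetical sketch of the difference between a naive prompt and an engineered one, assuming access to GPT-4 through the OpenAI Python client (v1+). The prompts and the expense scenario are illustrative inventions, not the prompts used in our experiment.

```python
# A minimal sketch of prompt refinement, assuming the OpenAI Python client (v1+).
# The prompts and the expense scenario below are illustrative, not the exact
# prompts from our experiment.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# A naive prompt: terse, with no role, context, or output constraints.
naive_prompt = "Is this expense deductible?"

# An engineered prompt: explicit role, audience, context, and output format.
engineered_prompt = (
    "You are a senior corporate tax advisor. "
    "Your audience is an in-house tax manager preparing a filing position.\n"
    "Task: assess whether the expense below is deductible for corporate "
    "income tax purposes, and name the rule you rely on.\n"
    "Expense: client entertainment dinner, EUR 450.\n"
    "Answer in three parts: (1) conclusion, (2) reasoning, (3) caveats."
)

for prompt in (naive_prompt, engineered_prompt):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```

The engineered version typically yields an answer that is easier to verify and reuse, because the structure and the intended reader are spelled out rather than left for the model to guess.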
The Experiment: Testing the Impact of Prompt Engineering
To quantify the impact of prompt engineering on tax-related tasks, we conducted an experiment involving specialists from various tax departments: direct tax, indirect tax, payroll tax, international tax, tax compliance, and transfer pricing. These specialists provided us with standard tasks typical of their practice areas. The tasks were aligned with five key GenAI use cases: generating tax outputs, summarizing tax text, classifying items for tax purposes, translating tax outputs, and conducting tax research.
We initially executed these tasks using a secure GenAI solution based on GPT-4, asking the specialists to rate the outputs according to an ISO standard for data quality. Following this, we repeated the tasks while incorporating various prompt engineering techniques, including Few-Shot, Persona, Audience, Output/Instruction, Template, and Chain-of-Thought prompting.
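To illustrate how these techniques compose, the sketch below assembles a single prompt that layers Persona, Audience, Few-Shot, Output/Instruction, Chain-of-Thought, and Template elements. The wording is hypothetical and does not reproduce the templates our specialists actually used.

```python
# A hypothetical composition of the named techniques into one prompt template.
# All wording is illustrative; it is not taken from our experiment.

def build_prompt(question: str) -> str:
    persona = "You are an experienced VAT specialist."             # Persona
    audience = "Write for a tax compliance officer."               # Audience
    few_shot = (                                                   # Few-Shot
        "Example:\n"
        "Q: Is the supply of e-books standard-rated?\n"
        "A: In several EU member states e-books may qualify for a reduced "
        "rate; check the local rate table before concluding.\n"
    )
    instruction = (                                                # Output/Instruction
        "Answer the question below. Think through the applicable rules "
        "step by step before stating your conclusion."             # Chain-of-Thought
    )
    template = (                                                   # Template
        "Respond using this structure:\n"
        "Reasoning: <step-by-step analysis>\n"
        "Conclusion: <one-sentence answer>\n"
        "Sources: <rules or guidance relied on>"
    )
    return "\n\n".join(
        [persona, audience, few_shot, instruction, template, f"Q: {question}"]
    )

print(build_prompt("Is the supply of online training courses subject to VAT?"))
```

The value of stacking techniques like this is that each one constrains a different failure mode: the persona and audience anchor tone and depth, the example anchors style, and the template makes the output easy to review and trace.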
Our Findings: The Power of Prompt Engineering
The results of our experiment were clear: prompt engineering significantly improved the quality of outputs across all tax tasks, raising output quality scores by 14% on average. In our colleagues' terms, the 'report grades' of the answers jumped from a 7.4 to almost an 8.5 (a 14% lift on 7.4 lands at roughly 8.4).
When breaking down the results by use case, we observed that prompt engineering had the most significant impact on tax research, with a remarkable 28% improvement in output quality. This suggests that prompt engineering is particularly effective in scenarios requiring complex reasoning and the synthesis of information. On the other hand, the impact on summarization tasks was more modest, with only a 1% improvement, indicating that GenAI is already quite adept at condensing information in a useful way.