The opinions expressed in this note are those of the authors and should not be attributed to the Bank of Italy. We would like to thank Luigi Bellomarini, Marco Benedetti, Livia Blasi, Andrea Gentili, Alessandro Maggi, Michele Savini Zangrandi, Giovanni Veronese, and Giuseppe Zingrillo.
Large language models (LLMs), artificial intelligence models that learn the syntax and semantics of human language, are sometimes portrayed as a groundbreaking productivity aid – including for creative work. We run an experiment to assess the potential of ChatGPT, one of the leading LLMs, for complex writing tasks. We ask the model to compose a policy brief for the Bank of Italy’s Board. We find that ChatGPT can accelerate workflows by providing well-structured content suggestions, and by producing linguistically correct text. It does, however, require a significant amount of expert supervision, which partially offsets productivity gains. If ChatGPT is used without sufficient preparation, the output may be incorrect, superficial or irrelevant.
Large language models (LLMs) are machine learning models trained to capture the syntax and semantics of language. They do so by scanning very large textual datasets and learning how words combine to form sentences. Users can interact with LLMs without any coding skills: they can, for instance, ask a question in ordinary language, and the model answers in ordinary language too.
For a long time, LLMs commanded little attention outside of specialist circles: their performance was poor when measured by how human-like they sounded. This changed in late 2022, when OpenAI’s ChatGPT, built on GPT-3.5, convincingly simulated human conversational abilities. Other models followed, and within a few months LLMs appeared ready to “disrupt even creative [and] tacit-knowledge […] work” (Noy and Zhang, 2023).
We ran an experiment to test this claim. We asked ChatGPT [1] to compose a policy brief for the Board of the Bank of Italy. We found that the model can accelerate workflows, first by providing structured content suggestions, then by producing linguistically correct text in a matter of seconds. It does, however, require a substantial amount of expert supervision, which partially offsets productivity gains.
Ours is by no means the first experiment on the use of ChatGPT for non-trivial intellectual tasks. To name but a few recent contributions in economics, Korinek (2023) discusses the model’s potential for research, while Cowen and Tabarrok (2023) focus on its use in teaching. Taliaferro (2023) looks at how ChatGPT performs at constructing novel datasets. Hansen and Kazinnik (2023) assess whether ChatGPT can decipher “Fedspeak”, the language used by the Federal Reserve to communicate its policy stance. Eisfeldt, Schubert and Zhang (2023) find that the release of ChatGPT had a positive impact on equity value for firms with a high share of “tasks currently performed by labor […that can] be performed (or made more efficient) by Generative AI”, including LLMs.
The pitfalls of using ChatGPT naively, and the importance of expert supervision, are evident first and foremost from the large body of work on prompt optimization. ChatGPT generates content in response to prompts, which do not necessarily come in the form of questions. Sometimes even small tweaks can trigger dramatic changes in the output. For example, Kojima et al. (2022) find that simply appending “Let’s think step by step” to prompts vastly improves large language models’ performance on challenging reasoning questions.
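By way of illustration, the following is a minimal sketch, ours rather than Kojima et al.’s, of what such a tweak looks like through the OpenAI Python SDK; the model name and the example question are illustrative, and the experiment described in this note was conducted through the chat interface rather than the API.

```python
# A minimal sketch (ours, not from Kojima et al.) of the prompt tweak
# described above. Assumes the OpenAI Python SDK (v1.x) and an API key
# in the OPENAI_API_KEY environment variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

QUESTION = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls "
    "are there?"
)

# Compare the plain question with the zero-shot chain-of-thought variant.
for suffix in ("", " Let's think step by step."):
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; any chat model can be used
        messages=[{"role": "user", "content": QUESTION + suffix}],
    )
    print(f"--- suffix: {suffix!r} ---")
    print(response.choices[0].message.content)
```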
We started our experiment by asking ChatGPT to find an appropriate communication style for our task.
The response surprised us: the cultural stereotyping it contained does not represent facts accurately and seems misaligned with the spirit, if not the letter, of ChatGPT’s usage policies. [2]
Eventually, we understood that ChatGPT had copied the answer from the internet. The source listed was the website for a private “cultural awareness training consultancy” [3], found through a Bing search. We do not know why this particular result was selected. Once we told the model to rely on its internal body of knowledge, as opposed to going online, we obtained an appropriate answer. [4]
This exemplifies a frequently reported issue with ChatGPT. Its training data has a cut-off date (at the time of writing, April 2023). If asked about more recent events, the AI may deliver outdated information that some users will not recognize as such. Internet browsing may seem like the obvious fix, yet ChatGPT can fail to “think critically” when faced with information that was not in its training set.
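For readers who interact with the model through the API rather than the chat interface, a sketch of the workaround described above, i.e. instructing the model to rely on its internal knowledge, might look as follows. The system message wording is ours, not a verbatim reproduction of our prompts.

```python
# Sketch (ours): steering the model toward its internal knowledge and
# away from live web results, via a system message. Assumes the OpenAI
# Python SDK (v1.x); the model name is illustrative. Note that API
# models do not browse by default, so here the instruction mainly
# guards against answers that pretend to reflect current information.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # illustrative
    messages=[
        {
            "role": "system",
            "content": (
                "Answer from your internal training knowledge only. "
                "Do not search the web or cite live sources. If a "
                "question concerns events after your training cut-off, "
                "say so explicitly instead of guessing."
            ),
        },
        {
            "role": "user",
            "content": (
                "What communication style is appropriate for a policy "
                "brief addressed to a central bank's Board?"
            ),
        },
    ],
)
print(response.choices[0].message.content)
```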
We then proceeded to the main task. First, we requested an outline for a note titled “Benefits and risks of using ChatGPT and similar applications in economics and finance”. With only very brief further instruction on the desired content, the model produced a complete outline.
The production of outlines is among the tasks for which we found ChatGPT most useful. Acceptable quality can be obtained in a few seconds and without sophisticated prompt engineering; in our case, it took only two prompts. At first blush, the outline seemed to cover the most relevant topics, offered a clean structure, was sufficiently interdisciplinary, and appeared appropriate for the intended audience.
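As an aside for readers who wish to script this step, a request of this kind could look as follows. This is a sketch under our own assumptions: the experiment itself ran through the chat interface, and only the title is taken verbatim from our prompts.

```python
# Sketch (ours): requesting a structured outline programmatically.
# The title is the one used in the experiment; the surrounding
# instructions are illustrative, not our verbatim prompts.
from openai import OpenAI

client = OpenAI()

TITLE = (
    "Benefits and risks of using ChatGPT and similar applications "
    "in economics and finance"
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative
    messages=[
        {
            "role": "user",
            "content": (
                f'Draft an outline for a policy note titled "{TITLE}". '
                "The audience is the Board of a central bank: senior, "
                "non-specialist, and short on time. Use numbered "
                "sections, each with a one-line description."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```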
We then requested a 2,500-word essay based on the outline. Space constraints prevent us from copying the essay in this note; see Biancotti and Camassa (2023) for the full text. Our key takeaways in analyzing the output were as follows.
ChatGPT can write clearly, and provide task-appropriate content. It is especially valuable for producing outlines on any topic, a very fast process that can support human exploration of ideas. The AI also works well for editing and formatting tasks.
On the other hand, it requires a substantial amount of expert supervision. The task at hand — writing a policy brief — is admittedly complex: it requires not just writing fluency, but also cross-domain knowledge and the ability to tailor the text to a very specific audience without diluting the information content.
We find that ChatGPT’s attempts at this task are not always on point, and easily drift into banality. This is a serious issue for policy advice directed at a high-level audience. The software can also generate outright false claims, so double-checking the output for accuracy is essential.
The algorithm is also sensitive to how instructions, or prompts, are formulated. As long as the AI cannot think like a human, it is humans who have to think like an AI and phrase requests in the way most likely to generate acceptable results. Optimizing prompts for institutional communication is one evident avenue for future research. Another is fine-tuning LLMs so that they can reliably produce the domain-specific, possibly long-tail knowledge (Kandpal et al., 2022) required in our reference context.
We conclude that ChatGPT can enhance productivity in policy-oriented writing, especially in the initial phase of outlining and structuring ideas, provided that users are knowledgeable about LLMs in general and about the peculiarities of ChatGPT in particular. Naive use leads to low-quality output and should be avoided.
The AI agrees with us. In its own words, “while ChatGPT can generate content at a high level and provide valuable information on a wide array of topics, it should be seen as a tool to aid in research and discussion, rather than a replacement for true expert analysis and insight. It’s best used to provide general information, generate ideas, or aid in decision-making processes, but should always be supplemented with rigorous research and expert opinion for high-level academic or professional work”.
Biancotti, C. and Camassa, C. (2023), Loquacity and Visible Emotion: ChatGPT as a Policy Advisor, Bank of Italy Occasional Papers 814.
Cowen, T. and Tabarrok, A. (2023), How to Learn and Teach Economics with Large Language Models, Including GPT, George Mason University Working Paper in Economics 23-18.
Eisfeldt, A. L., Schubert, G. and Zhang, M. B. (2023), Generative AI and Firm Values, NBER Working Paper 31222.
Hansen, A. and Kazinnik, S. (2023), Can ChatGPT Decipher Fedspeak?, mimeo, Federal Reserve Bank of Richmond.
Kandpal, N., Deng, H., Roberts, A., Wallace, E. and Raffel, C. (2022), Large Language Models Struggle to Learn Long-Tail Knowledge, arXiv preprint arXiv:2211.08411.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. and Iwasawa, Y. (2022), Large Language Models are Zero-Shot Reasoners, arXiv preprint arXiv:2205.11916.
Korinek, A. (2023), Language Models and Cognitive Automation for Economic Research, CEPR Discussion Paper 17923.
Noy, S. and Zhang, W. (2023), Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence, Science, 381(6654), 187-192.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … and Lowe, R. (2022), Training Language Models to Follow Instructions with Human Feedback, Advances in Neural Information Processing Systems, 35, 27730-27744.
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., … and Kaplan, J. (2022), Discovering Language Model Behaviors with Model-Written Evaluations, Findings of the Association for Computational Linguistics: ACL 2023, 13387-13434.
Taliaferro, D. (2023), Constructing Novel Datasets with ChatGPT: Opportunities and Limitations, VoxEU, June 15.
[1] ChatGPT 4.0, base version of May 24, 2023. We used the TeamGPT app to collaborate.
[2] OpenAI, Usage Policies, updated on March 23, 2023.
[3] The source provided by ChatGPT is World Business Culture.
[4] This response was obtained using the same prompt as above. All remaining interactions in this Section and in Section 4 were part of a single thread.