Over the past six months, I conducted a within-subjects study involving 24 software developers to observe their interactions with various LLM-based programming assistants. The developers tackled a range of programming tasks, while I collected data for 16 distinct productivity metrics using an activity tracker, self-assessment surveys, and screen recordings. The results of my study, which are summarized in this series of articles, offer a nuanced perspective on the current state of AI programming assistants and their influence on software developers.
Articles: Part 1 | Part 2 | Part 3
Recap
In this article, I present the results of my study, which serve as the basis for a nuanced discussion of how LLM-based programming assistants affect developer productivity.
To recap my previous article, I invited 24 developers to work on various Python programming tasks in three different development environments. One environment served as the control scenario and had no programming assistant installed. Another environment had GitHub Copilot installed as a representative of autocomplete AI assistants. The last environment had an integrated version of ChatGPT installed, representing conversational AI assistants.
The order of the environments was randomized using a balanced Latin square design of size three, and the task order was randomly shuffled. While the participants worked on the tasks, I gathered 16 productivity metrics across the dimensions of Satisfaction, Performance, Communication, Activity, and Efficiency (Forsgren et al. 2021).
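For readers unfamiliar with this counterbalancing technique, the sketch below shows how a balanced Latin square of size three can be generated and assigned to 24 participants. It is a minimal illustration only: the condition names and the round-robin assignment are my assumptions here, not the study's actual tooling.

```python
from itertools import cycle

CONDITIONS = ["control", "autocomplete", "conversational"]

def balanced_latin_square(conditions):
    """Return counterbalanced presentation orders for the given conditions."""
    n = len(conditions)
    # Standard first row of a balanced Latin square: 0, 1, n-1, 2, n-2, ...
    first = [0]
    lo, hi = 1, n - 1
    while len(first) < n:
        first.append(lo)
        lo += 1
        if len(first) < n:
            first.append(hi)
            hi -= 1
    rows = [[conditions[(x + shift) % n] for x in first] for shift in range(n)]
    if n % 2 == 1:
        # With an odd number of conditions, balance additionally requires
        # the mirrored square, yielding 2n orders (here: six).
        rows += [list(reversed(row)) for row in rows]
    return rows

# 24 participants cycle through the six orders, so each order is used four times.
for participant, order in zip(range(1, 25), cycle(balanced_latin_square(CONDITIONS))):
    print(f"P{participant:02d}: {' -> '.join(order)}")
```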
TLDR Infographic
The most important findings discussed in this article are summarized in this infographic:
Impact of LLM-based programming assistants on developers
First, I will outline the general impact that LLM-based assistants had on the developers, irrespective of the type of assistant utilized. Despite the core differences between the two AI assistants used in this study, their influence on developers was surprisingly similar.
💡 Increase in development speed: Having access to an AI assistant improved the speed of the developers. Working in either of the two AI-assisted environments significantly increased the median number of requirements a developer was able to implement per coding session, by roughly 67.5% for the autocomplete assistant and 66.7% for the conversational assistant. While participants completed a median of six requirements per programming session in the control scenario, the introduction of an AI assistant raised this figure to ten requirements in both AI-enabled environments. These results are consistent with the findings of Peng et al. (2023), who observed that developers with an AI assistant completed their task 55.8% faster and likewise concluded that programming assistants have a positive impact on the speed of developers.
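As a quick sanity check, the magnitude of this speed-up follows directly from the session medians reported above (the per-assistant percentages differ slightly from this median-based figure, presumably because they were derived from the full per-participant data):

```python
# Relative speed-up implied by the session medians (6 vs. 10 requirements).
control_median = 6
assisted_median = 10
relative_increase = (assisted_median - control_median) / control_median
print(f"Relative increase: {relative_increase:.1%}")  # -> 66.7%
```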
💡 No effects on code quality: The increased speed of developers when using an AI assistant raises the question of whether this improvement compromised the quality of the code. The results show that this is not the case, as the correctness of the implemented requirements did not significantly differ across the three environments. The median correctness of the implemented requirements lies between 95% and 100% for all three environments. Taking other studies into account, it is reasonable to assume that AI programming assistants do not compromise the quality of the code. Sandoval et al. (2022) observed that AI assistants did not influence the occurrence rates of severe security bugs. Moreover, Asare et al. (2022) and Pearce et al. (2021) discovered that AI assistants introduce vulnerabilities at a rate similar to human developers. Therefore, it appears that current AI assistants merely aid developers in crafting the code they would have composed anyway. Consequently, the quality of the AI recommendations is probably not the primary determinant of code quality. Rather, it is the developer's proficiency in structuring, selecting, and reviewing suggestions that plays a vital role.
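For concreteness, the correctness metric can be pictured as the share of implemented requirements that behave as specified. The sketch below is a hypothetical simplification of the study's evaluation, with one boolean result per requirement:

```python
def correctness(requirement_results):
    """Share of implemented requirements that passed their checks.

    requirement_results: list of booleans, one per implemented requirement.
    """
    if not requirement_results:
        return 0.0
    return sum(requirement_results) / len(requirement_results)

# Example: 19 of 20 implemented requirements are correct -> 95%.
print(f"{correctness([True] * 19 + [False]):.0%}")
```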
💡 Developers assume a guiding role: In the AI-assisted environments, the percentage of characters that were suggested by the tools and subsequently integrated into the code increased. In the control environment, merely 22.8% of the produced code originated from browser recommendations. In the AI-enabled environments, however, this figure increased significantly: 62.7% of the total characters came from the AI assistant in the autocomplete scenario, and 90.1% stemmed from the chatbot in the conversational scenario. These results show that developers with access to an AI assistant wrote less code themselves and instead relied more on the suggestions of the assistants. Their role therefore changed from actively conceptualizing, remembering, and typing code themselves to guiding and correcting the AI assistant that generated the majority of the characters. Instead of only solving requirements on a line-by-line basis, developers assumed a higher-level role and guided the AI assistants at a function or even script level. Mozannar et al. (2022) also observed that introducing GitHub Copilot caused the developers' behavior to shift drastically: developers with access to Copilot spent more than half of their time on Copilot-related activities, and double-checking the assistant's recommendations became the most prevalent activity.
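To make the character-attribution metric concrete, the following sketch computes the share of produced characters that originated from accepted suggestions. The `(source, text)` event log is a hypothetical simplification of the activity-tracker and screen-recording data used in the study:

```python
def ai_character_share(events):
    """Fraction of produced characters that came from accepted suggestions.

    events: iterable of (source, text) pairs, where source is either
    "human" (typed by the developer) or "assistant" (accepted suggestion).
    """
    totals = {"human": 0, "assistant": 0}
    for source, text in events:
        totals[source] += len(text)
    produced = sum(totals.values())
    return totals["assistant"] / produced if produced else 0.0

session_log = [
    ("human", "def parse_row(line):\n    "),
    ("assistant", "return [int(x) for x in line.split(',')]"),
]
print(f"AI share: {ai_character_share(session_log):.1%}")
```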
💡 AI assistants replace traditional browser tools: Having access to an AI assistant decreased browser usage to nearly zero, even though the browser remained visibly open on the left side of the screen throughout the study. Thus, both AI assistants used in this study substituted traditional tools accessed via the browser, such as online documentation and Q&A sites. Developers favored the autocomplete assistant over browser tools for its proactiveness and seamless integration, while the conversational assistant was preferred for being flexible and allowing the participants to engage in higher-level conversations. Despite the clear preference for AI-enabled environments, the participants negatively remarked that the assistants occasionally produced incorrect code or deviated from the goals stated in the prompts.
💡 Developers enjoy working with AI assistants: The developers rated both their satisfaction and the overall support they perceived higher for the AI-enabled environments. Additionally, they stated that the pragmatic quality, hedonic quality, and attractiveness of the development environments were higher when an AI assistant was available. This positive perception of the AI assistants sparked feelings of satisfaction and joy among the developers.
The presented results highlight that LLM-based programming assistants like GitHub Copilot and ChatGPT are useful tools that companies can leverage to improve the productivity of their developers. Practitioners should also be aware that the LLM-based assistants substituted traditional browser-based tools like Q&A forums and online documentation. It could therefore become necessary to adapt how programming knowledge is stored and accessed in companies, with the goal of offering developers a way to access knowledge that is as convenient as an LLM-based assistant. Some practitioners, such as Stack Overflow, are already exploring this approach: the Q&A-based developer forum recently introduced OverflowAI, which lets users extract knowledge from the forum via a conversational interface.
Differences between autocomplete and conversational assistants
While the overall impact of the two AI assistants is similar, the way in which this impact is achieved varies. A closer look at the measurements of the communication dimension provides a more nuanced understanding of how the interactions with the browser, the autocomplete assistant, and the conversational assistant differed.
💡 In the control environment (IntelliSense), finding code snippets is slow and tedious: When using the control environment, the participants only had access to the standard IntelliSense suggestions and the browser. As a consequence, they sourced four code snippets per session from the browser, with an average size of 31.2 characters. In comparison to the AI-based tools, the browser was used the least frequently, and the snippets taken from it had the smallest average size.
💡 The autocomplete environment (GitHub Copilot) improves productivity by frequently providing simple suggestions: The median number of accepted code snippets per session increased to 14.9, with an average snippet size of 41.8 characters. The most noticeable difference between the browser and the autocomplete assistant is that the participants utilized the autocomplete assistant much more frequently. The participants’ feedback suggests that this stems from the proactive nature and seamless integration of the autocomplete assistant. The developers characterized the assistant as providing an “integrated experience”, which makes it “easy-to-use” and “super fast”. In comparison to the browser, the average size of the snippets also increased, although the general scope of the recommendations remained comparable to the browser suggestions.
💡 The conversational environment (Integrated ChatGPT) improves productivity by providing higher-level suggestions: The median number of accepted code snippets per session was 5.5, with an average snippet size of 110.9 characters. This usage frequency aligns more closely with the browser than with the autocomplete assistant. Participants noted that the less integrated chatbot slows down communication, as it occurs in an “additional prompt window instead of […] directly in the code”. Surprisingly, the average snippet size increased to approximately 3.6 times the size of a browser recommendation and 2.7 times the size of an autocomplete suggestion. Rather than working on single lines of code, most participants exchanged whole functions with the assistant. Three participants even asked the assistant to modify the entire Python script. As a result, the chatbot rarely recommended single lines of code but more frequently created whole code blocks, functions, or scripts for the participants. The participants also explicitly stated that “more complex tasks could be solved with the chatbot AI” since it “knew how to adapt a whole function”. Additionally, participants agreed that the conversational environment helped them understand the programming tasks better. This assessment is consistent with the previous findings, as it shows that the participants experienced a high level of understanding when working with the conversational assistant, which likely encouraged them to engage in more advanced conversations and give more complex instructions.
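The two communication metrics used throughout this comparison, accepted snippets per session and average snippet size, can be computed along the following lines; the nested-list log format is an illustrative assumption:

```python
from statistics import mean, median

# sessions[i] holds the code snippets one participant accepted in a session.
sessions = [
    ["x = load_data()", "df = df.dropna()"],
    ["def total(values):\n    return sum(values)"],
]

snippets_per_session = [len(snippets) for snippets in sessions]
snippet_sizes = [len(snippet) for snippets in sessions for snippet in snippets]

print("Median accepted snippets per session:", median(snippets_per_session))
print("Average snippet size (characters):", round(mean(snippet_sizes), 1))
```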
The presented results highlight the unique characteristics of both types of AI assistants, which can be used to improve these assistants further. When implementing an autocomplete assistant, future researchers should focus on its proactiveness, seamlessness, and generation speed, since those are the qualities that led to a positive impact on the developers in this study. When implementing conversational assistants, the characteristics of flexibility, reiteration through conversation, and handling abstract communication with larger contexts should be prioritized. Practitioners should also pay attention to these characteristics when creating AI assistants. GitHub, for example, is already steering the future of its AI programming assistant in a direction that aligns with the findings of this study by introducing GitHub Copilot X, a new version of GitHub Copilot that combines an autocomplete and a conversational interface into a hybrid assistant. As participants of this study pointed out, having access to a hybrid assistant could lead to more holistic support, since developers can choose which type of interface supports them better based on their current context.
How should AI programming assistants be used?
The findings in this study provide a detailed understanding of current LLM-based programming assistants and how they are used by developers.
On the positive side, the LLM-based programming assistants improved the overall productivity of the developers by speeding up the programming process while simultaneously increasing their understanding, satisfaction, and perceived support. Thus, AI-based assistants are powerful tools that developers, teams, and companies should leverage to increase the efficiency of their software development processes.
Nevertheless, practitioners need to have realistic expectations when employing these assistants. The developers in this study found the assistants to be small-scoped, as the context size they can handle is small compared to the code bases of real-world projects. Furthermore, the assistants were characterized as heavily dependent on human developers: they require exact guidance from a developer who is proficient in the programming language. The prompts need to be crafted precisely, the recommendations require thorough reviews, and the suggested code snippets need to be leveraged in the right way to create a sophisticated application that satisfies the task description. The software is still conceptualized by the developers, who recognize that AI assistants are merely tools and treat them as such by reviewing their suggestions critically.
Thus, current AI programming assistants are efficient but small-scoped and human-dependent tools. They support developers in the software development process but require an experienced developer to be useful in real-world projects.