The rise of AI tools has generated sensational headlines, with publications asking whether coders' jobs are at risk or declaring that AI is replacing junior developers. This overhyped narrative, however, misses the nuances of how AI is actually applied, often oversimplifying to the point of being wrong. It falls to engineers and engineering leaders to translate the real limitations and possibilities of these tools for their business counterparts.
One of the greatest challenges is measuring impact. The shallow metrics often cited in headlines, such as lines of code (LOC) generated or suggestion acceptance rates, are widely considered poor measures of productivity and business impact. Acceptance rate (the percentage of suggestions a developer accepts) alone doesn't indicate whether the generated code increased velocity, saved time, or helped innovation; it only signals that the tool is minimally "fit for purpose." And now that AI makes producing source code trivially easy, excess source code is better viewed as a liability than as a success metric.
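To make the limitation concrete, here is a minimal sketch (all team names and numbers are hypothetical, invented for illustration) showing why acceptance rate alone cannot distinguish outcomes:

```python
# Illustrative only: acceptance rate is the ratio of accepted to shown
# suggestions; it says nothing about downstream velocity or time saved.
# Every number below is hypothetical.

def acceptance_rate(accepted: int, shown: int) -> float:
    """Fraction of AI suggestions the developer accepted."""
    return accepted / shown if shown else 0.0

# Two hypothetical teams with identical acceptance rates...
team_a = acceptance_rate(accepted=300, shown=1000)  # 0.30
team_b = acceptance_rate(accepted=30, shown=100)    # 0.30

# ...can still differ sharply in downstream impact (e.g. hours saved
# per developer per week, from hypothetical survey data), which the
# shallow metric cannot see.
hours_saved = {"team_a": 6.0, "team_b": 0.5}

print(team_a == team_b)  # the shallow metric looks identical
print(hours_saved)       # the impact metric tells a different story
```

The point of the sketch: two teams that look the same on acceptance rate can diverge completely on the metrics that actually matter.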
Field research reveals that the greatest time savings from AI do not come from mid-loop code generation (code generated while the developer is typing), which is only the third most effective use case. The largest savings come from analyzing tricky stack traces and refactoring existing code. The technical reason is that in these cases AI eliminates the toil of parsing large error outputs and spelunking through the codebase, yielding a true net time saving. Code generation, by contrast, only reallocates time, shifting the workload from rapid typing to reviewing and iterating on the generated code.
Paradoxically, even though AI saves time, some studies, such as DORA's, found that many developers felt less satisfied. This happens because AI accelerates the very part of the job they enjoy most (authoring code), leaving them more time for less enjoyable tasks such as meetings and administrative toil. And given that AWS engineers spend only about 20% of their time coding on average, a 10% saving on that coding time does not translate into a massive increase in new product output, but rather into time reallocation.
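The back-of-envelope arithmetic behind that last claim, using only the two figures cited above:

```python
# 20% of work time is spent coding; AI saves 10% of that coding time.
coding_share = 0.20          # fraction of total work time spent coding
ai_saving_on_coding = 0.10   # fraction of coding time saved by AI

# The overall saving is the product of the two fractions.
overall_saving = coding_share * ai_saving_on_coding
print(f"{overall_saving:.0%} of total work time")  # 2% of total work time
```

Two percent of a work week is roughly 45 minutes, which explains why the saving shows up as reallocated time rather than as a visible jump in output.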
To measure impact effectively, organizations must use a structured approach. The DX AI Measurement Framework recommends assessing three main areas: Utilization (number of daily/weekly active users), Impact (tangible results like revenue or developer experience), and Cost. It is crucial for leaders to establish a baseline before AI implementation, using existing productivity frameworks. Furthermore, combining workflow metrics (systems) with self-reported metrics (developer experience) is essential, as not all AI impacts (like reduced cognitive load or improved interruption management) are observable solely through system data.
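The three dimensions above can be sketched as a simple data model. This is not part of the DX framework itself; the field names and example values are illustrative assumptions showing how workflow (system) and self-reported (survey) signals might sit side by side against a pre-rollout baseline:

```python
from dataclasses import dataclass

@dataclass
class AIMeasurement:
    # Utilization: who is actually using the tools
    weekly_active_users: int
    total_developers: int
    # Impact: combine workflow (system) and self-reported (survey) signals,
    # since effects like reduced cognitive load only surface in surveys
    pr_cycle_time_hours: float     # workflow metric, from systems
    self_reported_dx_score: float  # survey metric, e.g. 0-100
    # Cost
    monthly_spend_usd: float

    @property
    def utilization_rate(self) -> float:
        return self.weekly_active_users / self.total_developers

# Hypothetical numbers: a baseline captured before rollout, then a snapshot.
baseline = AIMeasurement(0, 200, 40.0, 62.0, 0.0)
current = AIMeasurement(150, 200, 34.0, 69.0, 8000.0)

print(f"utilization: {current.utilization_rate:.0%}")
delta = current.pr_cycle_time_hours - baseline.pr_cycle_time_hours
print(f"cycle time delta: {delta:+.1f}h")
```

The baseline object is the key design choice: without a pre-rollout snapshot, there is nothing to compute deltas against.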
Companies like Workhuman demonstrated the value of this approach, observing an 11% boost in overall Developer Experience (DX) after AI implementation; developers who used AI tools daily or weekly showed 15% higher velocity. The pattern suggests that AI is not a "magic bullet" that solves everything, but a tool that improves DX, which in turn leads to better organizational outcomes.
AI adoption is also driving architectural improvements. Leaders are recommitting to clean interfaces between services, which benefits both AI models (which can ingest the codebase more easily) and human developers. There is also a growing emphasis on "AI-first" documentation, which must lead with code examples rather than visual material, so that AI tools can provide accurate suggestions directly within the IDE.
Finally, how companies roll out AI is critical. Highly regulated industries (financial services, pharmaceuticals) are seeing the best results, precisely because they are forced into deliberate, structured rollouts. Companies like Indeed adopted an experimental mindset, running comparative trials of different tools and validating hypotheses, such as using AI in code reviews to reduce latency and close the feedback loop. Looking ahead, engineering leaders will face the challenge of budgeting for consumption-based and agentic models, with predictions that per-developer costs may return to historical highs of $1,200 to $2,000 USD per year or more as the industry stabilizes and consolidates.
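For budgeting purposes, the cited per-developer range translates directly into an annual envelope. A trivial sketch (the team size is a made-up input; only the $1,200–$2,000 range comes from the text above):

```python
# Rough AI-tooling budget using the per-developer range cited above.
team_size = 250               # hypothetical headcount
low, high = 1_200, 2_000      # USD per developer per year

print(f"annual budget: ${team_size * low:,} - ${team_size * high:,}")
# annual budget: $300,000 - $500,000
```

With consumption-based pricing the spend also becomes variable rather than fixed, which is what makes this a forecasting problem rather than a simple license count.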