At Microsoft Research, there are computer scientists and
mathematicians who live in a world of theory and abstractions. Then
there is Nachi Nagappan, who was on loan to the Windows development
group for a year while building a triage system for software bugs. For
Nagappan, a senior researcher at Microsoft Research Redmond with the
Empirical Software Engineering and Measurement Research Group (ESM),
the ability to observe software-development processes firsthand is
critical to his work.
The ESM group studies large-scale software development and takes
an empirical approach. When Nagappan gets involved in hands-on
projects with Microsoft development teams, it's all part of ongoing
research in his quest to validate conventional software-engineering
wisdom.
"A big part of my learning curve when I joined Microsoft in 2005,"
Nagappan says, "was getting familiar with how Microsoft does
development. I found then that many of the beliefs I had in
university about software engineering were actually not that true
in real life."
That discovery led Nagappan to examine more closely the
observations made by Frederick Brooks in The Mythical Man-Month,
his influential book about software engineering and
project management.
"To some degree, The Mythical Man-Month formed the
foundation of a lot of the work we did," Nagappan says. "But we also
studied other existing assumptions in software engineering. They
can be good or bad, because people make decisions based on these
assumptions. Our primary goal was to substantiate some of these
beliefs in a Microsoft context, so managers can make decisions
based on data rather than gut feel or subjective experience."
More Isn't Always Better
One assumption Nagappan examined was the relationship between
code coverage and software quality. Code coverage measures how
comprehensively a piece of code has been tested; if a program
contains 100 lines of code and the quality-assurance process tests
95 lines, the effective code coverage is 95 percent. The ESM group
measures quality in terms of post-release fixes, because those bugs
are what reach the customer and are expensive to fix. The logical
assumption would be that more code coverage results in
higher-quality code. But what Nagappan and his colleagues saw was
that, contrary to what is taught in academia, higher code coverage
was not the best measure of post-release failures in the field.
"Furthermore," Nagappan says, "when we shared the findings with
developers, no one seemed surprised. They said that the development
community has known this for years."
The reason is that software quality depends on so many other
factors and dynamics that no one metric can predict quality, and not
all metrics apply to all projects. Nagappan points out two of the
most obvious reasons why code coverage alone fails to predict error
rates: usage and complexity.
Code coverage is not indicative of usage. If 99 percent of the
code has been tested, but the 1 percent that did not get tested is
what customers use the most, then there is a clear mismatch between
usage and testing. There is also the issue of code complexity; the
more complex the code, the harder it is to test. So it becomes a
matter of leveraging complexity versus testing. After looking
closely at code coverage, Nagappan now can state that, given
constraints, it is more beneficial to achieve higher code coverage
of more complex code than to test less complex code at an
equivalent level. Those are the kinds of tradeoffs that development
managers need to keep in mind.
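The mismatch between coverage and usage can be made concrete with a small sketch. The module names, line counts, and usage shares below are hypothetical numbers chosen to mirror the 99-percent example above; they are not data from the ESM study, and weighting coverage by usage is only one illustrative way to express the idea.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    lines: int          # total lines of code
    tested_lines: int   # lines exercised by the tests
    usage_share: float  # hypothetical fraction of customer usage


def line_coverage(modules):
    """Plain line coverage: tested lines divided by total lines."""
    total = sum(m.lines for m in modules)
    tested = sum(m.tested_lines for m in modules)
    return tested / total


def usage_weighted_coverage(modules):
    """Each module's coverage, weighted by how much customers use it."""
    return sum(m.usage_share * m.tested_lines / m.lines for m in modules)


# Hypothetical product: 99 percent of the lines are tested, but the
# untested 1 percent is where customers spend most of their time.
modules = [
    Module("rarely_used", lines=9900, tested_lines=9900, usage_share=0.1),
    Module("hot_path", lines=100, tested_lines=0, usage_share=0.9),
]

print(f"line coverage:           {line_coverage(modules):.0%}")
print(f"usage-weighted coverage: {usage_weighted_coverage(modules):.0%}")
```

With these made-up figures, plain line coverage comes out at 99 percent while the usage-weighted figure is only 10 percent, which is the kind of gap the paragraph above describes.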
Write Test Code First
Nagappan and his colleagues then examined development factors
that impact quality, another area of software engineering discussed
in The Mythical Man-Month. One of the recent trends that
caught their interest was development practices; specifically,
test-driven development (TDD) versus normal development. In TDD,
programmers first write the test code, then the actual source code,
which should pass the test. This is the opposite of conventional
development, in which writing the source code comes first, followed
by writing unit tests. Although TDD adherents claim the practice
produces better design and higher-quality code, no one had carried
out an empirical substantiation of this at Microsoft.
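For readers unfamiliar with the practice, here is a minimal sketch of the test-first order using Python's standard unittest module. The slugify function and its tests are hypothetical examples, not code from any of the teams studied.

```python
import unittest


# Step 1: write the tests first. Run at this point, they fail because
# slugify() does not exist yet.
class SlugifyTest(unittest.TestCase):
    def test_lowercases_and_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Write Test Code First"),
                         "write-test-code-first")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  more   is not  better "),
                         "more-is-not-better")


# Step 2: write just enough source code to make the tests pass.
def slugify(title):
    return "-".join(title.lower().split())


# Step 3: run the tests again; they should now pass.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In conventional development the two steps are simply reversed: slugify would be written first and the test class added afterward.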
"The nice thing about working at Microsoft," Nagappan says, "is
that the development organization is large enough that we could
select teams that allowed for an apples-to-apples comparison. We
picked three development projects under the same senior manager and
looked at teams that used TDD and those that didn't. We collected
data from teams working on Visual Studio, Windows, and MSN and also
got data from a team at IBM, since the project was a joint
study."
The study and its results were published in a paper entitled
"Realizing quality improvement through test driven development:
results and experiences of four industrial teams," by
Nagappan and research colleagues E. Michael Maximilien of the IBM
Almaden Research Center; Thirumalesh Bhat, principal
software-development lead at Microsoft; and Laurie Williams of
North Carolina State University. What the research team found was
that the TDD teams produced code that was 60 to 90 percent better
in terms of defect density than non-TDD teams. They also discovered
that TDD teams took longer to complete their projects: 15 to 35
percent longer.
"Over a development cycle of 12 months, 35 percent is another
four months, which is huge," Nagappan says. "However, the tradeoff is
that you reduce post-release maintenance costs significantly, since
code quality is so much better. Again, these are decisions that
managers have to make: where should they take the hit? But now, they
actually have quantified data for making those decisions."
Proving the Utility of Assertions
Another development practice that came under scrutiny was the
use of assertions. In software development, assertions are
contracts or ingredients in code, often written as annotations in
the source-code text, describing what the system should do rather
than how to do it.
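As an illustration only, the sketch below uses Python's built-in assert statement to state what a function must guarantee, separately from how it computes its result. The withdraw function is a hypothetical example; the components examined in the study are not shown in this article.

```python
def withdraw(balance_cents, amount_cents):
    """Return the balance left after withdrawing amount_cents."""
    # Preconditions: what every caller must guarantee.
    assert amount_cents > 0, "amount must be positive"
    assert amount_cents <= balance_cents, "cannot overdraw the account"

    new_balance = balance_cents - amount_cents

    # Postcondition: what the function guarantees, stated independently
    # of how the result was computed.
    assert new_balance >= 0 and new_balance + amount_cents == balance_cents
    return new_balance
```

Calling withdraw(100, 30) returns 70, while withdraw(100, 200) stops immediately with an AssertionError instead of silently producing a negative balance. Note that Python skips assert statements when run with the -O flag.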
"One of our very senior researchers is Turing Award winner Tony
Hoare of Microsoft Research Cambridge in the U.K.," Nagappan says. "Tony
has always promoted the utility of assertions in software. But
nobody had done the work to quantify just how much assertions
improved software quality."
One reason why assertions have been difficult to investigate is
a lack of access to large commercial programs and bug databases.
Also, many large commercial applications contain significant
amounts of legacy code in which there is minimal use of assertions.
All of this contributes to lack of conclusive analysis.
At Microsoft, however, there is systematic use of assertions in
some Microsoft components, as well as synchronization between the
bug-tracking system and source-code versions; this made it
relatively easy to link faults against lines of code and
source-code files. The research team managed to find
assertion-dense code in which assertions had been used in a uniform
manner; they collected the assertion data and correlated assertion
density to code defects. The results are presented in the technical
paper
"Assessing the Relationship between Software Assertions and Code
Quality: An Empirical Investigation," by Gunnar Kudrjavets,
a senior development lead at Microsoft, along with Nagappan and Tom
Ball.
The team observed a definite negative correlation: more
assertions and code verifications mean fewer bugs. Looking behind
the straight statistical evidence, they also found a contextual
variable: experience. Software engineers who were able to make
productive use of assertions in their code base tended to be
well-trained and experienced, a factor that contributed to the end
results. These factors built an empirical body of knowledge that
proved the utility of assertions.
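To give a feel for what measuring assertion density might involve, the sketch below counts assert statements per thousand non-blank lines of a Python source string. The per-thousand-lines definition and the use of Python's ast module are assumptions made for illustration; the article does not describe the tooling the researchers used.

```python
import ast


def assertion_density(source):
    """Return assert statements per 1,000 non-blank source lines."""
    tree = ast.parse(source)
    asserts = sum(isinstance(node, ast.Assert) for node in ast.walk(tree))
    loc = sum(1 for line in source.splitlines() if line.strip())
    return 1000 * asserts / loc if loc else 0.0


# Hypothetical file contents: six non-blank lines, two assertions.
sample = '''
def mean(values):
    assert len(values) > 0
    total = sum(values)
    result = total / len(values)
    assert min(values) <= result <= max(values)
    return result
'''

print(f"{assertion_density(sample):.1f} assertions per KLOC")
```

A density computed this way for each file could then be compared against the number of faults linked to that file in the bug-tracking system, which is the kind of correlation the study reports.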
The work also brings up another issue: What kind of action
should development managers take based on these findings? The
research team believes that enforcing the use of assertions would
not work well; rather, there needs to be a culture of using
assertions in order to produce the desired results. Nagappan and
his colleagues feel there is an urgent need to promote the use of
assertions and plan to collaborate with academics to teach this
practice in the classroom. Having the data makes this easier.
Has there been any feedback from Hoare?
"Absolutely," Nagappan says. "He followed up and read our work on
assertions and was very happy that someone was proving the
relationship between assertions and software quality."
Organizational Structure Does Matter, a Lot.
Nagappan recognized that although metrics such as code churn,
code complexity, code dependencies, and other code-related factors
have an impact on software quality, his team had yet to investigate
the people factor. The Mythical Man-Month is most famous for
describing how communication overhead increases with the number of
programmers on a project, but it also cites Conway's Law,
paraphrased as, "If there are N product groups, the result will be a
software system that to a large degree contains N versions or N
components." In other words, the system will resemble the
organization building the system.
The first challenge was to somehow describe the relationships
between members of a development group. The team settled on using
organizational structure, taking the entire tree structure of the
Windows group as an example. They took into account reporting
structure but also degrees of separation between engineers working
on the same project, the level to which ownership of a code base
rolled up, the number of groups contributing to the project, and
other metrics developed for this study.
"The Influence of Organizational Structure on Software Quality: An
Empirical Case Study," by Nagappan, Brendan
Murphy of Microsoft Research Cambridge, and Victor R. Basili of
the University of Maryland, presents startling results:
Organizational metrics, which are not related to the code, can
predict software failure-proneness with a precision and recall of
85 percent. This is a significantly higher precision than
traditional metrics such as churn, complexity, or coverage that
have been used until now to predict failure-proneness. This was
probably the most surprising outcome of all the studies.
"That took us by surprise," Nagappan says. "We didn't expect it to
be that accurate. We thought it would be comparable to other
code-based metrics, but these factors were at least 8 percent
better in terms of precision and recall than the closest factor we
got from the code measures. Eight percent, on a code base the size
of Windows, is a huge difference."
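Precision and recall have standard definitions: precision is the fraction of components flagged as failure-prone that really were, and recall is the fraction of truly failure-prone components that were flagged. The sketch below shows the calculation; the component names are hypothetical and the resulting figures have nothing to do with the study's results.

```python
def precision_recall(predicted, actual):
    """Compare predicted failure-prone components with the real ones.

    predicted: set of components the model flagged as failure-prone
    actual:    set of components that really had post-release failures
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical component names, for illustration only.
predicted = {"a.dll", "b.dll", "c.dll", "d.dll"}
actual = {"a.dll", "b.dll", "c.dll", "e.dll", "f.dll"}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four flagged components really failed (precision 0.75), and three of the five failing components were flagged (recall 0.60).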
Geographical Distance Doesn't Matter Much.
One of the most cherished beliefs in software project management
is that a distributed-development model has a negative impact on
software quality because of problems with communication,
coordination, culture, and other factors. Again, this meant looking
at organizational metrics.
"The fact is," Nagappan says, "that no one has really studied a
large project. Most studies were either based on assumptions or on
an outsourced model where a piece of the project is handled
outside. But at Microsoft, we don't outsource product development.
Our global development teams are Microsoft employees: the same
management structure, access to the same resources."
But first of all, how do you define "distributed"? The research
team took the corporate address book and came up with six degrees
of distribution:
- In the same building.
- In different buildings but sharing a cafeteria.
- On the same campus, within walking distance.
- In the same region, within easy driving distance.
- In the same time zone.
- More than three time zones away.
Next, they classified all developers in the Windows team into
these buckets. Then they looked for statistical evidence that
components developed by distributed teams resulted in software with
more errors than components developed by collocated teams.
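The six degrees of distribution listed above can be sketched as a simple classification. Everything in the code below is an assumption made for illustration: the Location fields, the idea of comparing two locations, and the handling of pairs that are in different time zones but fewer than three zones apart, which the list above does not cover.

```python
from dataclasses import dataclass
from enum import IntEnum


class Distribution(IntEnum):
    SAME_BUILDING = 1
    SAME_CAFETERIA = 2
    SAME_CAMPUS = 3
    SAME_REGION = 4
    SAME_TIME_ZONE = 5
    DISTANT = 6


@dataclass(frozen=True)
class Location:
    # Hypothetical fields; the article says only that the data came
    # from the corporate address book.
    building: str
    cafeteria: str
    campus: str
    region: str
    utc_offset_hours: float


def classify(a, b):
    """Return the closest bucket that both locations share."""
    if a.building == b.building:
        return Distribution.SAME_BUILDING
    if a.cafeteria == b.cafeteria:
        return Distribution.SAME_CAFETERIA
    if a.campus == b.campus:
        return Distribution.SAME_CAMPUS
    if a.region == b.region:
        return Distribution.SAME_REGION
    if a.utc_offset_hours == b.utc_offset_hours:
        return Distribution.SAME_TIME_ZONE
    # Assumption: every remaining pair goes in the last bucket, even
    # when the two sites are fewer than three time zones apart.
    return Distribution.DISTANT


desk_a = Location("B1", "cafe-north", "main-campus", "region-west", -8.0)
desk_b = Location("B2", "cafe-north", "main-campus", "region-west", -8.0)
desk_c = Location("H3", "cafe-h", "remote-campus", "region-east", 5.5)

print(classify(desk_a, desk_b).name)
print(classify(desk_a, desk_c).name)
```

The checks run from the closest bucket outward, so each pair lands in the first bucket it qualifies for.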
"Does distributed development affect software quality? An empirical
case study of Windows Vista," by Christian Bird, University of
California, Davis; Nagappan; Premkumar Devanbu, University of
California, Davis; Harald Gall, University of Zurich; and
Murphy, found that the differences were statistically negligible. In
order to verify results, the team also conducted an anonymous
survey with researchers Sriram Rajamani and Ganesan Ramalingam in
Microsoft Research India, asking engineers who they would talk to if
they
ran into problems. Most people preferred to talk to someone from
their own organization 4,000 miles away rather than someone only
five doors down the hall but from a different organization.
Organizational cohesiveness played a bigger role than geographical
distance.
Putting Research to Work in the Real World
Many of the findings from these papers have been put to use by
Microsoft product teams. Some of the tools that shipped with Visual
Studio 2005 and Visual Studio 2008 incorporate work from the ESM
group, and Microsoft's risk-analysis and bug-triage system for
Windows Vista SP2 made use of the team's technology for risk
estimation and analysis.
Nagappan believes there is value in further exploring
organizational metrics but adds that more real-world context needs
to be applied, because his studies have been confined to Microsoft
development groups. He would like to extend his work to other
commercial environments, distributed development teams, and
open-source environments.
But there is one point that gives this software-engineering myth
buster a great deal of satisfaction.
"I feel that we've closed the loop," Nagappan says. "It started with
Conway's Law, which Brooks cited in The Mythical Man-Month;
now, we can show that, yes, the design of the organization building
the software system is as crucial as the system itself."