Source code in software systems has been shown to have a high
degree of repetitiveness at the lexical, syntactic, and API usage
levels. This paper presents a large-scale study on the repetitiveness,
containment, and composability of source code at the semantic level.
We collected a large dataset consisting of 9,224 Java projects with
2.79M class files and 17.54M methods, totaling 187M SLOC. For each
method in a project, we build the program dependency graph (PDG)
to represent a routine, and compare PDGs with one another as well
as the subgraphs within them. We found that within a project, 12.1%
of the routines are repeated, and most of them repeat 2–7 times.
Taken as a whole, the routines are quite project-specific: only 3.3% of
them repeat exactly in 1–4 other projects, and at most 8 times.
We also found that 26.1% and 7.27% of the routines are contained
in other routines, i.e., implemented as part of another routine elsewhere
within the same project and in other projects, respectively. Except for
trivial routines, their repetitiveness and containment are independent
of their complexity. Defining a subroutine as a per-variable slicing
subgraph of a PDG, we found that 14.3% of all routines have all
of their subroutines repeated. A high percentage of subroutines in
a routine can be found/reused elsewhere. We collected 8,764,971
unique subroutines (with 323,564 unique JDK subroutines) as basic
units for code search and synthesis. We also discuss the practical
implications of our findings for automated tools.
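
To make the notions of repetition, containment, and slicing concrete, the following is a minimal sketch in Python, not the study's implementation. It assumes a toy model in which a routine is a small labeled dependence graph, a subroutine is approximated by a backward slice from one node, and two graphs are treated as equal when their label-based fingerprints match. All names (make_routine, backward_slice, fingerprint, repeated_fraction) and the example routines are hypothetical, and the fingerprint is a simplification that can confuse non-isomorphic graphs.

```python
from collections import Counter


def make_routine(labels, edges):
    """Toy routine: node id -> label, plus edges (src, dst) meaning dst depends on src."""
    return {"labels": dict(labels), "edges": set(edges)}


def backward_slice(routine, start):
    """Return the ids of all nodes that `start` transitively depends on, including itself."""
    preds = {}
    for src, dst in routine["edges"]:
        preds.setdefault(dst, set()).add(src)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(preds.get(node, ()))
    return seen


def fingerprint(routine, nodes=None):
    """Label-based fingerprint of the subgraph induced by `nodes` (whole graph if None).

    Assumption: comparing sorted node labels and labeled edges stands in for a
    real graph comparison; it ignores node ids but is not an isomorphism test.
    """
    labels = routine["labels"]
    if nodes is None:
        nodes = set(labels)
    node_part = tuple(sorted(labels[n] for n in nodes))
    edge_part = tuple(sorted(
        (labels[s], labels[d])
        for s, d in routine["edges"]
        if s in nodes and d in nodes
    ))
    return (node_part, edge_part)


def repeated_fraction(routines):
    """Fraction of routines whose whole-graph fingerprint occurs more than once."""
    counts = Counter(fingerprint(r) for r in routines)
    repeated = sum(1 for r in routines if counts[fingerprint(r)] > 1)
    return repeated / len(routines)


# Hypothetical routines: r1 and r2 are the same routine with different node ids.
r1 = make_routine({1: "param", 2: "call Math.abs", 3: "return"}, [(1, 2), (2, 3)])
r2 = make_routine({10: "param", 11: "call Math.abs", 12: "return"}, [(10, 11), (11, 12)])
r3 = make_routine({1: "param", 2: "call String.length", 3: "return"}, [(1, 2), (2, 3)])
# r4 contains r1 as a part, plus an unrelated logging call.
r4 = make_routine(
    {1: "param", 2: "call Math.abs", 3: "return", 4: "call log"},
    [(1, 2), (2, 3), (1, 4)],
)

print(repeated_fraction([r1, r2, r3]))
print(fingerprint(r4, backward_slice(r4, 3)) == fingerprint(r1))
```

In this toy setting, r1 and r2 count as repeats of each other, while the slice of r4 taken from its return node matches r1, which illustrates a routine being contained in a larger one.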