Temporal correlation in Git repositories
Join the DZone community and get the full member experience.
Join For FreeMichael Feathers presented his recurring idea of finding out which elements of design change together: his goal is to discover which classes or methods are really coupled by analyzing empirical data instead of static analysis. Since he didn't publish code, I'm trying to replicate the analysis with a language-agnostic script.
A different approach to discover dependencies
The data he mines for this information are commits in the source code repository; this approach is different from the static analysis one. In the latter scenario, the analysis consist in using reflection to annotate which class extends or implement another; two elements result coupled or not coupled just as a function of their declarations, without taking into consideration how frequently they are changed or when a change into one ripples into the other.
In my opinion, another problem with static analysis is that in very dynamic languages there are not even relationships that can be analyzed. In PHP we still have explicit interfaces, but in Ruby or JavaScript there are no such things.
How it works
The analysis of Michael Feathers was conducted at the method level. The basic idea is to look at the commits as a source of data about changes in your codebase, more than to the code itself.
The assumption, which should be correct in most workflows, is that a commit correlates in an atomic operation many classes and methods that constitute a new functionality or a refactoring. Therefore, each commit that you find where class A and B are modified together adds a score to the relationship between them.
In his example, he worked with classes and methods. I put together a quicker example based on files, since in our standard one class always corresponds to a single file.
If we are able to gather which files are always modified together, we can discover some hidden coupling on an objective basis. For example, I would be interested in code that changes together between layers, or between distant part of the codebase. Feathers himself writes about informative data on the coupling of domain objects.
My implementation counts
The basic item considered by my code is a couple of files (read classes), represted as two file names. The focus on file also makes this analysis language-agnostic.
Each commit indicates that the files A, B, C, D... were modified together. So for each commit I add a certain score to the couples A,B; A,C; A,D; B,C; B,D; C,D. I do the analysis over at least a hundred (200) commits for statistical significance. It takes long.
The score for each commit is normalized to the number of files in the commit:
- with just two files, the pair gets an addition to the score of 0.5.
- with N files, each possible pair gets an addition of 1/N. e.g. N = 100 results in each pair being added 0.01. Thus, large commits where you just modify a copyright notice or a coding standard throughout hundreds of files do not count much (should they?).
An example
I run the code over the master branch of Doctrine 2 on Github (24042863acbabdcd0fa1432135a9836467f3bce7 at the time of this writing). These were the results (limited to the first 10 pairs).
array(10) { ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/Query/SqlWalker.php"]=> float(31.318715901959) ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php"]=> float(29.233642885566) ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php"]=> float(27.059301324524) ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Query/Parser.php"]=> float(25.077708602949) ["lib/vendor/doctrine-dbal|lib/Doctrine/ORM/Query/SqlWalker.php"]=> float(22.883601306557) ["lib/Doctrine/ORM/UnitOfWork.php|lib/Doctrine/ORM/Query/SqlWalker.php"]=> float(22.658756168829) ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataFactory.php"]=> float(22.625098398742) ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/Query/Parser.php"]=> float(21.878131735017) ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Persisters/BasicEntityPersister.php"]=> float(21.422968340511) ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/UnitOfWork.php"]=> float(20.691940478915) }
The 1st, 3rd, 8th and 10th instances are example of coupling between test and production code, which is to be expected. Unless the test is really unrelated from the production code, I won't say this points an issue. SelectSqlGenerationTest is a very important functional test which verifies that operations involving SELECT queries are executed correclyt: probably new test cases lead to production code being add.
If we delete the test-related pairs, we get:
["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php"]=> float(29.233642885566) ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Query/Parser.php"]=> float(25.077708602949) ["lib/vendor/doctrine-dbal|lib/Doctrine/ORM/Query/SqlWalker.php"]=> float(22.883601306557) ["lib/Doctrine/ORM/UnitOfWork.php|lib/Doctrine/ORM/Query/SqlWalker.php"]=> float(22.658756168829) ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataFactory.php"]=> float(22.625098398742) ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Persisters/BasicEntityPersister.php"]=> float(21.422968340511)
Query/SqlWalker is a class which is always present in these relationships with other classes. Its responsibility is a big one: transforming a parsed tree of DQL (language for querying objects) into SQL queries.
In fact, a quick test on classes length returns:
[09:21:25][giorgio@Desmond:~/code/doctrine2]$ wc `find . -name '*.php'` | sort -n | tail -n 5 1942 6826 65254 ./lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php 2062 6731 80818 ./lib/Doctrine/ORM/Query/SqlWalker.php 2410 8308 95066 ./lib/Doctrine/ORM/UnitOfWork.php 3004 7535 101628 ./lib/Doctrine/ORM/Query/Parser.php 76445 223518 2587575 total
Thus SqlWalker is one of the biggest classes in the codebase, and one that changes very often with others: if the goal is to get the best bang for my buck, it would be my starting point for refactoring. But now I have to go back to my company's code to find out where the pain points are. :)
I forgot: here's my quickly hacked up code for getting these statistics.
<?php class IncidenceMatrix { private $matrix = array(); public function addHit($element, $otherElement, $score) { if ($element > $otherElement) { $first = $element; $second = $otherElement; } else { $first = $otherElement; $second = $element; } if ($first == '' or $second == '') { return; } $key = $first . '|' . $second; if (!isset($this->hash[$key])) { $this->hash[$key] = 0; } $this->hash[$key] += $score; } public function getTopHits($howMany) { $hash = $this->hash; arsort($hash); return array_slice($hash, 0, $howMany); } } $commitListCommand = 'git log --oneline | head -n 200'; exec($commitListCommand, $logOfCommits); $incidenceMatrix = new IncidenceMatrix(); foreach ($logOfCommits as $commitLog) { list ($commit, ) = explode(' ', $commitLog); $filesListCommand = "git show --pretty='format:' --name-only $commit"; exec($filesListCommand, $fileList); $commitScore = 1 / count($fileList); for ($i = 0; $i < count($fileList); $i++) { for ($j = $i + 1; $j < count($fileList); $j++) { $incidenceMatrix->addHit($fileList[$i], $fileList[$j], $commitScore); } } } $topHits = $incidenceMatrix->getTopHits(10); var_dump($topHits);
Opinions expressed by DZone contributors are their own.
Comments