Sunday, October 26, 2014

Levenshtein Distance detects differences in lines of code

After chatting with a colleague Michael Mah about the finer points of code counters a couple weeks ago, I've been studying ways to detect differences between 2 lines of code for my EZ-Metrix product.
The problem is not simply to detect differences of any kind, as this could be accomplished by a simple string compare (e.g., does string A = string B?). The challenge is in deciding if the extent of the differences between the lines constitutes, in effect, a completely new line of code. In other words, at what point does the accumulation of differences between two lines of code exceed what one might characterize as 'changes' to the point where the line is considered brand new?
This research has revealed a solution called the Levenshtein Distance (LD) algorithm (en.wikipedia.org/wiki/Edit_distance). In simple terms, the LD is essentially the minimum number of character edits (adds, modifies or deletes) that it takes to change string A into string B. If applied to a code counter, this LD value could be used along with a threshold value to decide if a line of code has been modified (LD value below a threshold), or is new (LD value above a threshold).
In EZ-Metrix, I plan to implement a % threshold, which gives the user a (configurable) way to set their own difference threshold, which is relative to each line's length. This way, short and long lines get treated the same, with respect to characterizing the line as either modified or new. An EZ-Metrix user may want to choose a threshold value of, for example 75%, which would count a line as new, if the number of changed characters exceeds 75% of the number of characters in the original line. Otherwise, the line would be counted as modified.
In the near future, a new version of EZ-Metrix (v5.0) will be released, which implements LD, among other improvements.
I'm hopefully this improvement will be embraced by the industry.