Zipf's Law and Software Engineering
Sunday, January 31, 2010
All well designed software applications are alike; each badly designed software application is badly designed in its own way.

For some time now I've been looking at Zipf's Law and wondering if it applies to computer programs written in any modern computer language. In other words, is Zipf's Law relevant when analyzing computer code? And if it is relevant, what does it say about the structure and the correctness of software?
Wentian Li summarizes Zipf's Law as "the observation that frequency of occurrence of some event (P), as a function of the rank (i) when the rank is determined by the above frequency of occurrence, is a power-law function P_i ~ 1/i^a with the exponent a close to unity (1)."
For the sake of argument, let P (a random variable) represent the frequency of occurrence of a keyword in a program listing.
On the surface, Zipf's Law is meaningless when talking about any program written in any contemporary programming language because every programming language has a limited number of keywords and some keywords are used more than others. For example, the keyword goto exists in most modern languages, but we shy away from its use, as we've been taught that gotos are evil--and they are, though they have their place. In all likelihood, this keyword has a low frequency of occurrence in most programs.
On the other hand, the keyword for is common in algorithms dealing with data, and likely to have a higher frequency of occurrence in source code. Therefore, I conclude, without empirical proof because it's an obvious finding, that any computer program written in any contemporary programming language has a power law distribution, i.e., some keywords are used more than others.
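If you did want the empirical check, a rough way to get it is to count keyword occurrences in a source file and print them by rank. Here's a minimal Java sketch; the file name SayHello.java and the short keyword list are only illustrative assumptions, and the tokenizer is deliberately crude:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordFrequency {

    // A handful of Java keywords to tally; illustrative, not exhaustive.
    private static final List<String> KEYWORDS = Arrays.asList(
            "for", "if", "else", "while", "return", "new", "class",
            "public", "private", "static", "void", "try", "catch");

    public static void main(String[] args) throws IOException {
        // Hypothetical input file; point it at any Java source you have.
        String source = new String(Files.readAllBytes(Paths.get("SayHello.java")));

        // Crude tokenizer: split on anything that isn't a letter, then count keywords.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : source.split("[^A-Za-z]+")) {
            if (KEYWORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        // Order by descending frequency and print rank i, keyword, and frequency P_i.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        for (int i = 0; i < ranked.size(); i++) {
            Map.Entry<String, Integer> entry = ranked.get(i);
            System.out.println((i + 1) + "\t" + entry.getKey() + "\t" + entry.getValue());
        }
    }
}

Plotting rank against frequency on a log-log scale would then show whether the counts fall roughly on a straight line, which is what Li's formulation predicts.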
However interesting the frequency distribution of computer language keywords is, it has little practical value. Of much more interest to software engineering is the context and combination of all the keywords--entire stories can be told in computer programs. For instance, we create entities that don't exist except in computer memory at run time; we create logic nodes that will never be tested because it's impossible to test every logic branch; we create information flows in quantities that are humanly impossible to analyze at a glance; in sum, we create chaos and order at the same time. The second law of thermodynamics is at play everywhere: the more combinations of keywords we add, the more the entropy of the state machine increases.
Because I'm arguing that what matters in software applications is the combination of keywords within the context of a solution, and not the quantity used in a program, I need to explain what context means. This is not a trivial task, because the context of an application is attached to the problem being solved, and every problem to be solved is different and must have a specific program to solve it. (Don't be confused by reusable code here; even if a framework is used to solve a problem, the solution will still be unique in its own way.)
A program can be syntactically correct, but that doesn't mean the algorithms it implements solve the problem at hand. What's more, a correct program can solve the wrong problem. Let's say we have the simple requirement of printing "Hello, World!" A syntactically correct solution in Java looks as follows:
public class SayHello {
    public static void main(String[] args) {
        System.out.println("Jose Sandoval!");
    }
}
This solution is obviously wrong because it doesn't solve the original requirement. This means that the context of the solution within the problem being solved needs to be determined to ensure its quality. In other words, we need to verify that the output matches the original requirement.
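For the record, a version that actually satisfies the requirement only needs the output string changed:

public class SayHello {
    public static void main(String[] args) {
        // Output now matches the stated requirement.
        System.out.println("Hello, World!");
    }
}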
You could argue that the scenario I've presented here doesn't happen in real life; however, you would be surprised as to how often it actually does happen.
In the past, I have coded a perfect solution to a non-existent problem, and the issue wasn't the requirement gathering phase. The issue was that once the application became an executable program, the problem I was originally solving wasn't a problem anymore--the requirements had evolved. It took another iteration to get the right solution.
This is a valuable lesson to learn, however; one that I wish every developer I come in contact with had learned the hard way (coding the correct solution to a non-existent problem). There's no finger pointing in the process; what's interesting to me is knowing what he or she does now to prevent the same mistake.
It's becoming clear that Zipf's Law, as I'm using it here to count keywords, has nothing to say about my application above. Zipf's Law can't even say much about larger systems I've written if all I'm doing is grouping program statements. Interesting patterns, however, begin to emerge when you look at an aggregate body of source files. The most important are reusability, coupling, and proper encapsulation. The length of classes and functions becomes important, and it's easy to argue for short versus long functions--typically, a long function is problematic to debug and maintain, and almost impossible to extend.
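To illustrate the point about function length, here is what the short-function style looks like; the domain and names below are made up for illustration, not taken from any real system:

public class InvoiceReport {

    public static void main(String[] args) {
        double[] lineItems = {19.99, 5.25, 42.00};
        System.out.println(report(lineItems));
    }

    // Each step is a short, single-purpose method that can be read,
    // tested, and extended on its own, instead of one long block.
    static String report(double[] lineItems) {
        double subtotal = subtotal(lineItems);
        double tax = tax(subtotal);
        return "Subtotal: " + subtotal + ", Tax: " + tax + ", Total: " + (subtotal + tax);
    }

    static double subtotal(double[] lineItems) {
        double sum = 0.0;
        for (double item : lineItems) {
            sum += item;
        }
        return sum;
    }

    static double tax(double subtotal) {
        return subtotal * 0.13; // flat 13% rate, purely for illustration
    }
}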
So modern software engineering university courses preach the three main tenets of object oriented development as being the silver bullet for our woes. Zipf's Law seems to support this practice, as presented in the paper Understanding the Shape of Java Software. This research looks at large software projects to try to understand what makes a system successful, and the authors seem to have found commonalities in the things that we've come to accept as good design principles of software systems.
We seem to take the three tenets of object oriented engineering at face value; however, there are solid theoretical reasons why they hold true in large scale systems. Yes, proper OOD is the way to go.
Coming back to my paraphrasing of Anna Karenina's first paragraph: successful projects do seem to have common processes; however, unsuccessful projects will be unsuccessful in their own ways.