Improving Semantic Consistency of Variable Names with Use-Flow Graph Analysis
URL: https://euske.github.io/
Yusuke Shinyama
Yoshitaka Arahori
Katsuhiko Gondow
(APSEC 2021 Paper #83)
Background: Consistency is Crucial for Maintaining Large Software Projects
Goal: Detecting Inconsistent Names in Source Code
How to Catch Inconsistency?
"Use flow" of variables
Training and Prediction
Experiment and Evaluation
Discussion
Conclusion
1. Background: Consistency is Crucial for Maintaining Large Software Projects
Large projects are developed by a team with multiple people.
Even if there's only a single developer, they turn into a different person over time!
A style guide is set up to facilitate collaboration among team members.
Source code style guides:
There is little guidance on how to name variables and functions...
Identifiers are crucial for program understanding:
Programmers rely on the meaning of names.
[Lawrie, 06]
Bad names obstruct program understanding, resulting in bad code.
[Avidan, 17]
The importance of names is also emphasized in "Code Complete"
and "The Practice of Programming".
2. Goal: Detecting Inconsistent Names in Source Code
void printResult(Stream out, String result) {
    out.writeLine("result:"+result);
}
void printStat(Stream out, int stat) {
    out.writeLine("stat:"+stat);
}
void printInfo(Stream strm, String info) {
    strm.writeLine("info:"+info);
}
Note: "out" might not necessarily be the best name
for an output stream, but it is consistent throughout the program.
We also tried to:
Make an adjustable system:
Different projects use the same name for different things:
"view" = window (GUI application)
"view" = table (SQL engine)
Or different abbreviations:
Not use a dictionary or heuristics.
Make the system transparent:
Show the developer how the names are inconsistent.
A lot of work has been done for the problem of identifiers:
Variable names are just as important as method names:
Variables are named after "things" in an application domain.
They are typically nouns, which are much more varied than verbs.
3. How to Catch Inconsistency?
Construct a mapping between the usage and the name for each variable:
Find all the variables that have the same usage:
Compare the variable names and single out the outliers.
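The three steps above can be sketched as follows. This is only an illustration, not the system from the paper: the usage is reduced to an opaque string key, and the class NameOutliers, the nested class Var, and the method findOutliers are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NameOutliers {
    // Hypothetical pair: a variable name and an opaque key describing its usage.
    static final class Var {
        final String name;
        final String usage;
        Var(String name, String usage) { this.name = name; this.usage = usage; }
    }

    // Step 1: map each usage to the names given to it (with counts).
    // Step 2: variables sharing a usage key form one group.
    // Step 3: report every variable whose name is strictly less common
    //         than the most common name of its group.
    static List<String> findOutliers(List<Var> vars) {
        Map<String, Map<String, Integer>> namesByUsage = new TreeMap<>();
        for (Var v : vars) {
            namesByUsage.computeIfAbsent(v.usage, k -> new TreeMap<>())
                        .merge(v.name, 1, Integer::sum);
        }
        List<String> report = new ArrayList<>();
        for (Var v : vars) {
            Map<String, Integer> names = namesByUsage.get(v.usage);
            String majority = Collections.max(
                names.entrySet(), Map.Entry.comparingByValue()).getKey();
            if (names.get(majority) > names.get(v.name)) {
                report.add(v.name + " -> " + majority);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Toy data modeled on the printResult/printStat/printInfo example.
        List<Var> vars = List.of(
            new Var("out", "Stream.writeLine"),
            new Var("out", "Stream.writeLine"),
            new Var("strm", "Stream.writeLine"));
        System.out.println(findOutliers(vars));
    }
}
```

Exact matching of usage keys is the simplest possible choice; the rest of the talk replaces the key with a use flow and the majority vote with a probabilistic model.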
4. "Use Flow" of Variables
Now, how do we express "the usage of a variable"?
Usage of variable = Sequence of operations applied to the value assigned into that variable.
Let us call this dataflow of the variable "use flow".
4.1. Example of use flows
Take a look at the variable line:
private BufferedReader fp;
public String get() {
    String line = fp.readLine();
    int i = line.indexOf(' ');
    return line.substring(0, i);
}
Here is its data flow visualization:
[Figure: data flow graph of get(), with one use flow of line drawn in red.]
The red lines in the figure = a use flow of line:
fp.readLine()
→ line
→ #this:indexOf()
→ #arg1:substring()
This path shows what is assigned to the variable line and how it is used.
4.2. Make It Interprocedural
private BufferedReader fp;
public String get() {
    String line = fp.readLine();
    int i = line.indexOf(' ');
    return line.substring(0, i);
}
public void show() {
    String name = get();
    System.out.println(name+"!!");
}
Note that line in function get() is now name in function show().
Here's the data flow graph:
The final use flow of line:
fp.readLine()
→ line
→ #this:indexOf()
→ #arg1:substring()
→ assign:name
→ L:+
→ #arg0:println()
This path represents the usage of the variable line in this program.
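One way to picture the interprocedural step is to treat a use flow as a list of node labels and to append the caller's part to the callee's part at the return site. This is a simplification made for illustration: the actual analysis works on a data flow graph, and UseFlowSketch, extend, and render are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class UseFlowSketch {
    // Appends the caller-side path to the callee-side path, modeling the
    // returned value flowing into the call site.
    static List<String> extend(List<String> calleeFlow, List<String> callerFlow) {
        List<String> joined = new ArrayList<>(calleeFlow);
        joined.addAll(callerFlow);
        return joined;
    }

    // Renders a path in the arrow notation used on the slides
    // (\u2192 is the right-arrow character).
    static String render(List<String> path) {
        return String.join(" \u2192 ", path);
    }

    public static void main(String[] args) {
        // Node labels copied from the get()/show() example.
        List<String> insideGet = List.of(
            "fp.readLine()", "line", "#this:indexOf()", "#arg1:substring()");
        List<String> insideShow = List.of(
            "assign:name", "L:+", "#arg0:println()");
        System.out.println(render(extend(insideGet, insideShow)));
    }
}
```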
5. Training and Prediction
We trained a Bayesian probabilistic model (a naive Bayes classifier) that predicts
a variable name from a given use flow:
The following use flow predicts ??? = line:
fp.readLine()
→ ???
→ #this:indexOf()
→ #arg1:substring()
→ assign:name
→ L:+
→ #arg0:println()
If the variable name is other than line, it is inconsistent with this usage.
Algorithm:
Let V be the set of all variables.
For each v1 ∈ V:
    For each v ∈ V except v1 (v ≠ v1), compute P(v.name | v.useflow).
    Find the variable name n = argmax over names of P(name | v1.useflow).
    The variable name is consistent if n = v1.name.
    Otherwise, suggest n as a better name for the variable v1.
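The leave-one-out loop above can be sketched with a small naive Bayes classifier. Several things here are assumptions rather than details from the talk: each node label of a use flow is treated as an independent feature, add-one (Laplace) smoothing is used, and NameConsistencySketch, Var, and predict are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class NameConsistencySketch {
    static final class Var {
        final String name;
        final List<String> flow; // node labels of the use flow
        Var(String name, List<String> flow) { this.name = name; this.flow = flow; }
    }

    // Trains on every variable except target (compared by reference) and
    // returns the name maximizing log P(name) + sum of log P(label | name).
    // Returns null when there is no other variable to train on.
    static String predict(List<Var> all, Var target) {
        Map<String, Integer> nameCount = new TreeMap<>();
        Map<String, Map<String, Integer>> labelCount = new HashMap<>();
        Map<String, Integer> labelTotal = new HashMap<>();
        Set<String> vocabulary = new HashSet<>(target.flow);
        int trained = 0;
        for (Var v : all) {
            if (v == target) continue; // leave one out
            trained++;
            nameCount.merge(v.name, 1, Integer::sum);
            for (String label : v.flow) {
                vocabulary.add(label);
                labelCount.computeIfAbsent(v.name, k -> new HashMap<>())
                          .merge(label, 1, Integer::sum);
                labelTotal.merge(v.name, 1, Integer::sum);
            }
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : nameCount.entrySet()) {
            String name = e.getKey();
            Map<String, Integer> counts = labelCount.getOrDefault(name, Map.of());
            int total = labelTotal.getOrDefault(name, 0);
            double score = Math.log((double) e.getValue() / trained);
            for (String label : target.flow) {
                // Add-one smoothing keeps unseen labels from zeroing the product.
                score += Math.log((counts.getOrDefault(label, 0) + 1.0)
                                  / (total + vocabulary.size()));
            }
            if (score > bestScore) { bestScore = score; best = name; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data; the labels are illustrative, not output of the real analysis.
        List<String> writes = List.of("param:Stream", "#this:writeLine()");
        List<String> reads = List.of("fp.readLine()", "#this:indexOf()");
        List<Var> vars = List.of(
            new Var("out", writes), new Var("out", writes), new Var("strm", writes),
            new Var("line", reads), new Var("line", reads));
        for (Var v : vars) {
            String n = predict(vars, v);
            System.out.println(v.name.equals(n)
                ? v.name + ": consistent"
                : v.name + ": suggest " + n);
        }
    }
}
```

Ties are broken by alphabetical order of the candidate names, which is an arbitrary choice made only to keep the sketch deterministic.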
6. Experiment and Evaluation
Evaluated our method with the following projects:
(#edges < #useflow)
Project                        LoC    #vars    #nodes   #edges
ant (build tool)               112k   23,971   350k     5,211k
antlr4 (parser generator)      31k    7,131    74k      1,103k
bcel (byte code analyzer)      31k    6,583    80k      1,190k
compress (data compression)    24k    5,896    69k      929k
jedit (text editor)            115k   21,977   294k     6,106k
jhotdraw (diagram renderer)    80k    17,367   235k     2,351k
junit (unit testing)           9k     2,384    21k      280k
lucene (document indexing)     109k   30,341   414k     7,146k
tomcat (web server)            238k   49,275   649k     11,799k
weka (machine learning)        324k   59,274   943k     13,224k
xerces (XML parser)            114k   21,852   314k     7,017k
xz (data compression)          7k     1,825    23k      299k
Test subjects: authors (3) + grad students (6) = 9 people.
RQ1. Is Use Flow a Good Representation for Variable Usage?
Experiment 1. Variable Equivalence Test
Present a pair of variables (with their names hidden) to human subjects:
Pair R003
Choice:
x. ???
a. MUST BE the same name
b. CAN BE the same name
c. MUST NOT BE the same name
DefaultJspCompilerAdapter.java
...
*/
protected void addArg(CommandlineJava aa, String argument, String value) {
    if (value != null) {
        aa.createArgument().setValue(argument);
        aa.createArgument().setValue(value);
    }
...
DefaultJspCompilerAdapter.java
...
*/
protected void addArg(CommandlineJava bb, String argument) {
    if (argument != null && !argument.isEmpty()) {
        bb.createArgument().setValue(argument);
    }
...
Tested for 12 projects × 5 variable pairs × 9 subjects = 540 questionnaires.
Ratio of #MustBeSame + #CanBeSame = 68% (369/540)
High similarity in use flows = the same variable name.
[Figure: per-project chart (ant, antlr4, bcel, compress, jedit, jhotdraw, junit, lucene, tomcat, weka, xerces, xz, Avg.) of the answers MustBeSame, CanBeSame, Different, Unknown.]
Defects:
Only tested with the high similarity pairs.
Therefore the test was not completely blind. :(
RQ2. Can the System Predict a Good (Consistent) Variable Name?
Experiment 2-1. Name Suggestion Test
Present a snippet to human subjects:
Rewrite R000 (11.118)
Choice: xxx →
x. ???
a. msg
b. message
c. name
BuildException.java
...
*/
public BuildException(String xxx, Throwable cause, Location location) {
    this(xxx, cause);
    this.location = location;
...
Evidence
BuildException.java
...
*/
public BuildException(String message, Throwable cause) {
    super(message, cause);
}
...
Their choices are (in random order):
Original name (Orig)
Our system suggestion (Ours)
Baseline suggestion (Baseline)
Tested for 12 projects × 10 variables × 9 subjects = 1,080 questionnaires.
In 39% (416/1,080) of the responses, our suggestion was chosen.
The degree of agreement (Fleiss' Kappa) = 0.45 (Moderate).
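For readers unfamiliar with the agreement measure, the standard Fleiss' kappa formula can be sketched as below. The input matrix in main is made-up toy data, not the responses collected in this experiment, and FleissKappaSketch is a hypothetical name.

```java
final class FleissKappaSketch {
    // counts[i][j] = number of raters who put item i into category j.
    // Every row must sum to the same number of raters (at least 2).
    static double kappa(int[][] counts) {
        int items = counts.length;
        int raters = 0;
        for (int c : counts[0]) raters += c;

        double[] categoryTotals = new double[counts[0].length];
        double observed = 0.0; // mean per-item agreement
        for (int[] row : counts) {
            double sumSquares = 0.0;
            for (int j = 0; j < row.length; j++) {
                categoryTotals[j] += row[j];
                sumSquares += (double) row[j] * row[j];
            }
            // Fraction of rater pairs that agree on this item.
            observed += (sumSquares - raters) / (raters * (raters - 1.0));
        }
        observed /= items;

        double expected = 0.0; // agreement expected by chance
        for (double total : categoryTotals) {
            double share = total / ((double) items * raters);
            expected += share * share;
        }
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Toy data: 3 items, 9 raters, 3 categories.
        int[][] toy = {
            {7, 1, 1},
            {2, 6, 1},
            {1, 2, 6},
        };
        System.out.println(kappa(toy));
    }
}
```

A value of 1 means the raters always agree, and 0 means they agree no more often than chance would predict.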
[Figure: per-project chart (ant, antlr4, bcel, compress, jedit, jhotdraw, junit, lucene, tomcat, weka, xerces, xz, Avg.) of the choices Ours vs. Orig+Baseline.]
Experiment 2-2. Sending Patches to Developers
Based on our results, we manually submitted 12 patches to the open source projects.
3 projects incorporated the patch.
2 projects are still in discussion.
1 project rejected the patch.
RQ3. Is the System Output Explainable?
Experiment 3. Evidence Persuasiveness Test
Choose 5 highly ranked variable name suggestions.
Present the evidence (snippets) used for producing each suggestion.
Ask the subjects to choose one of the following:
Presented evidence is convincing. (#Good)
Presented evidence is relevant. (#Soso)
Presented evidence is irrelevant. (#Bad)
Undecidable. (#Unknown)
Tested for 12 projects × 5 questions × 9 subjects = 540 questionnaires.
[Figure: per-project chart (ant, antlr4, bcel, compress, jedit, jhotdraw, junit, lucene, tomcat, weka, xerces, xz, Avg.) of the answers Good, Soso, Bad, Unknown.]
The results did not show that our system produced good explanations
for its suggestions. :(
Anecdotal Examples
Some of the system suggestions were good:
Make the name more task-oriented:
org/apache/bcel/Const.java:
- public static short getNoOfOperands(final int index) {
- return NO_OF_OPERANDS[index];
+ public static short getNoOfOperands(final int opcode) {
+ return NO_OF_OPERANDS[opcode];
Use the conventional abbreviation for the project:
gjt/sp/jedit/bsh/classpath/BshClassPath.java:
- void errorWhileMapping( String s ) {
+ void errorWhileMapping( String msg ) {
...
org/apache/jasper/compiler/Generator.java:
- String pkgName = className.substring(0, lastIndex);
- genPreamblePackage(pkgName);
+ String packageName = className.substring(0, lastIndex);
+ genPreamblePackage(packageName);
Use a synonym which aligns better with the other parts of the project:
org/apache/xerces/impl/xpath/regex/RegexParser.java:
- ReferencePosition(int n, int pos) {
+ ReferencePosition(int n, int offset) {
Correct typos:
src/org/tukaani/xz/lz/Hash234.java:
- void normalize(int normalizeOffset) {
- LZEncoder.normalize(hash2Table, HASH_2_SIZE, normalizeOffset);
+ void normalize(int normalizationOffset) {
+ LZEncoder.normalize(hash2Table, HASH_2_SIZE, normalizationOffset);
...
7. Discussion
Use flows are reasonably good at representing variable usage.
In real projects, our system suggested a better (more consistent) name
than the original in 39% of the cases.
We could not confirm that our system produces a good explanation for its output.
7.1. Threats to Validity
Internal Validity (Did we answer RQs?)
The subjects might have prior knowledge of the projects used.
The results depend on each subject's proficiency in the language.
It is not clear how many variables the proposed method can apply to.
Not every use flow is correctly obtained.
External Validity (Is our result generalizable?)
The programming language is limited to Java.
Not enough projects are tested with enough subjects.
Neither dynamic dispatch nor variable aliasing is considered.
The naive Bayes classifier could be improved.
8. Conclusion
We presented a framework to test the consistency of variable names.
We proposed use flow as the representation of variable usage.
We demonstrated that the proposed method can detect and correct
inconsistent variable names.