Improving Semantic Consistency of Variable Names with Use-Flow Graph Analysis

URL: https://euske.github.io/

Yusuke Shinyama
Yoshitaka Arahori
Katsuhiko Gondow
(APSEC 2021 Paper #83)

Background: Consistency is Crucial for Maintaining Large Software Projects
Goal: Detecting Inconsistent Names in Source Code
How to Catch Inconsistency?
"Use flow" of variables
Training and Prediction
Experiment and Evaluation
Discussion
Conclusion

1. Background: Consistency is Crucial for Maintaining Large Software Projects

Large projects are developed by teams of multiple people.
- Even if there's only a single developer, they turn into a different person over time!
A style guide is set up to facilitate collaboration among team members.
- Source code style guides:
  - Google JavaScript Style Guide
  - NYTimes Objective-C Style Guide
- There is little guidance on how to name variables and functions...
  - Microsoft General Naming Conventions
Identifiers are crucial for program understanding:
- Programmers rely on the meaning of names. [Lawrie, 06]
- Bad names obstruct program understanding, resulting in bad code. [Avidan, 17]
- The importance of names is also emphasized in "Code Complete" and "The Practice of Programming".

2. Goal: Detecting Inconsistent Names in Source Code

void printResult(Stream out, String result) {
  out.writeLine("result:"+result);
}

void printStat(Stream out, int stat) {
  out.writeLine("stat:"+stat);
}

void printInfo(Stream strm, String info) {
  strm.writeLine("info:"+info);  // XXX "strm" should be "out".
}

Note: "out" might not necessarily be the best name for an output stream, but it is consistent throughout the program.

We also tried to:

Make an adjustable system:
Different projects use the same name for different things:
- "view" = window (GUI application)
- "view" = table (SQL engine)
Or different abbreviations for the same thing:
- message
- msg
Avoid relying on a dictionary or heuristics.
Make the system transparent:
- Show the developer how the names are inconsistent.

A lot of work has been done on the problem of identifiers:
- Checking the relevance of method names: [Høst, 09]
- Generating comments from source code: [Sridhara, 12]
- Proposing a method name based on code: [Allamanis, 15]
- Inferring the original variable names and types from obfuscated JavaScript code: [Raychev, 15]
- Extracting an embedding from code (code2vec): [Alon, 18]
Variable names are just as important as method names:
- Variables are named after "things" in an application domain.
- They are typically nouns, which are much more varied than verbs.

3. How to Catch Inconsistency?

Construct a mapping between the usage and the name for each variable:
- Variable usage:
```
    = open(...);
write(   , ...);
```
- Variable name: out
Find all the variables that have the same usage:
- ```
strm = open(...);
write(strm, ...);
```
Compare the variable names and single out the outliers.
- out
- out
- strm (Hey!)
- out
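The three steps above can be sketched in Java as follows. This is only an illustration under simplifying assumptions: usage is reduced to an exact-match string key, and the outlier is whatever disagrees with the majority name, whereas the actual system uses a probabilistic model over use flows. The class `NameOutliers`, its `Var` type, and the usage key format are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group variables by usage, then flag minority names.
class NameOutliers {
    /** One variable: its name and a string key describing how it is used. */
    static final class Var {
        final String name;
        final String usage;
        Var(String name, String usage) { this.name = name; this.usage = usage; }
    }

    /** Returns the variables whose name differs from the most common name for their usage. */
    static List<Var> findOutliers(List<Var> vars) {
        // usage -> (name -> number of variables with that name)
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (Var v : vars) {
            counts.computeIfAbsent(v.usage, k -> new HashMap<>())
                  .merge(v.name, 1, Integer::sum);
        }
        List<Var> outliers = new ArrayList<>();
        for (Var v : vars) {
            String majority = null;
            int best = 0;
            // Ties are broken arbitrarily in this sketch.
            for (Map.Entry<String, Integer> e : counts.get(v.usage).entrySet()) {
                if (e.getValue() > best) { best = e.getValue(); majority = e.getKey(); }
            }
            if (!v.name.equals(majority)) outliers.add(v);
        }
        return outliers;
    }

    public static void main(String[] args) {
        List<Var> vars = new ArrayList<>();
        vars.add(new Var("out", "open -> write"));
        vars.add(new Var("out", "open -> write"));
        vars.add(new Var("strm", "open -> write"));
        vars.add(new Var("out", "open -> write"));
        for (Var v : findOutliers(vars)) {
            System.out.println(v.name + " is inconsistent with its usage group");
        }
    }
}
```

Exact matching on the usage key is the weakest part of this sketch: two variables with nearly identical usage would land in different groups, which is one reason a similarity-based model is preferable.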

4. "Use Flow" of Variables

Now, how do we express "the usage of a variable"?
Usage of a variable = Sequence of operations applied to the value assigned to that variable.
We call this data flow of a variable its "use flow".

4.1. Example of use flows

Take a look at the variable line:

private BufferedReader fp;

public String get() {
    String line = fp.readLine();
    int i = line.indexOf(' ');
    return line.substring(0, i);
}

Here is its data flow visualization:

The red lines above = a use flow of line:

fp.readLine() → line → #this:indexOf() → #arg1:substring()

This path shows what is assigned to the variable line and how it is used.
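As a rough illustration of how such a path could be read off a data flow graph, the Java sketch below stores the graph as labeled nodes with directed edges and enumerates every path from a source node to a node with no outgoing edge. The class `UseFlowPaths` and its methods are hypothetical, the graph is assumed to be acyclic, and the edge list for get() is hand-written here rather than produced by a real analyzer. An ASCII "->" stands in for the arrow used above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enumerate use-flow paths in a small, acyclic data flow graph.
class UseFlowPaths {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Returns every path that starts at source and ends at a node without successors. */
    List<String> pathsFrom(String source) {
        List<String> result = new ArrayList<>();
        walk(source, new ArrayList<>(), result);
        return result;
    }

    // Depth-first traversal; assumes the graph has no cycles.
    private void walk(String node, List<String> path, List<String> result) {
        path.add(node);
        List<String> next = edges.getOrDefault(node, Collections.<String>emptyList());
        if (next.isEmpty()) {
            result.add(String.join(" -> ", path));
        } else {
            for (String n : next) walk(n, path, result);
        }
        path.remove(path.size() - 1);
    }

    public static void main(String[] args) {
        UseFlowPaths g = new UseFlowPaths();
        // Hand-written edges for get(): line is the receiver of indexOf() and of
        // substring(), and the result of indexOf() becomes argument 1 of substring().
        g.addEdge("fp.readLine()", "line");
        g.addEdge("line", "#this:indexOf()");
        g.addEdge("#this:indexOf()", "#arg1:substring()");
        g.addEdge("line", "#this:substring()");
        for (String p : g.pathsFrom("fp.readLine()")) System.out.println(p);
    }
}
```

With these edges the variable line has two paths: the one shown above and a shorter one ending at #this:substring(), since line is also the receiver of substring().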

4.2. Make It Interprocedural

private BufferedReader fp;

public String get() {
    String line = fp.readLine();
    int i = line.indexOf(' ');
    return line.substring(0, i);
}

public void show() {
    String name = get();
    System.out.println(name+"!!");
}

Note that the value derived from line in function get() is assigned to name in function show().

Here's the data flow graph:

The final use flow of line:

fp.readLine() → line → #this:indexOf() → #arg1:substring() → assign:name → L:+ → #arg0:println()

This path represents the usage of the variable line in this program.

5. Training and Prediction

We trained a Bayesian probabilistic model that predicts a variable name from a given use flow.

The following use flow predicts ??? = line:

fp.readLine() → ??? → #this:indexOf() → #arg1:substring() → assign:name → L:+ → #arg0:println()

If the variable name is other than line, it is inconsistent with this usage.

Algorithm:

Let V be all the variables.
For each v₁ ∈ V:
1. Train the model on every v ∈ V except v₁ (v ≠ v₁), estimating P(v.name | v.useflow).
2. Find the variable name n = argmax over name of P(name | v₁.useflow).
3. The variable name is consistent if n = v₁.name. Otherwise, suggest n as a better name for the variable v₁.
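The leave-one-out loop above might look like the following Java sketch. It assumes a plain Naive Bayes classifier that treats each step of a use flow as an independent feature and applies add-one (Laplace) smoothing; the feature design, the smoothing, and the names `NameConsistencyCheck`, `Var`, and `predict` are assumptions made for this example, not the system's actual model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: leave-one-out Naive Bayes over use-flow steps.
class NameConsistencyCheck {
    static final class Var {
        final String name;
        final List<String> useFlow;   // use-flow steps, treated as independent features
        Var(String name, String... steps) { this.name = name; this.useFlow = Arrays.asList(steps); }
    }

    /** Trains on every variable except target and returns the most probable name (null if no training data). */
    static String predict(List<Var> all, Var target) {
        Map<String, Integer> nameCount = new HashMap<>();
        Map<String, Map<String, Integer>> stepCount = new HashMap<>();   // name -> step -> count
        Map<String, Integer> stepTotal = new HashMap<>();                // name -> number of steps seen
        Set<String> vocab = new HashSet<>(target.useFlow);
        int n = 0;
        for (Var v : all) {
            if (v == target) continue;   // leave the target out of the training set
            n++;
            nameCount.merge(v.name, 1, Integer::sum);
            for (String s : v.useFlow) {
                stepCount.computeIfAbsent(v.name, k -> new HashMap<>()).merge(s, 1, Integer::sum);
                stepTotal.merge(v.name, 1, Integer::sum);
                vocab.add(s);
            }
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String name : nameCount.keySet()) {
            // log P(name) + sum of log P(step | name), with add-one smoothing
            double score = Math.log(nameCount.get(name) / (double) n);
            Map<String, Integer> counts = stepCount.getOrDefault(name, Collections.<String, Integer>emptyMap());
            int total = stepTotal.getOrDefault(name, 0);
            for (String s : target.useFlow) {
                score += Math.log((counts.getOrDefault(s, 0) + 1.0) / (total + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = name; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Var> vars = new ArrayList<>();
        vars.add(new Var("out", "open()", "#arg0:write()"));
        vars.add(new Var("out", "open()", "#arg0:write()"));
        vars.add(new Var("out", "open()", "#arg0:write()"));
        vars.add(new Var("strm", "open()", "#arg0:write()"));
        vars.add(new Var("line", "readLine()", "#this:indexOf()"));
        vars.add(new Var("line", "readLine()", "#this:indexOf()"));
        for (Var v : vars) {
            String n = predict(vars, v);
            if (n != null && !n.equals(v.name)) {
                System.out.println("suggest renaming " + v.name + " to " + n);
            }
        }
    }
}
```

Retraining from scratch for every variable is quadratic in the number of variables; a practical implementation would more likely subtract the held-out variable's counts from a single global model.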

6. Experiment and Evaluation

We evaluated our method on the following projects: (#edges < #useflow)

Project	LoC	#vars	#nodes	#edges
ant (build tool)	112k	23,971	350k	5,211k
antlr4 (parser generator)	31k	7,131	74k	1,103k
bcel (byte code analyzer)	31k	6,583	80k	1,190k
compress (data compression)	24k	5,896	69k	929k
jedit (text editor)	115k	21,977	294k	6,106k
jhotdraw (diagram renderer)	80k	17,367	235k	2,351k
junit (unit testing)	9k	2,384	21k	280k
lucene (document indexing)	109k	30,341	414k	7,146k
tomcat (web server)	238k	49,275	649k	11,799k
weka (machine learning)	324k	59,274	943k	13,224k
xerces (XML parser)	114k	21,852	314k	7,017k
xz (data compression)	7k	1,825	23k	299k

Test subjects: authors (3) + grad students (6) = 9 people.

RQ1. Is Use Flow a Good Representation for Variable Usage?

Experiment 1. Variable Equivalence Test

Present a pair of variables (with their names hidden) to human subjects:

Pair R003

Choice:

DefaultJspCompilerAdapter.java

     ...

  100:     */
  101:    protected void addArg(CommandlineJava aa, String argument, String value) {
  102:        if (value != null) {
  103:            aa.createArgument().setValue(argument);
  104:            aa.createArgument().setValue(value);
  105:        }
     ...

DefaultJspCompilerAdapter.java

     ...

   87:     */
   88:    protected void addArg(CommandlineJava bb, String argument) {
   89:        if (argument != null && !argument.isEmpty()) {
   90:           bb.createArgument().setValue(argument);
   91:        }
     ...

Tested for 12 projects × 5 variable pairs × 9 subjects = 540 questionnaires.

Ratio of #MustBeSame + #CanBeSame = 68% (369/540)
High similarity in use flows = the same variable name.

Defect: Only high-similarity pairs were tested, so the test was not completely blind. :(

RQ2. Can the System Predict a Good (Consistent) Variable Name?

Experiment 2-1. Name Suggestion Test

Present a snippet to human subjects:

Rewrite R000 (11.118)

Choice: xxx →

BuildException.java

     ...

   82:     */
   83:    public BuildException(String xxx, Throwable cause, Location location) {
   84:        this(xxx, cause);
   85:        this.location = location;
     ...

Evidence

BuildException.java

     ...

   67:     */
   68:    public BuildException(String message, Throwable cause) {
   69:        super(message, cause);
   70:    }
     ...

The subjects' choices were (in random order):

Original name (Orig)
Our system suggestion (Ours)
Baseline suggestion (Baseline)

Tested for 12 projects × 10 variables × 9 subjects = 1,080 questionnaires.

Our suggestion was chosen in 39% (416/1,080) of the questionnaires.
The degree of agreement (Fleiss' Kappa) = 0.45 (Moderate).
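For readers unfamiliar with the statistic, the sketch below computes Fleiss' kappa from its standard definition: the mean observed agreement per item is compared with the agreement expected by chance from the overall category proportions. The class `FleissKappa` and the toy rating matrix in `main` are hypothetical and are not the study's data.

```java
// Hypothetical sketch of the standard Fleiss' kappa computation.
class FleissKappa {
    /**
     * counts[i][j] = number of raters who assigned item i to category j.
     * Every row must sum to the same number of raters (at least 2).
     * Returns NaN when all ratings fall into a single category.
     */
    static double kappa(int[][] counts) {
        int items = counts.length;
        int categories = counts[0].length;
        int raters = 0;
        for (int c : counts[0]) raters += c;

        double[] categoryTotals = new double[categories];
        double sumAgreement = 0.0;
        for (int[] row : counts) {
            long squares = 0;
            for (int j = 0; j < categories; j++) {
                squares += (long) row[j] * row[j];
                categoryTotals[j] += row[j];
            }
            // Proportion of agreeing rater pairs for this item.
            sumAgreement += (squares - raters) / (double) (raters * (raters - 1));
        }
        double observed = sumAgreement / items;

        // Chance agreement from the overall proportion of each category.
        double expected = 0.0;
        for (double t : categoryTotals) {
            double p = t / (items * (double) raters);
            expected += p * p;
        }
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Toy data: 2 items, 3 raters, 2 categories.
        int[][] toy = { {2, 1}, {1, 2} };
        System.out.println(kappa(toy));
    }
}
```

In the setting above, an item would be one questionnaire variable, the raters the 9 subjects, and the categories Orig, Ours, and Baseline.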

Experiment 2-2. Sending Patches to Developers

Based on our results, we manually submitted 12 patches to the open source projects.

3 projects incorporated the patches.
2 projects are still discussing them.
1 project rejected them.

RQ3. Is the System Output Explainable?

Experiment 3. Evidence Persuasiveness Test

Choose 5 variable name suggestions that were highly ranked.
Present the evidence (snippets) used for producing each suggestion.
Ask the subjects to choose one of the following:
1. Presented evidence is convincing. (#Good)
2. Presented evidence is relevant. (#Soso)
3. Presented evidence is irrelevant. (#Bad)
4. Undecidable. (#Unknown)

Tested for 12 projects × 5 questions × 9 subjects = 540 questionnaires.

The results did not show that our system produced good explanations for its suggestions. :(

Anecdotal Examples

Some of the system suggestions were good:

Make the name more task-oriented:

org/apache/bcel/Const.java:
-  public static short getNoOfOperands(final int index) {
-      return NO_OF_OPERANDS[index];
+  public static short getNoOfOperands(final int opcode) {
+      return NO_OF_OPERANDS[opcode];

Use the conventional abbreviation for the project:

gjt/sp/jedit/bsh/classpath/BshClassPath.java:
-	void errorWhileMapping( String s ) {
+	void errorWhileMapping( String msg ) {
...

org/apache/jasper/compiler/Generator.java
-            String pkgName = className.substring(0, lastIndex);
-            genPreamblePackage(pkgName);
+            String packageName = className.substring(0, lastIndex);
+            genPreamblePackage(packageName);

Use a synonym that aligns better with the other parts of the project:

org/apache/xerces/impl/xpath/regex/RegexParser.java
-        ReferencePosition(int n, int pos) {
+        ReferencePosition(int n, int offset) {

Correct typos:

src/org/tukaani/xz/lz/Hash234.java
-    void normalize(int normalizeOffset) {
-        LZEncoder.normalize(hash2Table, HASH_2_SIZE, normalizeOffset);
+    void normalize(int normalizationOffset) {
+        LZEncoder.normalize(hash2Table, HASH_2_SIZE, normalizationOffset);
...

7. Discussion

Use flows are a reasonably good representation of variable usage: 68% of the high-similarity pairs were judged as MustBeSame or CanBeSame.
In real projects, subjects chose our system's suggestion over the original name and the baseline in 39% of the questionnaires.
We are not sure that our system produced a good explanation for its output.

7.1. Threats to Validity

Internal Validity (Did we answer RQs?)

The subjects might have had prior knowledge of the projects used.
The results depend on each subject's proficiency in the language.
It is not clear how many variables the proposed method can be applied to.
Not every use flow is correctly obtained.

External Validity (Is our result generalizable?)

The programming language is limited to Java.
Not enough projects are tested with enough subjects.
Dynamic dispatching or variable aliasing is not considered.
Naive Bayes classifier could be improved.

8. Conclusion

We presented a framework to test the consistency of variable names.
We proposed use flow as the representation of variable usage.
We demonstrated that the proposed method can detect and correct inconsistent variable names.
