How Do Programmers Express High-Level Concepts using Primitive Data Types?

URL: https://euske.github.io/
Yusuke Shinyama
Yoshitaka Arahori
Katsuhiko Gondow
(APSEC 2021 Paper #125)
  1. Background
  2. Basic Idea
  3. Experiment
  4. Inferring C-types
  5. Conclusion

1. Background

1.1. Research Questions

  1. What are common "high-level data types"?
  2. How do programmers express these high-level types in source code?
  3. Is it possible to infer them from surface clues?

2. Basic Idea

When programmers make an API call, they are aware of the role of its arguments:

String x = "foo/bar.txt";
var f = new java.io.File(x);  // x is a pathname
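
As a minimal sketch of this idea, the role of each API argument could be recorded in a lookup table keyed by method and parameter position. The `CTypeTable` class, the `ApiParam` record, and the table entries are illustrative assumptions rather than the authors' actual method list; the constructors and `setBounds` are real Java Standard API methods, but whether each appears in the studied set is not stated here.

```java
import java.util.Map;
import java.util.Optional;

/** Illustrative only: the names and table entries are assumptions. */
final class CTypeTable {
    enum CType { PATH, HOST, PORT, XCOORD, YCOORD, WIDTH, HEIGHT }

    /** An API method signature plus the zero-based index of one parameter. */
    record ApiParam(String method, int index) {}

    // Each entry states which role (c-type) a given argument position plays.
    static final Map<ApiParam, CType> TABLE = Map.of(
        new ApiParam("java.io.File.<init>(String)", 0), CType.PATH,
        new ApiParam("java.net.Socket.<init>(String,int)", 0), CType.HOST,
        new ApiParam("java.net.Socket.<init>(String,int)", 1), CType.PORT,
        new ApiParam("java.awt.Component.setBounds(int,int,int,int)", 0), CType.XCOORD,
        new ApiParam("java.awt.Component.setBounds(int,int,int,int)", 1), CType.YCOORD,
        new ApiParam("java.awt.Component.setBounds(int,int,int,int)", 2), CType.WIDTH,
        new ApiParam("java.awt.Component.setBounds(int,int,int,int)", 3), CType.HEIGHT);

    /** Returns the c-type of the given argument position, if the method is in the table. */
    static Optional<CType> lookup(String method, int index) {
        return Optional.ofNullable(TABLE.get(new ApiParam(method, index)));
    }

    public static void main(String[] args) {
        // In the example above, x is argument 0 of new java.io.File(x).
        System.out.println(lookup("java.io.File.<init>(String)", 0));
    }
}
```

With such a table, any expression passed at a listed argument position can be labeled with its c-type without inspecting the expression itself.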

We studied the Java Standard API and defined 12 commonly used c-types:

C-Type   Language Type   Description          # Methods
PATH     String          Path name            14
URL      String          URL/URI              4
SQL      String          SQL statement        10
HOST     String          Host name            17
PORT     int             Port number          25
XCOORD   int             X coordinate (GUI)   25
YCOORD   int             Y coordinate (GUI)   25
WIDTH    int             Width (GUI)          24
HEIGHT   int             Height (GUI)         24
YEAR     int             Year                 18
MONTH    int             Month                14
DAY      int             Day                  18
Total                                         218

Method examples:

The criteria that we used are:

  1. Clearly defined and well understood.
  2. Distinct enough to not mix up with other concepts.
  3. Widely used in many applications.

3. Experiment

We conducted the experiment in the following steps:

  1. Choose 26 medium-sized Java open source projects.
  2. Search for all calls to the specified API methods.
  3. Extract c-type expressions from the arguments of each API call.
  4. Examine the type, complexity, and commonly used words for each c-type expression.

C-type Expression Examples:

zookeeper/.../TxnLogToolkit.java:
    ...
    File file = new File(dir.getPath() + File.separator + Util.makeLogName(zxid)[PATH]);
jitsi/.../SIPCommSplitPaneDivider.java:
    ...
    rightButton.setBounds((insets.left * 2) + leftSize.width + rightSize.width[XCOORD],
                          y[YCOORD], rightSize.width[WIDTH], rightSize.height[HEIGHT]);

3.1. Projects and Sizes

Project            Description               LoC
hadoop 3.3.1       distributed computation   1,789k
ghidra 10.0        binary analyzer           1,588k
ignite 2.10.0      distributed database      1,165k
jetty 11.0.5       web container             441k
kafka 2.7.1        stream processing         384k
tomcat 8.5.68      web server                349k
jitsi 2.10         video conference          327k
binnavi 6.1.0      binary analyzer           309k
netty 4.1.65       network library           303k
libgdx 1.10.0      game framework            272k
alluxio 2.5.0-3    data orchestration        228k
plantuml 1.2021.7  UML generator             210k
grpc 1.38.1        RPC framework             195k
jenkins 2.299      automation                177k
jmeter 5.4.1       network analyzer          145k
jedit 5.6.0        text editor               125k
gephi 0.9.2        graph visualizer          120k
zookeeper 3.7.0    distributed computation   114k
selenium 3.141.59  browser automation        91k
okhttp 4.9.1       HTTP client               36k
jhotdraw 7.0.6     graph drawing             32k
arduino 1.8.15     development environment   27k
gson 2.8.7         serialization framework   25k
websocket 1.5.2    network framework         15k
picasso 2.8        image processing          9k
jpacman            action game               3k
Total                                        8,480k

3.2. Number of Obtained C-Type Expressions

The distribution of c-type expressions differs by application domain. The total number of c-type expressions is roughly proportional to the project size.

[Figure: number of c-type expressions (#ctype) per project, broken down by c-type (PATH, URL, SQL, HOST, PORT, XCOORD, YCOORD, WIDTH, HEIGHT, YEAR, MONTH, DAY, OTHER), shown together with project size (kloc).]

3.3. Top C-Type Expressions for Each Project

PATH

Project    Top Expressions
alluxio    path, mLocalUfsPath+ufsBase, base
arduino    path, PreferencesData.get("runtime.ide.path")
binnavi    filename, directory, pathname
gephi      System.getProperty("netbeans.user")
ghidra     getTestDirectoryPath(), path, filename
grpc       uri.getPath()
hadoop     GenericTestUtils.getRandomizedTempPath()
ignite     path, U.defaultWorkDirectory(), fileName
jedit      path, dir, directory
jenkins    System.getProperty("user.home"), war
jetty      file.getParent()
jhotdraw   prefs.get("projectFile", home)
jitsi      path, localPath
jmeter     filename, path, file
kafka      storeDirectoryPath, argument
libgdx     name, sourcePath, imagePath.replace('\\','/')
netty      getClass().getResource("test.crt").getFile()
plantuml   filename, newName
selenium   System.getProperty("java.io.tmpdir"), logName
tomcat     pathname, path, docBase
zookeeper  path, KerberosTestUtils.getKeytabFile()

URL

Project    Top Expressions
alluxio    journalDirectory, folder, inputDir
arduino    contribution.getUrl(), packageIndexURLString
binnavi    url, urlString
ghidra     ref, getAbsolutePath(), url.toExternalForm()
grpc       target, TARGET, oobTarget
gson       nextString, urlValue, uriValue
hadoop     uri, url, s
ignite     GridTestProperties.getProperty("p2p.uri.cls")
jedit      path, str, fileIcon
jenkins    url, site.getData().core.url, plugin.url
jetty      uri, inputUrl.toString(), s
jitsi      url, imagePath, sourceString
jmeter     url, LOCAL_HOST, requestPath
kafka      config.getString(METRICS_URL_CONFIG)
libgdx     url, URI, httpRequest.getUrl()+queryString
netty      URL, request.uri(), server
selenium   url, baseUrl, (String)raw.get("uri")
tomcat     url, location, path
websocket  uriField.getText(), uriinput.getText()
zookeeper  urlStr

XCOORD

Project    Top Expressions
arduino    noLeft, cancelLeft
binnavi    x, m_x
gephi      currentMouseX, x, bounds.x
ghidra     x, center.x+deltaX, filterPanelBounds.x
jedit      x, event.getX(), leftButtonWidth+leftWidth
jhotdraw   evt.getX(), x, e.getX()
jitsi      x, button.getX(), dx
jmeter     graphPanel.getLocation().x, cellRect.x, x
libgdx     upButtonX, getWidth()-buttonSize.width-5, x
plantuml   e.getX()

WIDTH

Project    Top Expressions
arduino    width, imageW, Preferences.BUTTON_WIDTH
binnavi    COLORPANEL_WIDTH, TEXTFIELD_WIDTH, width
gephi      w, constraintWidth, DEPTH
ghidra     width, center.width, filterPanelBounds.width
jedit      width, buttonSize.width, colWidth
jhotdraw   frameWidth, r.width, bounds.width
jitsi      MAX_MSG_PANE_WIDTH, WIDTH, width
jmeter     graphPanel.width
libgdx     width, buttonSize.width
plantuml   newWidth
tomcat     WIDTH

3.4. Complex C-Type Expressions

PATH

URL

XCOORD

WIDTH

3.5. Complexity of C-Type Expressions

(n: number of terms)

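[Figure: share (0% to 100%) of c-type expressions with n=1, n=2, n=3, n=4, n=5, n=6, and n≧7 terms, for each c-type (PATH, URL, SQL, HOST, PORT, XCOORD, YCOORD, WIDTH, HEIGHT, YEAR, MONTH, DAY).]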

4. Inferring C-types

We tried to infer c-types from the surface clues of expressions.

Basic strategy:

  1. Use the obtained c-type expressions (12k) as the training and test sets.
  2. Extract features from each expression.
  3. Use machine learning (a decision tree) to identify the c-type of each expression.

Decision tree-based machine learning is used because

Common Top Words for C-Type Expressions:

C-Type  Top words (# Projects)
PATH    get (21), path (21), file (20)
URL     url (19), get (18), string (18)
SQL     get (6), query (5), create (3)
HOST    host (21), get (17), address (17)
PORT    port (22), get (18), local (10)
XCOORD  width (9), x (9), get (9)
YCOORD  height (9), y (9), get (8)
WIDTH   width (13), get (11), size (10)
HEIGHT  height (12), get (11), size (10)
YEAR    year (4), get (2), int (2)
MONTH   january (3), month (3), december (3)
DAY     day (3), int (2), parse (2)
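
A table of this kind could be produced roughly as follows: split each expression into lowercase words and count, for each word, the number of projects in which it occurs. This is only a sketch. The splitting rule (non-letters and camelCase boundaries), the `TopWords` class, and the sample data in `main` are assumptions, not the procedure actually used.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative only: the word-splitting rule is an assumption. */
final class TopWords {
    /** Splits expression text into lowercase words at non-letters and camelCase boundaries. */
    static List<String> words(String expr) {
        List<String> out = new ArrayList<>();
        for (String ident : expr.split("[^A-Za-z]+")) {
            // Zero-width split between a lowercase letter and an uppercase letter.
            for (String w : ident.split("(?<=[a-z])(?=[A-Z])")) {
                if (!w.isEmpty()) out.add(w.toLowerCase());
            }
        }
        return out;
    }

    /** Counts, for each word, the number of projects in which it occurs at least once. */
    static Map<String, Integer> projectCounts(Map<String, List<String>> exprsByProject) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> exprs : exprsByProject.values()) {
            Set<String> seen = new HashSet<>();
            for (String e : exprs) seen.addAll(words(e));
            // Each project contributes at most 1 to a word's count.
            for (String w : seen) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few PATH expressions from the tables in Section 3.3, as sample input.
        Map<String, List<String>> sample = Map.of(
            "grpc", List.of("uri.getPath()"),
            "jetty", List.of("file.getParent()"),
            "tomcat", List.of("pathname", "docBase"));
        System.out.println(projectCounts(sample));
    }
}
```

Counting projects rather than occurrences keeps one large project from dominating the ranking.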

We used a dataflow diagram to extract features from expressions:

[Figure: dataflow diagram of "new File(config.getPath(i))", with nodes config, i, getPath(), and new File( ), marking the primary identifier and the secondary identifiers.]

Dependency Rules

Expression                               Dependency
# (constant)                             #
A (variable access)                      A
A.B (field access)                       A → B
B(A) (method call)                       A → B()
A.B() (instance method call)             A → B()
op A (applying a unary operator)         A → op
A op B (applying a binary operator)      A → op, B → op
B = A (assignment)                       A → B
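
The dependency rules can be sketched as a recursive walk over a small expression tree that emits one edge per rule. The `DepRules` class and its record types are hypothetical, and treating `new File(...)` as a method call named "new File" is an assumption made for this example; the actual tooling is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a toy expression tree, not the authors' tooling. */
final class DepRules {
    interface Expr {}
    record Const(String text) implements Expr {}
    record Var(String name) implements Expr {}
    record Field(Expr target, String name) implements Expr {}
    /** target is null for a plain call B(A) and non-null for A.B(...). */
    record Call(Expr target, String name, List<Expr> args) implements Expr {}
    record Unary(String op, Expr operand) implements Expr {}
    record Binary(Expr left, String op, Expr right) implements Expr {}
    record Assign(String var, Expr value) implements Expr {}

    /** Appends dependency edges for e to edges and returns the node that e evaluates to. */
    static String visit(Expr e, List<String> edges) {
        if (e instanceof Const) return "#";                       // # (constant)
        if (e instanceof Var v) return v.name();                  // A
        if (e instanceof Field f) {                               // A.B: A -> B
            edges.add(visit(f.target(), edges) + " -> " + f.name());
            return f.name();
        }
        if (e instanceof Call c) {                                // B(A), A.B(): A -> B()
            String node = c.name() + "()";
            if (c.target() != null) edges.add(visit(c.target(), edges) + " -> " + node);
            for (Expr a : c.args()) edges.add(visit(a, edges) + " -> " + node);
            return node;
        }
        if (e instanceof Unary u) {                               // op A: A -> op
            edges.add(visit(u.operand(), edges) + " -> " + u.op());
            return u.op();
        }
        if (e instanceof Binary b) {                              // A op B: A -> op, B -> op
            edges.add(visit(b.left(), edges) + " -> " + b.op());
            edges.add(visit(b.right(), edges) + " -> " + b.op());
            return b.op();
        }
        if (e instanceof Assign a) {                              // B = A: A -> B
            edges.add(visit(a.value(), edges) + " -> " + a.var());
            return a.var();
        }
        throw new IllegalArgumentException("unknown expression: " + e);
    }

    public static void main(String[] args) {
        // new File(config.getPath(i)), with the constructor modeled as a call.
        Expr e = new Call(null, "new File",
            List.of(new Call(new Var("config"), "getPath", List.of(new Var("i")))));
        List<String> edges = new ArrayList<>();
        visit(e, edges);
        edges.forEach(System.out::println);
    }
}
```

Edges are emitted innermost first, so the last edge points at the node that receives the whole expression.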

We extract "primary identifier" and "secondary identifier(s)" from the above diagram:

We tested the obtained decision tree classifier with leave-one-project-out cross validation:

Classification Performance

C-Type   Precision  Recall  F-score
PATH     68.9%      91.8%   78.8%
URL      61.3%      53.0%   56.8%
SQL      70.4%      80.6%   75.2%
HOST     70.0%      73.8%   71.8%
PORT     84.6%      87.5%   86.0%
XCOORD   95.7%      82.1%   88.3%
YCOORD   97.5%      79.4%   87.5%
WIDTH    92.0%      92.5%   92.2%
HEIGHT   90.4%      93.4%   91.9%
YEAR     100.0%     83.7%   91.1%
MONTH    100.0%     77.0%   87.0%
DAY      100.0%     61.1%   75.9%
Average  85.9%      79.6%   82.7%
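
Leave-one-project-out cross validation can be sketched as follows: each project in turn is held out as the test set while the classifier is trained on all the others. The `LeaveOneProjectOut` class and its records are hypothetical names, and the classifier itself is omitted. The `fScore` helper assumes the standard harmonic-mean definition, which agrees with the table up to rounding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: fold construction and F-score, without the classifier. */
final class LeaveOneProjectOut {
    /** A labeled expression and the project it came from. */
    record Example(String project, String expr, String ctype) {}
    /** One fold: the held-out project, the training set, and the test set. */
    record Fold(String heldOut, List<Example> train, List<Example> test) {}

    /** Builds one fold per project, in alphabetical order of project name. */
    static List<Fold> folds(List<Example> all) {
        Map<String, List<Example>> byProject = new TreeMap<>();
        for (Example ex : all) {
            byProject.computeIfAbsent(ex.project(), k -> new ArrayList<>()).add(ex);
        }
        List<Fold> out = new ArrayList<>();
        for (String p : byProject.keySet()) {
            List<Example> train = new ArrayList<>();
            for (Example ex : all) {
                if (!ex.project().equals(p)) train.add(ex);  // every other project
            }
            out.add(new Fold(p, train, byProject.get(p)));
        }
        return out;
    }

    /** F-score as the harmonic mean of precision and recall (both in [0, 1]). */
    static double fScore(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0 ? 0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the expressions appear in the tables in Section 3.3.
        List<Example> all = List.of(
            new Example("grpc", "uri.getPath()", "PATH"),
            new Example("jetty", "file.getParent()", "PATH"),
            new Example("jetty", "uri", "URL"));
        for (Fold f : folds(all)) {
            System.out.println(f.heldOut() + ": train=" + f.train().size()
                + ", test=" + f.test().size());
        }
        System.out.println(fScore(0.846, 0.875));  // PORT row of the table
    }
}
```

Splitting by project rather than by expression keeps project-specific naming habits out of the training set for the project being tested.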

The URL c-type was not recognized well because URL expressions often include HOST, PORT, or PATH expressions, which confused the classifier:

"https://"+getHostName()+":"+getPort()+"/"+getPath()

5. Conclusion

Research Questions (again):

  1. What are common "high-level data types"?
    → The distribution of c-types depends on the domain of each project.
  2. How do programmers express these high-level types in source code?
  3. Is it possible to infer them from surface clues?
    → These questions are related. By using superficial features, we obtained a classifier with 83% F-score. This suggests that programmers tend to express the common c-types in a rather obvious way.

In the future, we could use the obtained classifier to infer c-types in other parts of the code.

5.1. Threats to Validity

Note to the listeners of the previous presentation: This work is related to the previous presentation (Use-Flow Graph Analysis) only in that consistent variable naming would improve the accuracy of c-type identification. Both works also use the same tooling (a dataflow graph). Other than that, the two works address different problems.

