https://euske.github.io/
String username = getCurrentUserName();
String path = "/home/"+username+"/user.cfg";
// Unsafe path: extra check is needed!
File config = new File(path);
A better (but more cumbersome) version would be:
User user = getCurrentUser();
Path path = Paths.get(user.getHomeDirectory(), "user.cfg");
// Path is guaranteed to be safe.
File config = path.toFile();
String and int types, however,
are still used for a wide variety of purposes.
We look at how these types (String and int)
are used in various software projects.
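The "extra check" that the unsafe String-based path needs could look roughly like the sketch below. The class `SafeConfigPath`, the method `resolveUserConfig`, and the rule "exactly `<base>/<username>/user.cfg`" are assumptions made for illustration and do not come from the slides.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class SafeConfigPath {
    // Hypothetical helper: builds "<baseDir>/<username>/user.cfg" and rejects
    // user names that would place the result anywhere else (e.g. "../etc").
    static Path resolveUserConfig(String baseDir, String username) {
        Path base = Paths.get(baseDir).toAbsolutePath().normalize();
        Path candidate = base.resolve(username).resolve("user.cfg").normalize();
        // The result must stay under base and sit exactly two levels below it.
        boolean insideBase = candidate.startsWith(base);
        boolean expectedDepth = candidate.getNameCount() == base.getNameCount() + 2;
        if (!insideBase || !expectedDepth) {
            throw new IllegalArgumentException("unsafe user name: " + username);
        }
        return candidate;
    }

    public static void main(String[] args) {
        System.out.println(resolveUserConfig("/home", "alice"));
        try {
            resolveUserConfig("/home", "../etc");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The check works on normalized `Path` objects rather than on the raw String, which is the kind of care that a plain String argument does not enforce by itself.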
When programmers make an API call, they are aware of the role of its arguments:
String x = "foo/bar.txt";
var f = new java.io.File(x); // x is a pathname
x refers to a String,
but it's clearly a pathname.
We studied the Java Standard API and defined 12 commonly used c-types:
| C-Type | Language Type | Description | # Methods |
|---|---|---|---|
| PATH | String | Path name | 14 |
| URL | String | URL/URI | 4 |
| SQL | String | SQL statement | 10 |
| HOST | String | Host name | 17 |
| PORT | int | Port number | 25 |
| XCOORD | int | X coordinate (GUI) | 25 |
| YCOORD | int | Y coordinate (GUI) | 25 |
| WIDTH | int | width (GUI) | 24 |
| HEIGHT | int | height (GUI) | 24 |
| YEAR | int | year | 18 |
| MONTH | int | month | 14 |
| DAY | int | day | 18 |
| Total | | | 218 |
Method examples:
new java.io.File(PATH)
new java.net.URI(URL)
java.sql.Statement.execute(SQL)
java.net.InetAddress.getByName(HOST)
new java.net.Socket(HOST, PORT)
new java.awt.Point(XCOORD, YCOORD)
new java.awt.Dimension(WIDTH, HEIGHT)
new java.util.Date(YEAR, MONTH, DAY)
java.util.Date.setYear(YEAR)
java.time.LocalDate.of(YEAR, MONTH, DAY)
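One way to hold such a catalog in code is a lookup table from API method to the c-types of its arguments. The sketch below is only an illustration: the `Class#method` key notation, the class name `CTypeCatalog`, and the choice to ignore overloads are assumptions, and only a few of the methods listed above are included. It needs Java 9 or later for `Map.of` and `List.of`.

```java
import java.util.List;
import java.util.Map;

class CTypeCatalog {
    enum CType { PATH, URL, SQL, HOST, PORT, XCOORD, YCOORD, WIDTH, HEIGHT, YEAR, MONTH, DAY }

    // Keys use an ad-hoc "Class#method" notation chosen for this sketch;
    // "<init>" stands for a constructor. Overloads are not distinguished.
    static final Map<String, List<CType>> ARG_CTYPES = Map.of(
        "java.io.File#<init>", List.of(CType.PATH),
        "java.net.URI#<init>", List.of(CType.URL),
        "java.sql.Statement#execute", List.of(CType.SQL),
        "java.net.InetAddress#getByName", List.of(CType.HOST),
        "java.net.Socket#<init>", List.of(CType.HOST, CType.PORT),
        "java.awt.Point#<init>", List.of(CType.XCOORD, CType.YCOORD),
        "java.awt.Dimension#<init>", List.of(CType.WIDTH, CType.HEIGHT),
        "java.time.LocalDate#of", List.of(CType.YEAR, CType.MONTH, CType.DAY)
    );

    // Returns the c-types of the arguments, or an empty list for unknown methods.
    static List<CType> argumentCTypes(String method) {
        return ARG_CTYPES.getOrDefault(method, List.of());
    }
}
```

For example, `CTypeCatalog.argumentCTypes("java.net.Socket#<init>")` yields the list `[HOST, PORT]`, so the first argument expression at such a call site would be recorded as a HOST and the second as a PORT.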
The criteria that we used are:
We did the following experiments:
zookeeper/.../TxnLogToolkit.java:
...
File file = new File(dir.getPath() + File.separator + Util.makeLogName(zxid)[PATH]);
jitsi/.../SIPCommSplitPaneDivider.java:
...
rightButton.setBounds((insets.left * 2) + leftSize.width + rightSize.width[XCOORD],
y[YCOORD], rightSize.width[WIDTH], rightSize.height[HEIGHT]);
| Project | Description | LoC |
|---|---|---|
| hadoop 3.3.1 | distributed computation | 1,789k |
| ghidra 10.0 | binary analyzer | 1,588k |
| ignite 2.10.0 | distributed database | 1,165k |
| jetty 11.0.5 | web container | 441k |
| kafka 2.7.1 | stream processing | 384k |
| tomcat 8.5.68 | web server | 349k |
| jitsi 2.10 | video conference | 327k |
| binnavi 6.1.0 | binary analyzer | 309k |
| netty 4.1.65 | network library | 303k |
| libgdx 1.10.0 | game framework | 272k |
| alluxio 2.5.0-3 | data orchestration | 228k |
| plantuml 1.2021.7 | UML generator | 210k |
| grpc 1.38.1 | RPC framework | 195k |
| jenkins 2.299 | automation | 177k |
| jmeter 5.4.1 | network analyzer | 145k |
| jedit 5.6.0 | text editor | 125k |
| gephi 0.9.2 | graph visualizer | 120k |
| zookeeper 3.7.0 | distributed computation | 114k |
| selenium 3.141.59 | browser automation | 91k |
| okhttp 4.9.1 | HTTP client | 36k |
| jhotdraw 7.0.6 | graph drawing | 32k |
| arduino 1.8.15 | development environment | 27k |
| gson 2.8.7 | serialization framework | 25k |
| websocket 1.5.2 | network framework | 15k |
| picasso 2.8 | image processing | 9k |
| jpacman | action game | 3k |
| Total | | 8,480k |
The distribution of c-type expressions differs by application domain. The total number of c-types is roughly proportional to the project size.

Top expressions used as PATH arguments, by project:

| Project | Top Expressions |
|---|---|
| alluxio | path, mLocalUfsPath+ufsBase, base |
| arduino | path, PreferencesData.get("runtime.ide.path") |
| binnavi | filename, directory, pathname |
| gephi | System.getProperty("netbeans.user") |
| ghidra | getTestDirectoryPath(), path, filename |
| grpc | uri.getPath() |
| hadoop | GenericTestUtils.getRandomizedTempPath() |
| ignite | path, U.defaultWorkDirectory(), fileName |
| jedit | path, dir, directory |
| jenkins | System.getProperty("user.home"), war |
| jetty | file.getParent() |
| jhotdraw | prefs.get("projectFile", home) |
| jitsi | path, localPath |
| jmeter | filename, path, file |
| kafka | storeDirectoryPath, argument |
| libgdx | name, sourcePath, imagePath.replace('\\','/') |
| netty | getClass().getResource("test.crt").getFile() |
| plantuml | filename, newName |
| selenium | System.getProperty("java.io.tmpdir"), logName |
| tomcat | pathname, path, docBase |
| zookeeper | path, KerberosTestUtils.getKeytabFile() |

Top expressions used as URL arguments, by project:

| Project | Top Expressions |
|---|---|
| alluxio | journalDirectory, folder, inputDir |
| arduino | contribution.getUrl(), packageIndexURLString |
| binnavi | url, urlString |
| ghidra | ref, getAbsolutePath(), url.toExternalForm() |
| grpc | target, TARGET, oobTarget |
| gson | nextString, urlValue, uriValue |
| hadoop | uri, url, s |
| ignite | GridTestProperties.getProperty("p2p.uri.cls") |
| jedit | path, str, fileIcon |
| jenkins | url, site.getData().core.url, plugin.url |
| jetty | uri, inputUrl.toString(), s |
| jitsi | url, imagePath, sourceString |
| jmeter | url, LOCAL_HOST, requestPath |
| kafka | config.getString(METRICS_URL_CONFIG) |
| libgdx | url, URI, httpRequest.getUrl()+queryString |
| netty | URL, request.uri(), server |
| selenium | url, baseUrl, (String)raw.get("uri") |
| tomcat | url, location, path |
| websocket | uriField.getText(), uriinput.getText() |
| zookeeper | urlStr |

Top expressions used as XCOORD arguments, by project:

| Project | Top Expressions |
|---|---|
| arduino | noLeft, cancelLeft |
| binnavi | x, m_x |
| gephi | currentMouseX, x, bounds.x |
| ghidra | x, center.x+deltaX, filterPanelBounds.x |
| jedit | x, event.getX(), leftButtonWidth+leftWidth |
| jhotdraw | evt.getX(), x, e.getX() |
| jitsi | x, button.getX(), dx |
| jmeter | graphPanel.getLocation().x, cellRect.x, x |
| libgdx | upButtonX, getWidth()-buttonSize.width-5, x |
| plantuml | e.getX() |

Top expressions used as WIDTH arguments, by project:

| Project | Top Expressions |
|---|---|
| arduino | width, imageW, Preferences.BUTTON_WIDTH |
| binnavi | COLORPANEL_WIDTH, TEXTFIELD_WIDTH, width |
| gephi | w, constraintWidth, DEPTH |
| ghidra | width, center.width, filterPanelBounds.width |
| jedit | width, buttonSize.width, colWidth |
| jhotdraw | frameWidth, r.width, bounds.width |
| jitsi | MAX_MSG_PANE_WIDTH, WIDTH, width |
| jmeter | graphPanel.width |
| libgdx | width, buttonSize.width |
| plantuml | newWidth |
| tomcat | WIDTH |
mLocalUfsPath + ufsBase
selectedFile.getAbsolutePath() + PREFERENCES_FILE_EXTENSION
dir.getPath() + DIR_FAILURE_SUFFIX
U.defaultWorkDirectory() + separatorChar + DEFAULT_TARGET_FOLDER + separatorChar
url.toExternalForm().substring(GhidraURL.PROTOCOL.length() + 1)
str + KMSRESTConstants.SERVICE_VERSION + "/"
newOrigin(getScheme(),getHost(),getPort()).asString() + path
base + configFile
center.x + center.width
leftButtonWidth + leftWidth
evt.getX() - getInsets().left
prefs.getInt(name+".x", 0)
Math.max(contentWidth, menuWidth) + insets.left + insets.right
TITLE_X_OFFSET + titlePreferredSize.width
width + insets.left + insets.right + 2
(int)(bounds.getWidth() * percent)
(n: number of terms)
We tried to infer the c-types from the surface clues of expressions.
Basic strategy:
Decision tree-based machine learning is used because
| C-Type | Top words (# Projects) |
|---|---|
| PATH | get (21), path (21), file (20) |
| URL | url (19), get (18), string (18) |
| SQL | get (6), query (5), create (3) |
| HOST | host (21), get (17), address (17) |
| PORT | port (22), get (18), local (10) |
| XCOORD | width (9), x (9), get (9) |
| YCOORD | height (9), y (9), get (8) |
| WIDTH | width (13), get (11), size (10) |
| HEIGHT | height (12), get (11), size (10) |
| YEAR | year (4), get (2), int (2) |
| MONTH | january (3), month (3), december (3) |
| DAY | day (3), int (2), parse (2) |
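Word lists like the ones above presuppose some way of breaking identifiers into lowercase words. The slides do not say how this was done, so the following is only a plausible sketch: it assumes camelCase and underscore boundaries, and the class name `IdentifierWords` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class IdentifierWords {
    // Splits an expression into lowercase words at camelCase boundaries
    // ("getPath" -> get, path), before the last capital of an acronym
    // ("URLString" -> url, string), and at any non-alphanumeric character.
    static List<String> split(String expression) {
        String[] parts = expression.split(
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|[^A-Za-z0-9]+");
        List<String> words = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                words.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }
}
```

With this splitting, `getHostName()` becomes `get`, `host`, `name`, which is consistent with `host` and `get` ranking high for the HOST c-type.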
We used a dataflow diagram to extract features from expressions:
new File(config.getPath(i))
| Expression | Dependency |
|---|---|
| # (constant) | # |
| A (variable access) | A |
| A.B (field access) | A → B |
| B(A) (method call) | A → B() |
| A.B() (instance method call) | A → B() |
| op A (applying a unary operator) | A → op |
| A op B (applying a binary operator) | A → op, B → op |
| B = A (assignment) | A → B |
We extract the "primary identifier" and the "secondary identifier(s)" from the above diagram:
Primary identifier: getPath()
Secondary identifiers: config, i
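The extraction for this example can be sketched as a walk over the dependency edges. The slides do not give the exact rule, so this sketch assumes that the primary identifier is the node flowing directly into the API call and that the secondary identifiers are the nodes flowing directly into the primary. The class `IdentifierExtraction` and the node label `new File()` are hypothetical. It needs Java 9 or later for `List.of`.

```java
import java.util.ArrayList;
import java.util.List;

class IdentifierExtraction {
    // One dependency edge "from -> to" of the dataflow diagram.
    static final class Edge {
        final String from;
        final String to;
        Edge(String from, String to) { this.from = from; this.to = to; }
    }

    // Assumed rule: the primary identifier is the first node feeding the API call.
    static String primary(List<Edge> edges, String apiCall) {
        for (Edge e : edges) {
            if (e.to.equals(apiCall)) return e.from;
        }
        return null; // no argument flows into the call
    }

    // Assumed rule: secondary identifiers are the nodes feeding the primary directly.
    static List<String> secondary(List<Edge> edges, String primary) {
        List<String> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.to.equals(primary)) result.add(e.from);
        }
        return result;
    }

    // Edges for new File(config.getPath(i)), following the dependency table.
    static List<Edge> example() {
        return List.of(
            new Edge("config", "getPath()"),     // A.B(): A -> B()
            new Edge("i", "getPath()"),          // B(A):  A -> B()
            new Edge("getPath()", "new File()")  // argument flows into the API call
        );
    }
}
```

Under these assumptions, `primary(example(), "new File()")` gives `getPath()` and `secondary(example(), "getPath()")` gives `config` and `i`, matching the identifiers listed above.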
We tested the obtained decision tree classifier with leave-one-project-out cross validation:
| C-Type | Precision | Recall | F-score |
|---|---|---|---|
| PATH | 68.9% | 91.8% | 78.8% |
| URL | 61.3% | 53.0% | 56.8% |
| SQL | 70.4% | 80.6% | 75.2% |
| HOST | 70.0% | 73.8% | 71.8% |
| PORT | 84.6% | 87.5% | 86.0% |
| XCOORD | 95.7% | 82.1% | 88.3% |
| YCOORD | 97.5% | 79.4% | 87.5% |
| WIDTH | 92.0% | 92.5% | 92.2% |
| HEIGHT | 90.4% | 93.4% | 91.9% |
| YEAR | 100.0% | 83.7% | 91.1% |
| MONTH | 100.0% | 77.0% | 87.0% |
| DAY | 100.0% | 61.1% | 75.9% |
| Average | 85.9% | 79.6% | 82.7% |
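Leave-one-project-out cross validation holds out all samples of one project for testing and trains on the samples of every other project, once per project. The sketch below shows only this splitting step. The class `LeaveOneProjectOut`, the `Fold` type, and the generic sample type `S` are illustrative assumptions, and the decision tree training itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LeaveOneProjectOut {
    // One fold: train on every project except heldOutProject, test on it.
    static final class Fold<S> {
        final String heldOutProject;
        final List<S> train;
        final List<S> test;
        Fold(String heldOutProject, List<S> train, List<S> test) {
            this.heldOutProject = heldOutProject;
            this.train = train;
            this.test = test;
        }
    }

    // Builds one fold per project, in the iteration order of the map.
    static <S> List<Fold<S>> folds(Map<String, List<S>> samplesByProject) {
        List<Fold<S>> result = new ArrayList<>();
        for (String heldOut : samplesByProject.keySet()) {
            List<S> train = new ArrayList<>();
            for (Map.Entry<String, List<S>> entry : samplesByProject.entrySet()) {
                if (!entry.getKey().equals(heldOut)) {
                    train.addAll(entry.getValue());
                }
            }
            result.add(new Fold<>(heldOut, train, samplesByProject.get(heldOut)));
        }
        return result;
    }
}
```

Splitting by project rather than by random sample keeps expressions from the same code base out of both the training and the test set, so the scores reflect how well the classifier transfers to an unseen project.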
The URL c-type was not recognized very well because URL expressions often include HOST, PORT, or PATH expressions, which confused the classifier:
"https://"+getHostName()+":"+getPort()+"/"+getPath()
In the future, we could use the obtained classifier to infer c-types in other parts of the code.
Note to the listeners of the previous presentation: This work is related to the previous presentation (Use-Flow Graph Analysis) only in that consistent variable naming would help the accuracy of c-type identification. Also, both works use the same tooling (dataflow graph). Other than that, the two works address different problems.