https://euske.github.io/
String username = getCurrentUserName();
String path = "/home/"+username+"/user.cfg";
// Unsafe path: extra check is needed!
File config = new File(path);
A better (but more cumbersome) version would be:
User user = getCurrentUser();
Path path = Paths.get(user.getHomeDirectory(), "user.cfg");
// Path is guaranteed to be safe.
File config = path.toFile();
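The "extra check" that the first snippet calls for could look like the minimal sketch below. The `SafeConfigPath` class and the fixed `/home` base directory are assumptions made for illustration; they are not part of the original code. The idea is to normalize the resulting path and reject any user name that would resolve outside its own home directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original example.
final class SafeConfigPath {
    // Assumed base directory that holds all home directories.
    private static final Path HOME_BASE = Paths.get("/home");

    /** Returns /home/<username>/user.cfg, or throws if the name is unsafe. */
    static Path resolve(String username) {
        if (username == null || username.isEmpty()) {
            throw new IllegalArgumentException("empty user name");
        }
        Path home = HOME_BASE.resolve(username).normalize();
        // The normalized home directory must be a direct child of the base
        // directory, and its last element must be the user name itself.
        if (!HOME_BASE.equals(home.getParent())
                || !username.equals(home.getFileName().toString())) {
            throw new IllegalArgumentException("unsafe user name: " + username);
        }
        return home.resolve("user.cfg");
    }
}
```

With this sketch, a name such as `alice` is accepted, while `../etc`, `a/b`, or `alice/../bob` are rejected. The check only exists because the user name arrives as a plain `String`, which is the situation the rest of this talk is about.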
String and int types, however, are still used for a wide variety of purposes.
(String and int) are used in various software projects.
When programmers make an API call, they are aware of the role of its arguments:
String x = "foo/bar.txt";
var f = new java.io.File(x); // x is a pathname
x refers to a String, but it's clearly a pathname.
We studied the Java Standard API and defined 12 commonly used c-types:
C-Type | Language Type | Description | # Methods |
---|---|---|---|
PATH | String | Path name | 14 |
URL | String | URL/URI | 4 |
SQL | String | SQL statement | 10 |
HOST | String | Host name | 17 |
PORT | int | Port number | 25 |
XCOORD | int | X coordinate (GUI) | 25 |
YCOORD | int | Y coordinate (GUI) | 25 |
WIDTH | int | width (GUI) | 24 |
HEIGHT | int | height (GUI) | 24 |
YEAR | int | year | 18 |
MONTH | int | month | 14 |
DAY | int | day | 18 |
Total | | | 218 |
Method examples:
new java.io.File(PATH)
new java.net.URI(URL)
java.sql.Statement.execute(SQL)
java.net.InetAddress.getByName(HOST)
new java.net.Socket(HOST, PORT)
new java.awt.Point(XCOORD, YCOORD)
new java.awt.Dimension(WIDTH, HEIGHT)
new java.util.Date(YEAR, MONTH, DAY)
java.util.Date.setYear(YEAR)
java.time.LocalDate.of(YEAR, MONTH, DAY)
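To make these roles concrete, here is a small sketch that calls a few of the methods listed above. The class name, variable names, and values are made up for illustration; the comments mark the c-type that each plain `String` or `int` plays.

```java
import java.awt.Point;
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;
import java.time.LocalDate;

// Illustrative only: names and values are assumptions, not taken from the study.
final class CTypeRoles {
    static String describe() {
        String configPath = "conf/app.cfg";           // PATH
        String siteUrl = "https://example.com/docs";  // URL
        int left = 10, top = 20;                      // XCOORD, YCOORD
        int year = 2021, month = 7, day = 15;         // YEAR, MONTH, DAY

        File file = new File(configPath);
        URI uri;
        try {
            uri = new URI(siteUrl);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
        Point origin = new Point(left, top);
        LocalDate date = LocalDate.of(year, month, day);

        return file.getName() + " " + uri.getHost() + " "
                + origin.x + "," + origin.y + " " + date;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The compiler sees only `String` and `int` here, so `new Point(top, left)` or `LocalDate.of(year, day, month)` would compile just as well. The c-type lives in the programmer's head and in the names.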
The criteria that we used are:
We did the following experiments:
zookeeper/.../TxnLogToolkit.java:
...
File file = new File(dir.getPath() + File.separator + Util.makeLogName(zxid)[PATH]);
jitsi/.../SIPCommSplitPaneDivider.java:
...
rightButton.setBounds((insets.left * 2) + leftSize.width + rightSize.width[XCOORD], y[YCOORD], rightSize.width[WIDTH], rightSize.height[HEIGHT]);
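The bracketed tags in the excerpts above can be read as the result of a lookup from a known API method to the c-types of its arguments. The following is a minimal sketch of that idea and not the tool used in the study: the `CTypeAnnotator` class, its string keys, and the three table entries are hypothetical, and the argument expressions are assumed to be already extracted as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical annotator: tags argument expressions with the c-types of a known method.
final class CTypeAnnotator {
    // Assumed lookup table; the keys are plain strings chosen for this sketch.
    private static final Map<String, List<String>> SIGNATURES = new HashMap<>();
    static {
        SIGNATURES.put("java.io.File.<init>", Arrays.asList("PATH"));
        SIGNATURES.put("java.net.Socket.<init>", Arrays.asList("HOST", "PORT"));
        SIGNATURES.put("java.awt.Component.setBounds",
                Arrays.asList("XCOORD", "YCOORD", "WIDTH", "HEIGHT"));
    }

    /** Appends "[CTYPE]" to each argument whose position has a known c-type. */
    static List<String> annotate(String method, List<String> args) {
        List<String> ctypes = SIGNATURES.get(method);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < args.size(); i++) {
            boolean known = ctypes != null && i < ctypes.size();
            out.add(known ? args.get(i) + "[" + ctypes.get(i) + "]" : args.get(i));
        }
        return out;
    }
}
```

For example, `annotate("java.awt.Component.setBounds", Arrays.asList("x", "y", "rightSize.width", "rightSize.height"))` yields `x[XCOORD]`, `y[YCOORD]`, `rightSize.width[WIDTH]`, `rightSize.height[HEIGHT]`. Arguments of methods that are not in the table are returned unchanged.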
Project | Description | LoC |
---|---|---|
hadoop 3.3.1 | distributed computation | 1,789k |
ghidra 10.0 | binary analyzer | 1,588k |
ignite 2.10.0 | distributed database | 1,165k |
jetty 11.0.5 | web container | 441k |
kafka 2.7.1 | stream processing | 384k |
tomcat 8.5.68 | web server | 349k |
jitsi 2.10 | video conference | 327k |
binnavi 6.1.0 | binary analyzer | 309k |
netty 4.1.65 | network library | 303k |
libgdx 1.10.0 | game framework | 272k |
alluxio 2.5.0-3 | data orchestration | 228k |
plantuml 1.2021.7 | UML generator | 210k |
grpc 1.38.1 | RPC framework | 195k |
jenkins 2.299 | automation | 177k |
jmeter 5.4.1 | network analyzer | 145k |
jedit 5.6.0 | text editor | 125k |
gephi 0.9.2 | graph visualizer | 120k |
zookeeper 3.7.0 | distributed computation | 114k |
selenium 3.141.59 | browser automation | 91k |
okhttp 4.9.1 | HTTP client | 36k |
jhotdraw 7.0.6 | graph drawing | 32k |
arduino 1.8.15 | development environment | 27k |
gson 2.8.7 | serialization framework | 25k |
websocket 1.5.2 | network framework | 15k |
picasso 2.8 | image processing | 9k |
jpacman | action game | 3k |
Total | | 8,480k |
The distribution of c-type expressions differs by application domain. The total number of c-type expressions is roughly proportional to the project size.
Project | Top PATH Expressions |
---|---|
alluxio | path, mLocalUfsPath+ufsBase, base |
arduino | path, PreferencesData.get("runtime.ide.path") |
binnavi | filename, directory, pathname |
gephi | System.getProperty("netbeans.user") |
ghidra | getTestDirectoryPath(), path, filename |
grpc | uri.getPath() |
hadoop | GenericTestUtils.getRandomizedTempPath() |
ignite | path, U.defaultWorkDirectory(), fileName |
jedit | path, dir, directory |
jenkins | System.getProperty("user.home"), war |
jetty | file.getParent() |
jhotdraw | prefs.get("projectFile", home) |
jitsi | path, localPath |
jmeter | filename, path, file |
kafka | storeDirectoryPath, argument |
libgdx | name, sourcePath, imagePath.replace('\\','/') |
netty | getClass().getResource("test.crt").getFile() |
plantuml | filename, newName |
selenium | System.getProperty("java.io.tmpdir"), logName |
tomcat | pathname, path, docBase |
zookeeper | path, KerberosTestUtils.getKeytabFile() |
Project | Top URL Expressions |
---|---|
alluxio | journalDirectory, folder, inputDir |
arduino | contribution.getUrl(), packageIndexURLString |
binnavi | url, urlString |
ghidra | ref, getAbsolutePath(), url.toExternalForm() |
grpc | target, TARGET, oobTarget |
gson | nextString, urlValue, uriValue |
hadoop | uri, url, s |
ignite | GridTestProperties.getProperty("p2p.uri.cls") |
jedit | path, str, fileIcon |
jenkins | url, site.getData().core.url, plugin.url |
jetty | uri, inputUrl.toString(), s |
jitsi | url, imagePath, sourceString |
jmeter | url, LOCAL_HOST, requestPath |
kafka | config.getString(METRICS_URL_CONFIG) |
libgdx | url, URI, httpRequest.getUrl()+queryString |
netty | URL, request.uri(), server |
selenium | url, baseUrl, (String)raw.get("uri") |
tomcat | url, location, path |
websocket | uriField.getText(), uriinput.getText() |
zookeeper | urlStr |
Project | Top XCOORD Expressions |
---|---|
arduino | noLeft, cancelLeft |
binnavi | x, m_x |
gephi | currentMouseX, x, bounds.x |
ghidra | x, center.x+deltaX, filterPanelBounds.x |
jedit | x, event.getX(), leftButtonWidth+leftWidth |
jhotdraw | evt.getX(), x, e.getX() |
jitsi | x, button.getX(), dx |
jmeter | graphPanel.getLocation().x, cellRect.x, x |
libgdx | upButtonX, getWidth()-buttonSize.width-5, x |
plantuml | e.getX() |
Project | Top WIDTH Expressions |
---|---|
arduino | width, imageW, Preferences.BUTTON_WIDTH |
binnavi | COLORPANEL_WIDTH, TEXTFIELD_WIDTH, width |
gephi | w, constraintWidth, DEPTH |
ghidra | width, center.width, filterPanelBounds.width |
jedit | width, buttonSize.width, colWidth |
jhotdraw | frameWidth, r.width, bounds.width |
jitsi | MAX_MSG_PANE_WIDTH, WIDTH, width |
jmeter | graphPanel.width |
libgdx | width, buttonSize.width |
plantuml | newWidth |
tomcat | WIDTH |
mLocalUfsPath + ufsBase
selectedFile.getAbsolutePath() + PREFERENCES_FILE_EXTENSION
dir.getPath() + DIR_FAILURE_SUFFIX
U.defaultWorkDirectory() + separatorChar + DEFAULT_TARGET_FOLDER + separatorChar
url.toExternalForm().substring(GhidraURL.PROTOCOL.length() + 1)
str + KMSRESTConstants.SERVICE_VERSION + "/"
newOrigin(getScheme(),getHost(),getPort()).asString() + path
base + configFile
center.x + center.width
leftButtonWidth + leftWidth
evt.getX() - getInsets().left
prefs.getInt(name+".x", 0)
Math.max(contentWidth, menuWidth) + insets.left + insets.right
TITLE_X_OFFSET + titlePreferredSize.width
width + insets.left + insets.right + 2
(int)(bounds.getWidth() * percent)
(n: number of terms)
We tried to infer c-types from the surface clues of expressions.
Basic strategy:
Decision tree-based machine learning is used because
C-Type | Top words (# Projects) |
---|---|
PATH | get (21), path (21), file (20) |
URL | url (19), get (18), string (18) |
SQL | get (6), query (5), create (3) |
HOST | host (21), get (17), address (17) |
PORT | port (22), get (18), local (10) |
XCOORD | width (9), x (9), get (9) |
YCOORD | height (9), y (9), get (8) |
WIDTH | width (13), get (11), size (10) |
HEIGHT | height (12), get (11), size (10) |
YEAR | year (4), get (2), int (2) |
MONTH | january (3), month (3), december (3) |
DAY | day (3), int (2), parse (2) |
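The words in this table come from identifiers such as `getPath` or `METRICS_URL_CONFIG`. The slides do not spell out how identifiers are broken into words, so the sketch below simply assumes the usual camelCase and underscore conventions; the `IdentifierWords` class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical splitter: breaks an identifier into lowercase words.
final class IdentifierWords {
    // Boundaries: underscores, a lowercase letter or digit followed by an
    // uppercase letter, and the end of an acronym (e.g. "URL|String").
    private static final String BOUNDARY =
            "_+|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static List<String> split(String identifier) {
        List<String> words = new ArrayList<>();
        for (String part : identifier.split(BOUNDARY)) {
            if (!part.isEmpty()) {
                words.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }
}
```

Under these assumptions, `split("getTestDirectoryPath")` gives `[get, test, directory, path]` and `split("METRICS_URL_CONFIG")` gives `[metrics, url, config]`, which is the kind of word list a classifier can use as features.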
We used a dataflow diagram to extract features from expressions:
new File(config.getPath(i))
Expression | Dependency |
---|---|
# (constant) | # |
A (variable access) | A |
A.B (field access) | A → B |
B(A) (method call) | A → B() |
A.B() (instance method call) | A → B() |
op A (applying a unary operator) | A → op |
A op B (applying a binary operator) | A → op, B → op |
B = A (assignment) | A → B |
We extract the "primary identifier" and "secondary identifier(s)" from the above diagram:
Primary identifier: getPath()
Secondary identifiers: config, i
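The dependency rules and the identifier extraction can be sketched together as follows. This is an illustration under assumptions, not the actual tooling: the `DepNode` class is hypothetical, edges are printed with `->`, and "secondary identifiers" are read here as the nodes that feed directly into the primary one, which matches the `getPath()` / `config, i` example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical dataflow node; the factory methods mirror rows of the dependency table. */
final class DepNode {
    final String label;
    final List<DepNode> inputs;

    private DepNode(String label, DepNode... inputs) {
        this.label = label;
        this.inputs = Arrays.asList(inputs);
    }

    static DepNode constant() { return new DepNode("#"); }
    static DepNode variable(String name) { return new DepNode(name); }
    static DepNode field(DepNode object, String name) { return new DepNode(name, object); }
    /** Method or constructor call; inputs are the receiver (if any) followed by the arguments. */
    static DepNode call(String method, DepNode... inputs) { return new DepNode(method + "()", inputs); }
    static DepNode operator(String symbol, DepNode... operands) { return new DepNode(symbol, operands); }
    static DepNode assign(String target, DepNode value) { return new DepNode(target, value); }

    /** All "source -> target" edges below this node, innermost first. */
    List<String> edges() {
        List<String> out = new ArrayList<>();
        for (DepNode in : inputs) {
            out.addAll(in.edges());
            out.add(in.label + " -> " + label);
        }
        return out;
    }

    /** Assumed reading of "secondary identifiers": labels feeding directly into this node. */
    List<String> secondaryIdentifiers() {
        List<String> out = new ArrayList<>();
        for (DepNode in : inputs) {
            out.add(in.label);
        }
        return out;
    }
}

final class DepNodeDemo {
    public static void main(String[] args) {
        // new File(config.getPath(i)): the argument passed to File() is the node of interest.
        DepNode arg = DepNode.call("getPath", DepNode.variable("config"), DepNode.variable("i"));
        DepNode file = DepNode.call("File", arg);
        System.out.println(file.edges());
        System.out.println("primary: " + arg.label);
        System.out.println("secondary: " + arg.secondaryIdentifiers());
    }
}
```

For `new File(config.getPath(i))` this prints the edges `config -> getPath()`, `i -> getPath()`, `getPath() -> File()`, then `getPath()` as the primary identifier and `[config, i]` as the secondary ones.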
We tested the obtained decision tree classifier with leave-one-project-out cross validation:
C-Type | Precision | Recall | F-score |
---|---|---|---|
PATH | 68.9% | 91.8% | 78.8% |
URL | 61.3% | 53.0% | 56.8% |
SQL | 70.4% | 80.6% | 75.2% |
HOST | 70.0% | 73.8% | 71.8% |
PORT | 84.6% | 87.5% | 86.0% |
XCOORD | 95.7% | 82.1% | 88.3% |
YCOORD | 97.5% | 79.4% | 87.5% |
WIDTH | 92.0% | 92.5% | 92.2% |
HEIGHT | 90.4% | 93.4% | 91.9% |
YEAR | 100.0% | 83.7% | 91.1% |
MONTH | 100.0% | 77.0% | 87.0% |
DAY | 100.0% | 61.1% | 75.9% |
Average | 85.9% | 79.6% | 82.7% |
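As a reminder of what leave-one-project-out means in practice, the sketch below builds the folds and computes an F-score from precision and recall. It shows the general scheme only; the `LeaveOneProjectOut` class and its `Fold` type are hypothetical, and the classifier itself is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for leave-one-project-out evaluation.
final class LeaveOneProjectOut {
    /** One fold: a held-out project, tested with a model trained on all other projects. */
    static final class Fold<T> {
        final String heldOut;
        final List<T> train;
        final List<T> test;

        Fold(String heldOut, List<T> train, List<T> test) {
            this.heldOut = heldOut;
            this.train = train;
            this.test = test;
        }
    }

    /** Creates one fold per project; each project serves as the test set exactly once. */
    static <T> List<Fold<T>> folds(Map<String, List<T>> samplesByProject) {
        List<Fold<T>> folds = new ArrayList<>();
        for (String heldOut : samplesByProject.keySet()) {
            List<T> train = new ArrayList<>();
            for (Map.Entry<String, List<T>> e : samplesByProject.entrySet()) {
                if (!e.getKey().equals(heldOut)) {
                    train.addAll(e.getValue());
                }
            }
            List<T> test = new ArrayList<>(samplesByProject.get(heldOut));
            folds.add(new Fold<>(heldOut, train, test));
        }
        return folds;
    }

    /** Harmonic mean of precision and recall; 0 when both are 0. */
    static double fScore(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

With the PORT row's precision and recall (84.6% and 87.5%), `fScore(0.846, 0.875)` comes out at about 0.860, in line with the table. Holding out whole projects keeps a project's own naming habits out of its training data.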
The URL c-type wasn't recognized very well because URL expressions often include HOST, PORT or PATH expressions, which confused the classifier:
"https://"+getHostName()+":"+getPort()+"/"+getPath()
In the future, we could use the obtained classifier to infer c-types in other parts of the code.
Note to the listeners of the previous presentation: This work is related to the previous presentation (Use-Flow Graph Analysis) only in that consistent variable naming would help the accuracy of c-type identification. Also, both works use the same tooling (dataflow graph). Other than that, the two works address different problems.