Java: Compute Source Code Similarity Based on Jaccard Similarity Coefficient

Compute Source Code Similarity

I have recently been trying to analyze source code with various methods. In this post, I will show you one result: computing source code similarity based on the Jaccard Similarity Coefficient.

The Jaccard Similarity Coefficient is a common and quick technique for computing the similarity between two documents. For two sets, it is the size of their intersection divided by the size of their union, so it ranges from 0 (nothing in common) to 1 (identical sets). You can find more details under Jaccard index or MinHash.

My similarity computation strategy is as follows:

1. Calculate a hash value for each line of each file.
2. For each file, create a set containing the hashes calculated in the previous step.
3. Compute the Jaccard Similarity Coefficient for all combinations of the files.

Source Code

Here are the code snippets for computing source code similarity based on the Jaccard Similarity Coefficient.

Jaccard Similarity

package com.dukesoftware.utils.text;

import java.util.HashSet;
import java.util.Set;

/**
 * Jaccard similarity coefficient http://en.wikipedia.org/wiki/MinHash
 */
public class JaccardSimlarity<T> {
    priv