Compute Source Code Similarity

I have recently been trying to analyze source code with various methods. In this post, I will show you one result: computing source code similarity based on the Jaccard similarity coefficient.

The Jaccard similarity coefficient is a common and quick technique for computing the similarity between two documents. You can find more details under Jaccard index or MinHash.

My similarity computation strategy is as follows:

1. Calculate a hash value for each line of each file.
2. For each file, create a set containing the hashes calculated in the previous step.
3. Compute the Jaccard similarity coefficient for all combinations of the files.

Source Code

Here are the code snippets for computing source code similarity based on the Jaccard similarity coefficient.

Jaccard Similarity

package com.dukesoftware.utils.text;

import java.util.HashSet;
import java.util.Set;

/**
 * Jaccard similarity coefficient http://en.wikipedia.org/wiki/MinHash
 */
public class JaccardSimlarity<T> {
    priv...
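Since the listing above is cut off, here is a minimal, self-contained sketch of the three-step strategy described earlier. It is not the original JaccardSimlarity class: the class name LineJaccardSketch and its methods are hypothetical, and it makes a few assumptions of its own. Lines are trimmed and blank lines skipped before hashing, String.hashCode() stands in for the line hash, and two empty files are treated as identical (similarity 1.0). For two sets A and B, the coefficient is the size of their intersection divided by the size of their union.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the author's JaccardSimlarity class.
class LineJaccardSketch {

    // Steps 1 and 2: hash every line and collect the hashes into a set.
    // Assumption: lines are trimmed and blank lines are ignored.
    static Set<Integer> lineHashes(List<String> lines) {
        Set<Integer> hashes = new HashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                hashes.add(trimmed.hashCode());
            }
        }
        return hashes;
    }

    // Step 3: |A intersect B| / |A union B|.
    // Assumption: two empty sets are defined to have similarity 1.0.
    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Compares every unordered pair of files; keys look like "A vs B".
    static Map<String, Double> compareAll(Map<String, List<String>> files) {
        List<String> names = new ArrayList<>(files.keySet());
        Map<String, Set<Integer>> hashed = new LinkedHashMap<>();
        for (String name : names) {
            hashed.put(name, lineHashes(files.get(name)));
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                String left = names.get(i);
                String right = names.get(j);
                scores.put(left + " vs " + right,
                        jaccard(hashed.get(left), hashed.get(right)));
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for file contents, to keep the sketch runnable.
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("A.java", Arrays.asList("int x = 1;", "int y = 2;", "return x + y;"));
        files.put("B.java", Arrays.asList("int x = 1;", "int z = 3;", "return x + z;"));
        compareAll(files).forEach((pair, score) ->
                System.out.println(pair + ": " + score));
    }
}
```

Because String.hashCode() is only 32 bits wide, two different lines can occasionally share a hash and inflate the score slightly; a wider hash would reduce that risk at the cost of more memory per line.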
I write articles mainly about IT-related technology and programming. I also like hardware, so I post handy everyday tips as well.