Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list.
I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another.
Let's say that I define "suspiciously close to another email address" as two strings having a Levenshtein score less than N.
Is there a more efficient way to find pairs of strings whose score is lower than this threshold besides comparing every possible string to every other possible string in the list? In other words, can this type of problem be solved quicker than O(n^2)
?
Is Levenshtein score a poor choice of algorithms for this problem?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…