This is a continuation of the previous blog post on fuzzy search. We use fuzzy matching algorithms in fuzzy search to come up with the search results. The strength of a fuzzy search algorithm heavily depends on the strength of the fuzzy matching algorithm that is being used. The concept of matching refers to an input being matched to a set of entries, or records, in your database to come up with the best possible match. We encounter this scenario very frequently in our everyday lives. Whenever you are looking up a word in the dictionary or when somebody is looking up your account during a customer service call, some form of matching is being used to get the answers. So how exactly does fuzzy matching work? What’s the big deal here?
What is the challenge?
Before we go into fuzzy matching, we need to understand what is the problem that needs to be solved. There is a concept called record linkage, which basically involves searching files for records that belong to the same individual. This is frequently used when we talk about database and retrieval. For example, we might be conducting a study about the ages of people to determine who is eligible to vote. This is where we use record linkage of our data set with age data set to determine who is eligible to vote. When we talk about record linkage, there are two ways to do it: deterministic and probabilistic.
Deterministic record linkage is where we look for exact agreement on one or more matching variables between files. For example, we might simply use an ID number common to two files. However, coding errors of the ID number on one file would mean that some true matches will be missed. Deterministic record linkage is a straightforward approach which does exactly what it’s told. Look up something and tell us if it’s found! Probabilistic record linkage, on the other hand, takes a lot of different factors into account before returning search results. This is called fuzzy matching.
Say hello to fuzzy matching
Probabilistic record linkage, also called fuzzy matching, takes a different approach to the record linkage problem. It takes a wider range of potential identifiers into accounts, and computes weights for each of them based on its estimated ability to correctly identify a match or a non-match. To understand this, let’s say you are looking for a person in the database. All you have if the person’s last name, gender and city of residence. As we can see here, when you are searching for someone, the person’s last name would be much more relevant than the gender. So fuzzy matching algorithm would allocate higher weight to this last name identifier and less weight to the gender identifier.
How does it work?
Fuzzy matching uses these weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below another threshold are considered to be non-matches; pairs that fall between these two thresholds are considered to be “possible matches” and can be dealt with accordingly (e.g., human reviewed, linked, or not linked, depending on the requirements). This is the “fuzzy” area. The way in which we deal with this heavily influences the quality of the overall system. Whereas deterministic record linkage requires a series of potentially complex rules to be programmed ahead of time, probabilistic record linkage methods can be “trained” to perform well with much less human intervention.
Probabilistic record linkage uses information on a greater number of matching variables, and allows for the amount of information provided by any agreement on matching variables. For example, agreement on ID number is more suggestive of a match than is agreement on, say, gender. If two people have the same gender, it doesn’t tell you anything. If two people have the same ID number, there is a very high chance that it’s the same person. Also, agreements on rare values of a given matching variable (e.g. uncommon last names like Panzarella) are more suggestive than agreements on common values (e.g. a very common last name like Smith).
An example
At the heart of probabilistic record linkage are two sets of probabilities. Consider the matching variable month-of-birth. The probability of this variable agreeing purely by chance is about 1/12 = 0.083. This means that the probability that two people have the same birthday month is 0.083. This is the first probability. The second probability is the probability of agreement for a given matching variable when the comparison pair is a match. Ideally, this probability should be 1.0. But since all matching variables are prone to errors due to miscoding, this probability is less than 1.0. The value of this probability is estimated during the specification of the record linkage strategy based upon prior information and the proportion of agreements among the comparison pairs accepted as links. In this example, let’s assume that this probability is 0.93.
These two probabilities are then used to determine agreement weights. In our example, a comparison pair that agreed on month of birth would be assigned a weight of, let’s say, 2.71 and a comparison pair that disagreed on month of birth would be assigned a weight of −3.15. The setting of the two probabilities mentioned earlier and the corresponding weights is repeated for all matching variables, and possibly additionally for all values of each of the matching variables. The total weight for a given comparison pair is simply the sum of the agreement weights for each matching variable. The total weight will be a large positive number if most matching variables agree, or a large negative number if most matching variables disagree. You can then use the threshold to determine how to proceed.
————————————————————————————————-
Hi Prateek,
I’m blogging on a similar topic, at http://razvanml.github.io/reco-link/
You say if two people have the same gender is not helpful, which is right; however, if they have a difference gender this is a useful information, it means they cannot be the same person, otherwise you have stronger reasons to believe so.
My approach in linking records is more probabilistic, at the end I compute the probability of two records to match.
Thanks,
Razvan
thank you for sharing 🙂