Recently I came across a question online: ‘How do I do a Fuzzy Match of Company Names?’
It reminded me of the days before Match2Lists, when we used to suffer with exactly this problem.
Having spent years developing Business Intelligence solutions for multi-national ITC enterprises, I’ve been constantly faced with the task of linking together company information from disparate datasets so as to provide a complete 360 degree view of a customer.
This often involved integrating the Financial Sales History from Oracle or SAP with the CRM data from Siebel or Salesforce, and then matching the result to information from external Data Providers like Dun & Bradstreet. That combined view enabled us to build buying propensity models and provide insightful analytics to business leaders.
Why is Matching Company Names such a Big Issue?
Well, let’s start by determining what is considered successful.
Is it satisfactory to say you matched 80% of the records? What if the 20% that remain unmatched are your biggest customers and responsible for 80% of your revenue? How do you assess the importance of the unmatched data? Does the unmatched data contain strategically important Customers or Prospects?
By matching the data with external data sources you can bring in information to help you determine the importance of these customers or prospects. Looking at Employee Numbers, Industries and Subsidiary or Parent Company Hierarchies can help you to prioritise your efforts.
How do we Match the Company Names?
Matching data that is exactly the same is easy, but what about the non-exact matches? We can use Fuzzy Logic for those, right?
In the forum discussion I came across, people were debating the different types of algorithms and which was best to use, including Soundex, MetaPhone and Levenshtein. It was a pretty typical discussion of the kind I’ve come across countless times. The more important issue, and one that is rarely if ever discussed, is how this logic is to be applied, regardless of which specific algorithm or algorithms are chosen.
Let me demonstrate what I mean. Here’s an example of Soundex applied to some Big Company Names.
| Company Name | Soundex |
|---|---|
| Bank of Canada | B521 |
| Canada, Bank Of | C531 |
| Bank of India | B521 |
| Bank of Ireland | B521 |
| Bank of England | B521 |
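To see where those codes come from, here is a minimal sketch of classic (American) Soundex in Python. It assumes the whole company name is encoded as a single string with spaces and punctuation ignored, which is what reproduces the values in the table; the function name `soundex` is just for illustration and other Soundex variants may differ in detail.

```python
# Letter groups used by classic Soundex; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for the whole string."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
            if len(result) == 4:
                break
        previous = code  # a vowel resets the "same code" check
    return result.ljust(4, "0")


print(soundex("Bank of Canada"))   # B521
print(soundex("Canada, Bank Of"))  # C531
print(soundex("Bank of Ireland"))  # B521
```

Because only the first letter and the next three coded consonants survive, everything after ‘Bank of’ is thrown away, which is exactly why the different banks collide.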
OK, before we start deriding Soundex, and I know that it’s a much-maligned function, let’s take the same Company Names and look at MetaPhone. The same arguments apply to MetaPhone derivatives such as Double MetaPhone.
| Company Name | MetaPhone |
|---|---|
| Bank of Canada | BNKFKNT |
| Canada, Bank Of | KNTBNKF |
| Bank of India | BNKFNT |
| Bank of Ireland | BNKFRLNT |
| Bank of England | BNKFNKLNT |
So you can see that with Soundex, Bank of India, Bank of Ireland and Bank of England all receive the same code as Bank of Canada (B521), so we would have got false positive matches, which is not the case with MetaPhone. But the MetaPhone algorithm didn’t suggest that ‘Samsung Electronics’ and ‘Samsung Elec.’ could be matched, and it didn’t match ‘Bank of Canada’ with ‘Canada, Bank Of’.
Ok, what about Levenshtein? Well, in case you’re not familiar with it, the Levenshtein function returns the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into another. The lower the number, the higher the probability that the data matches.
- Levenshtein(‘Bank of Canada’, ‘Canada, Bank Of’) returns a value of 10 (poor match)
- Levenshtein(‘Bank of India’, ‘Bank of Ireland’) returns a value of 5 (poor match)
- Levenshtein(‘Samsung Electronics’, ‘Samsung Elec.’) returns a value of 7 (poor match)
- Levenshtein(‘Visa’, ‘Vista’) returns a value of 1 (potential match)
- Levenshtein(‘Visa’, ‘Visa Card Services’) returns a value of 14 (poor match)
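For anyone who wants to try this, the following is a minimal sketch of the standard dynamic-programming Levenshtein distance in Python. The function name `levenshtein` is illustrative, the comparison is case-sensitive, and database or library implementations may differ in details such as case handling.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (case-sensitive)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Visa", "Vista"))               # 1
print(levenshtein("Visa", "Visa Card Services"))  # 14
```

The raw distance rewards strings of similar length and punishes extra words, so the unrelated ‘Vista’ looks closer to ‘Visa’ than ‘Visa Card Services’ does.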
In practice none of these functions can be used in this way without serious scrutiny of the results, and I found that using matching software that was built on this type of logic produced far too many potential candidates and missed lots of what we would consider perfect matches.
So, What’s the Answer?
Our approach with Match2Lists includes the following:
- Data Standardisation
- Probabilistic Logic
- Fuzzy Logic
- Extensive Knowledge Base
- Ability to Learn from Experience
- Leveraging of Corroborative Information
- Iterative Approach to Matching
- Powerful Visualisation
Data Standardisation helps address issues with common abbreviations such as Ltd and Limited, Corp and Corporation, Inc and Incorporated. Unlike some matching systems, we do not advocate removing the legal entity information, as we would rather match ‘Siemens AG’ exactly where available than settle for ‘Siemens Inc’ or ‘Siemens Corporation’; we use Probabilistic Logic to manage this.
With Probabilistic Logic we examine the Company Name and determine which elements are of most relevance for matching, and prioritise these. For example, in ‘The Procter and Gamble Company’ we would determine ‘Procter’ and ‘Gamble’ to be more important keywords in a matching context, especially when used in conjunction with each other.
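As a rough illustration of the standardisation idea (not the Match2Lists implementation), the sketch below maps common legal-form variants to one canonical token while keeping the legal form in the name. The `LEGAL_FORMS` table and the `standardise` function are hypothetical names, and a real table would be far larger.

```python
import re

# Hypothetical, deliberately tiny mapping of legal-form variants
# to a single canonical token. The legal form is kept, not removed.
LEGAL_FORMS = {
    "LIMITED": "LTD", "LTD": "LTD",
    "CORPORATION": "CORP", "CORP": "CORP",
    "INCORPORATED": "INC", "INC": "INC",
}


def standardise(name: str) -> str:
    """Upper-case the name, drop punctuation and canonicalise legal forms."""
    tokens = re.findall(r"[A-Z0-9&]+", name.upper())
    return " ".join(LEGAL_FORMS.get(token, token) for token in tokens)


print(standardise("Acme, Ltd."))           # ACME LTD
print(standardise("ACME Limited"))         # ACME LTD
print(standardise("Siemens Corporation"))  # SIEMENS CORP
```

Because the legal form survives standardisation, ‘Siemens AG’ and ‘Siemens Inc’ still produce different strings, so a later step can prefer the exact legal entity when one is available.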
By combining Fuzzy Logic with Probabilistic Logic, we are better able to ascertain the probability of the data matching. For example, ‘Proctor & Gamble’ would be seen as a very probable match for ‘The Procter and Gamble Company’ even with the misspelling of ‘Proctor’. Fuzzy Logic is very powerful when used in conjunction with Probabilistic Logic, because the probabilistic weighting substantially limits the number of False Positive matches that Fuzzy Logic on its own is prone to produce.
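One way to picture the combination is a token-level score in which fuzzy comparison decides whether two words are the same, and a weight decides how much each word matters. The sketch below is only an assumption-laden illustration, not our production logic: the `LOW_VALUE` word list, the 0.1 weight and the 0.8 token threshold are all made-up values, and the fuzzy comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of words that carry little matching value.
LOW_VALUE = {"THE", "AND", "OF", "COMPANY", "CO", "LTD", "INC", "CORP"}
TOKEN_THRESHOLD = 0.8  # assumed cut-off for "these two words are the same"


def tokens(name: str) -> list:
    return re.findall(r"[A-Z0-9]+", name.upper())


def weight(token: str) -> float:
    return 0.1 if token in LOW_VALUE else 1.0


def best_similarity(token: str, candidates: list) -> float:
    """Fuzzy part: best similarity of one word against the other name's words."""
    best = max(SequenceMatcher(None, token, c).ratio() for c in candidates)
    return best if best >= TOKEN_THRESHOLD else 0.0


def coverage(source: list, target: list) -> float:
    """Weighted part: share of the source's weight that is found in the target."""
    total = sum(weight(t) for t in source)
    found = sum(weight(t) * best_similarity(t, target) for t in source)
    return found / total


def match_score(left: str, right: str) -> float:
    left_tokens, right_tokens = tokens(left), tokens(right)
    if not left_tokens or not right_tokens:
        return 0.0
    # Take the weaker direction so a missing keyword on either side hurts.
    return min(coverage(left_tokens, right_tokens),
               coverage(right_tokens, left_tokens))


print(match_score("The Procter and Gamble Company", "Proctor & Gamble"))
print(match_score("Bank of India", "Bank of Ireland"))
```

With these assumed settings the first pair scores well above the second, because the misspelt ‘Proctor’ still counts as the keyword ‘Procter’, while ‘India’ and ‘Ireland’ are too different to be treated as the same word.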
Extensive Knowledge Base
With our experience of B2B Data Matching over many years, we have built an extensive library of knowledge for common acronyms used for various large businesses. Some examples include ‘GSK’ = ‘GlaxoSmithKline’, ‘BBC’ = ‘British Broadcasting Corporation’, ‘GE’ = ‘General Electric’, and many more.
For a large corporate business this knowledge is a must, as it’s not acceptable to leave ‘HP’ unmatched because we had ‘Hewlett Packard’ listed instead.
This is an ever-growing library, and it covers both international and national companies, as acronyms can mean different things in different countries.
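A knowledge base of this kind can be pictured as a lookup that is consulted before any fuzzy comparison takes place. The sketch below is a simplified assumption of how such a lookup might be structured, with a country-specific table checked ahead of a global one. The names `GLOBAL_ALIASES`, `COUNTRY_ALIASES` and `expand_alias` are hypothetical, and the ‘ABC’ entries are added here purely to illustrate an acronym that differs by country.

```python
# Global acronyms, taken from the examples in this article.
GLOBAL_ALIASES = {
    "GSK": "GlaxoSmithKline",
    "GE": "General Electric",
    "HP": "Hewlett Packard",
}

# Illustrative country-specific acronyms: the same letters, different companies.
COUNTRY_ALIASES = {
    ("ABC", "US"): "American Broadcasting Company",
    ("ABC", "AU"): "Australian Broadcasting Corporation",
}


def expand_alias(name: str, country=None) -> str:
    """Return the full company name for a known acronym, else the input."""
    key = name.strip().upper()
    if country is not None:
        national = COUNTRY_ALIASES.get((key, country.upper()))
        if national is not None:
            return national
    return GLOBAL_ALIASES.get(key, name)


print(expand_alias("HP"))                 # Hewlett Packard
print(expand_alias("ABC", country="AU"))  # Australian Broadcasting Corporation
print(expand_alias("Unknown Ltd"))        # Unknown Ltd
```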
Leveraging of Corroborative Information
By using other data such as address information, telephone numbers, Longitude/Latitude coordinates, City etc. we can find potential matches where the company name is very different but where a human being can use their own common sense or local knowledge to confirm that this is indeed the same business.
By way of an example, I have recently been matching inventories of Hotel names and in some instances the Hotel name would be the parent chain name such as ‘Best Western’ in one list, and a completely different name in the other list.
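To make the idea concrete, here is a small sketch of how corroborative fields could flag a candidate pair even when the names have nothing in common. Everything in it is an assumption for illustration: the `Record` fields, the 100 metre radius, the example hotel ‘Riverside Lodge Hotel’ and its details are invented, and the distance uses the standard haversine formula.

```python
import math
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    name: str
    phone: Optional[str] = None
    lat: Optional[float] = None
    lon: Optional[float] = None


def digits(phone: Optional[str]) -> str:
    """Strip everything except digits so formatting differences vanish."""
    return re.sub(r"\D", "", phone or "")


def distance_km(a: Record, b: Record) -> float:
    """Great-circle distance between two records (haversine formula)."""
    earth_radius_km = 6371.0
    lat_a, lat_b = math.radians(a.lat), math.radians(b.lat)
    d_lat = lat_b - lat_a
    d_lon = math.radians(b.lon - a.lon)
    h = (math.sin(d_lat / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(d_lon / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def corroborating_evidence(a: Record, b: Record, max_km: float = 0.1) -> list:
    """List the non-name fields that suggest a and b are the same business."""
    evidence = []
    if digits(a.phone) and digits(a.phone) == digits(b.phone):
        evidence.append("phone")
    has_coordinates = None not in (a.lat, a.lon, b.lat, b.lon)
    if has_coordinates and distance_km(a, b) <= max_km:
        evidence.append("location")
    return evidence


chain = Record("Best Western", "(020) 7946-0000", 51.5000, -0.1200)
local = Record("Riverside Lodge Hotel", "020 7946 0000", 51.5003, -0.1201)
print(corroborating_evidence(chain, local))  # ['phone', 'location']
```

A pair flagged this way is still only a candidate; the point is to put it in front of a person who can confirm it, rather than to approve it automatically.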
Iterative Approach to Matching
Most matching applications run a single pass on the data using one set of criteria, and then output the results for scrutiny. We found this resulted in many good matches being overlooked, and in some cases inferior matches being produced while better matches were overlooked.
By implementing functionality to dynamically change the matching criteria, adjust the importance of the corroborative information and continually append the results to the same project, we were able to find more matches and ensure we approved the best possible matches.
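The multi-pass idea can be sketched in a few lines. This is an illustrative simplification rather than our actual engine: the strictest rule runs first, approved records are removed from the pool, and each looser pass only sees what is left, appending to the same result list. The `exact` and `fuzzy` rules, the 0.85 threshold and the function name `iterative_match` are assumptions.

```python
from difflib import SequenceMatcher


def exact(a: str, b: str) -> bool:
    return a.lower() == b.lower()


def fuzzy(threshold: float):
    """Build a rule that accepts pairs at or above the given similarity."""
    def rule(a: str, b: str) -> bool:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    return rule


def iterative_match(left, right, passes):
    """Run each (label, rule) pass over the records still unmatched."""
    results = []
    remaining_left, remaining_right = list(left), list(right)
    for label, rule in passes:
        for a in list(remaining_left):
            hit = next((b for b in remaining_right if rule(a, b)), None)
            if hit is not None:
                results.append((a, hit, label))
                remaining_left.remove(a)
                remaining_right.remove(hit)
    return results, remaining_left


ours = ["Visa", "Procter & Gamble"]
theirs = ["Vista", "Visa", "Proctor & Gamble"]
matches, unmatched = iterative_match(
    ours, theirs, [("exact", exact), ("fuzzy", fuzzy(0.85))])
print(matches)
# [('Visa', 'Visa', 'exact'), ('Procter & Gamble', 'Proctor & Gamble', 'fuzzy')]
print(unmatched)  # []
```

Had only the fuzzy pass been run, ‘Visa’ would have been paired with ‘Vista’, the first candidate above the threshold, and the exact match would have been overlooked.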
Powerful Visualisation
Lastly, and a real game changer for us, is the implementation of a powerful User Interface to visualise the match candidates. By being able to visually scrutinise the results, we were able to determine the best criteria to use and to automatically approve many more results than before. This Visualisation meant that we achieved significant reductions in the cost of our matching projects and were able to deliver the results to our customers in record-breaking time.
Get in touch if you would like to learn more about our approach to B2B matching, or if you have any questions.