B2B Data Matching (Fuzzy Matching & Company Names)

August 17, 2012 | Data Matching

Recently I came across a question online: ‘How do I do a fuzzy match of company names?’

It reminded me of the days before Match2Lists, when we used to suffer with exactly this problem.

Having spent years developing Business Intelligence solutions for multi-national ITC enterprises, I’ve constantly faced the task of linking together company information from disparate datasets to provide a complete 360-degree view of a customer.

This typically involved integrating the financial sales history from Oracle or SAP with the CRM data from Siebel or Salesforce, and often matching the result to information from external data providers such as Dun and Bradstreet. That enabled us to build buying propensity models and provide insightful analytics to business leaders.

Why is Matching Company Names such a Big Issue?

Well, let’s start by determining what is considered successful.

Is it satisfactory to say you matched 80% of the records? What if the 20% that remain unmatched are your biggest customers and responsible for 80% of your revenue? How do you assess the importance of the unmatched data? Does the unmatched data contain strategically important Customers or Prospects?

By matching the data with external data sources you can bring in information to help you determine the importance of these customers or prospects. Looking at Employee Numbers, Industries and Subsidiary or Parent Company Hierarchies can help you to prioritise your efforts.

How do we Match the Company Names?

Matching data that is exactly the same is easy, but what about the non-exact matches? We can use Fuzzy Logic for those, right?

In the forum discussion I came across, people were debating which algorithm was best to use, including Soundex, MetaPhone and Levenshtein. It was a pretty typical discussion of the kind I’ve come across countless times. The more important issue, and one that is rarely if ever discussed, is how this logic is applied, regardless of which algorithm or algorithms are chosen.

Let me demonstrate what I mean. Here’s an example of Soundex applied to some Big Company Names.

| Company Name | Soundex |
|---|---|
| Bank of Canada | B521 |
| Canada, Bank Of | C531 |
| Bank of India | B521 |
| Bank of Ireland | B521 |
| Bank of England | B521 |
| Samsung Electronics | S524 |
| Samsung Elec. | S524 |
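The collisions between the bank names can be reproduced with a minimal sketch of the classic American Soundex rules. This is an assumption about the variant: the article does not say which implementation produced the codes, and other implementations may differ in detail. Here the whole string is encoded and spaces and punctuation are ignored.

```python
# Letter groups of classic American Soundex (assumed variant).
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(text):
    """Return the 4-character Soundex code of the letters in text."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if digit and digit != previous:
            code += digit
        # H and W do not separate repeated codes; vowels do.
        if ch not in "HW":
            previous = digit
        if len(code) == 4:
            break
    return code.ljust(4, "0")


for name in ("Bank of Canada", "Canada, Bank Of", "Bank of India",
             "Bank of Ireland", "Bank of England"):
    print(name, soundex(name))
```

Because the code is cut off after three digits, everything after ‘Bank of’ is ignored, which is why four different banks collapse into one code while the reordered ‘Canada, Bank Of’ does not.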

OK, before we start deriding Soundex (and I know that it’s a much maligned function), let’s take the same company names and look at MetaPhone. The same arguments apply to MetaPhone derivatives such as Double MetaPhone.

| Company Name | MetaPhone |
|---|---|
| Bank of Canada | BNKFKNT |
| Canada, Bank Of | KNTBNKF |
| Bank of India | BNKFNT |
| Bank of Ireland | BNKFRLNT |
| Bank of England | BNKFNKLNT |
| Samsung Electronics | SMSNJLKTRNK |
| Samsung Elec. | SMSNJLK |

So you can see that with Soundex we would have got false positive matches between Bank of Canada, Bank of India, Bank of Ireland and Bank of England, which all share the code B521; that is not the case with MetaPhone. But the MetaPhone algorithm didn’t suggest that ‘Samsung Electronics’ and ‘Samsung Elec.’ could be matched, and it didn’t match ‘Bank of Canada’ with ‘Canada, Bank Of’.

OK, what about Levenshtein? In case you’re not familiar with it, the Levenshtein function returns the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one string into another. The lower the number, the more likely it is that the data matches.

For example:

  • Levenshtein(‘Bank of Canada’, ‘Canada, Bank Of’) returns a value of 10 (poor match)
  • Levenshtein(‘Bank of India’, ‘Bank of Ireland’) returns a value of 5 (poor match)
  • Levenshtein(‘Samsung Electronics’, ‘Samsung Elec.’) returns a value of 7 (poor match)
  • Levenshtein(‘Visa’, ‘Vista’) returns a value of 1 (potential match)
  • Levenshtein(‘Visa’, ‘Visa Card Services’) returns a value of 14 (poor match)
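These values can be checked with a minimal sketch of the standard dynamic-programming Levenshtein distance. The function name and the case-sensitive comparison are assumptions for illustration; the article does not say which implementation was used.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


pairs = [
    ("Bank of Canada", "Canada, Bank Of"),
    ("Bank of India", "Bank of Ireland"),
    ("Samsung Electronics", "Samsung Elec."),
    ("Visa", "Vista"),
    ("Visa", "Visa Card Services"),
]
for left, right in pairs:
    print(left, "|", right, "|", levenshtein(left, right))
```

The raw distance takes no account of string length or word order, which is why a genuine abbreviation scores far worse than the unrelated pair ‘Visa’ and ‘Vista’.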

In practice none of these functions can be used in this way without serious scrutiny of the results, and I found that using matching software that was built on this type of logic produced far too many potential candidates and missed lots of what we would consider perfect matches.

So, What’s the Answer?

Our approach with Match2Lists includes the following:

  • Data Standardisation
  • Probabilistic Logic
  • Fuzzy Logic
  • Extensive Knowledge Base
  • Ability to Learn from Experience
  • Leveraging of Corroborative Information
  • Iterative Approach to Matching
  • Powerful Visualisation

Data Standardisation

Data Standardisation helps address issues with common abbreviations such as Ltd and Limited, Corp and Corporation, Inc and Incorporated. Unlike some matching systems, we do not advocate removing the legal entity information: where ‘Siemens AG’ is available, we would rather match it exactly than match ‘Siemens Inc’ or ‘Siemens Corporation’. We use Probabilistic Logic to manage this.
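As a rough illustration of standardising a name while keeping its legal form, here is a minimal sketch. The lookup table only covers the abbreviations mentioned above, and the `standardise` function and its output format are hypothetical, not the Match2Lists implementation.

```python
import re

# Canonical legal forms for the abbreviations mentioned in the text.
LEGAL_FORMS = {
    "ltd": "LIMITED", "limited": "LIMITED",
    "corp": "CORPORATION", "corporation": "CORPORATION",
    "inc": "INCORPORATED", "incorporated": "INCORPORATED",
    "ag": "AG",
}


def standardise(name):
    """Split a company name into (core name, canonical legal form).

    The legal form is kept rather than discarded, so that later steps
    can still prefer an exact legal-entity match when one exists.
    """
    tokens = re.findall(r"[a-z0-9&]+", name.lower())
    legal_form = None
    # Only a trailing token is treated as a legal form in this sketch.
    if tokens and tokens[-1] in LEGAL_FORMS:
        legal_form = LEGAL_FORMS[tokens.pop()]
    return " ".join(tokens), legal_form


print(standardise("Siemens Ag"))
print(standardise("Siemens Inc"))
print(standardise("Siemens Corp."))
```

With this shape, ‘Siemens Ag’ and ‘Siemens Inc’ share the same core name but remain distinguishable by legal form, so a scoring step can rank the exact entity above its siblings.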

Probabilistic Logic

With Probabilistic Logic we examine the Company Name, determine which elements are of most relevance for matching, and prioritise these. For example, in ‘The Procter and Gamble Company’ we would determine ‘Procter’ and ‘Gamble’ to be the more important keywords in a matching context, especially when used in conjunction with each other.
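The article does not describe how the keyword weights are derived. One common stand-in, shown here purely as an assumption, is to weight each word by how rare it is across the list being matched (inverse document frequency). The company list below is illustrative only.

```python
import math
from collections import Counter


def token_weights(names):
    """Weight each word by its rarity across the list of names.

    Words that appear in many names ('the', 'company') get low weights;
    words that appear in few names ('procter', 'gamble') get high weights.
    """
    document_frequency = Counter()
    for name in names:
        document_frequency.update(set(name.lower().split()))
    total = len(names)
    return {
        token: math.log((1 + total) / (1 + count)) + 1.0
        for token, count in document_frequency.items()
    }


# Illustrative list; a real project would use the full dataset.
names = [
    "The Procter and Gamble Company",
    "The Coca-Cola Company",
    "The Boeing Company",
    "Johnson and Johnson",
]
weights = token_weights(names)
for token in ("the", "company", "and", "procter", "gamble"):
    print(token, round(weights[token], 3))
```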

Fuzzy Logic

By combining fuzzy logic with probabilistic logic, we are better able to ascertain the probability of the data matching. For example, ‘Proctor & Gamble’ would be seen as a very probable match even with the misspelling of ‘Proctor’. Fuzzy Logic is very powerful when used in conjunction with Probabilistic Logic, because the probabilistic weighting substantially reduces the number of False Positive matches that Fuzzy Logic on its own is prone to produce.
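A minimal sketch of the combination, under stated assumptions: fuzzy comparison is done per word with Python's `difflib.SequenceMatcher`, and the keyword weights in `WEIGHTS` are hand-picked, hypothetical values rather than anything taken from Match2Lists.

```python
from difflib import SequenceMatcher

# Hypothetical weights: distinctive keywords count far more than filler words.
WEIGHTS = {"the": 0.1, "and": 0.1, "company": 0.2, "procter": 3.0, "gamble": 3.0}


def tokens(name):
    """Lower-case a name, treat '&' as 'and', and split into words."""
    return name.lower().replace("&", " and ").split()


def best_similarity(token, candidates):
    """Highest fuzzy similarity (0..1) between token and any candidate word."""
    return max((SequenceMatcher(None, token, c).ratio() for c in candidates),
               default=0.0)


def weighted_fuzzy_score(reference, candidate, weights, default_weight=1.0):
    """Weighted share of the reference name's words found in the candidate."""
    candidate_tokens = tokens(candidate)
    total = matched = 0.0
    for token in tokens(reference):
        weight = weights.get(token, default_weight)
        total += weight
        matched += weight * best_similarity(token, candidate_tokens)
    return matched / total if total else 0.0


reference = "The Procter and Gamble Company"
good = weighted_fuzzy_score(reference, "Proctor & Gamble", WEIGHTS)
weak = weighted_fuzzy_score(reference, "Gamble Insurance Company", WEIGHTS)
print(round(good, 3), round(weak, 3))
```

The misspelt ‘Proctor’ still scores highly because both heavily weighted keywords are found, whereas a candidate that shares only ‘Gamble’ and ‘Company’ is held back by the missing keyword.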

Extensive Knowledge Base

With our experience of B2B Data Matching over many years, we have built an extensive library of knowledge for common acronyms used for various large businesses. Some examples include ‘GSK’ = ‘GlaxoSmithKline’, ‘BBC’ = ‘British Broadcasting Corporation’, ‘GE’ = ‘General Electric’, and many more.

For a large corporate business this knowledge is a must, as it’s not acceptable to leave ‘HP’ unmatched because we had ‘Hewlett Packard’ listed instead.

This ever-growing library covers both international and national companies, because acronyms can mean different things in different countries.
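In its simplest form, such a knowledge base can be sketched as a lookup table with optional country-specific overrides. The global entries are acronyms named in the text; the country override, its country code and the `expand_acronym` function are hypothetical placeholders.

```python
# Acronyms taken from the examples in the text.
GLOBAL_ACRONYMS = {
    "GSK": "GlaxoSmithKline",
    "GE": "General Electric",
    "HP": "Hewlett Packard",
}

# Hypothetical override: the same acronym meaning something else locally.
COUNTRY_ACRONYMS = {
    ("XX", "GE"): "Example Local Company",
}


def expand_acronym(name, country=None):
    """Return the full company name for a known acronym, else the input name."""
    key = name.strip().upper()
    if country is not None and (country, key) in COUNTRY_ACRONYMS:
        return COUNTRY_ACRONYMS[(country, key)]
    return GLOBAL_ACRONYMS.get(key, name)


print(expand_acronym("HP"))
print(expand_acronym("GE", country="XX"))
print(expand_acronym("Siemens AG"))
```

Expanding acronyms before any fuzzy comparison matters because no edit-distance or phonetic function will ever relate ‘HP’ to ‘Hewlett Packard’.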

Leveraging of Corroborative Information

By using other data such as address information, telephone numbers, Longitude/Latitude coordinates, City etc. we can find potential matches where the company name is very different but where a human being can use their own common sense or local knowledge to confirm that this is indeed the same business.

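By way of an example, I have recently been matching inventories of Hotel names, and in some instances the Hotel name would be the parent chain name such as ‘Best Western’ in one list and a completely different name in the other list. In cases like these it is the corroborative data, not the name, that reveals the two records to be the same business.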
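A minimal sketch of how corroborative fields might be folded into a single score. The field names, the weights and the hotel records are all hypothetical; the article does not say how Match2Lists combines this evidence.

```python
from difflib import SequenceMatcher


def digits_only(phone):
    """Strip spaces, brackets and dashes so phone formats compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())


def corroborated_score(a, b, name_weight=0.5, phone_weight=0.3, city_weight=0.2):
    """Blend name similarity with agreement on phone number and city."""
    name_similarity = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()).ratio()
    phone_a, phone_b = digits_only(a.get("phone", "")), digits_only(b.get("phone", ""))
    phone_agrees = 1.0 if phone_a and phone_a == phone_b else 0.0
    city_a, city_b = a.get("city", "").lower(), b.get("city", "").lower()
    city_agrees = 1.0 if city_a and city_a == city_b else 0.0
    return (name_weight * name_similarity
            + phone_weight * phone_agrees
            + city_weight * city_agrees)


# Hypothetical records: chain name in one list, property name in the other.
chain_record = {"name": "Best Western", "phone": "020 7946 0000", "city": "London"}
same_hotel = {"name": "Example Palace Hotel", "phone": "(020) 7946-0000", "city": "London"}
other_hotel = {"name": "Example Palace Hotel", "phone": "0161 496 0000", "city": "Manchester"}

print(round(corroborated_score(chain_record, same_hotel), 3))
print(round(corroborated_score(chain_record, other_hotel), 3))
```

A candidate surfaced this way still deserves human review, since a shared switchboard number or city does not prove that two records are the same business.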

Iterative Approach to Matching

Most matching applications run a single pass on the data using one set of criteria, and then output the results for scrutiny. We found this resulted in many good matches being missed, and in some cases in inferior matches being produced in place of better ones.

By implementing functionality to dynamically change the matching criteria, adjust the importance of the corroborative information and continually append the results to the same project, we were able to find more matches and ensure we approved the best possible matches.
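The effect of several passes can be sketched as follows. This is an illustration under assumptions, not the Match2Lists algorithm: each pass uses a looser similarity threshold than the last, approved matches are carried forward, and a record on the right-hand list can only be used once. The thresholds and names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive fuzzy similarity between two names (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def iterative_match(left, right, thresholds=(1.0, 0.9, 0.6)):
    """Match left names to right names over several passes.

    Strict passes run first, so exact matches are locked in before
    looser criteria get the chance to claim the same record.
    """
    approved = {}   # left name -> (right name, threshold of the approving pass)
    used = set()    # right names already approved
    for threshold in thresholds:
        for name in left:
            if name in approved:
                continue
            candidates = [r for r in right if r not in used]
            if not candidates:
                return approved
            best = max(candidates, key=lambda r: similarity(name, r))
            if similarity(name, best) >= threshold:
                approved[name] = (best, threshold)
                used.add(best)
    return approved


left = ["Visa Inc", "Visa"]
right = ["Visa", "Visa Incorporated"]
multi_pass = iterative_match(left, right)
single_pass = iterative_match(left, right, thresholds=(0.6,))
print(multi_pass)
print(single_pass)
```

In the single loose pass, ‘Visa Inc’ claims ‘Visa’ first and the exact ‘Visa’ to ‘Visa’ match is lost; running the strict pass first keeps the exact match and leaves ‘Visa Incorporated’ for ‘Visa Inc’.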

Powerful Visualisation

Lastly, and a real game changer for us, was the implementation of a powerful User Interface to visualise the match candidates. By being able to scrutinise the results visually, we were able to determine the best criteria to use and to approve many more results automatically than before. This Visualisation meant that we achieved significant reductions in the cost of our matching projects and were able to deliver the results to our customers in record-breaking time.

Get in touch if you would like to learn more about our approach to B2B matching, or if you have any questions.


2 Comments


  1. This is a great article!

    I’m a SQL Developer working for a Data Services company in the north west.

    I’m currently working on an exercise to improve our data deduplication process so this information is really helpful.

    Would it be possible to have a chat about a couple of interesting issues I’m facing with this project?

    Thanks.
