Any data team whose business collects customer data from multiple sources is likely familiar with the concept of data duplication. For example, retailers who collect information from their customers during e-commerce and in-store checkout, who run a loyalty program, and who provide telephone support often experience this issue. Similarly, banks that run their business in functional silos like retail, private banking, mortgages, personal loans, credit reporting, commercial, wealth management, and so on can find it challenging to avoid data duplication. Whenever data is collected from businesses or consumers and entered into different systems, or acquired from various sources, it's often difficult to check whether a record for that person or entity already exists, so a new record is created. Over time, records get created in multiple systems, record identities often differ, and data duplication occurs.
Data duplication can occur for many reasons. For instance:
1. Names, contact information, and other details are misspelled. This frequently happens when consumers, perhaps in a hurry to complete an application, make an online purchase, or give details over the phone, make spelling mistakes.
2. Separate systems have different format requirements or other acceptance criteria.
3. Consumers deliberately modify their information to avoid being blocked. For example, consumers will add bogus apartment numbers so that repeated attempts to claim single-use online coupons look distinct.
4. Other causes can be entirely circumstantial, such as when a customer moves and turns up later with a different address, when a customer marries and takes a new name, or when a parent opens an account for a child who later assumes the credentials but doesn't update the account information.
Real-time data validation software can go a long way toward preventing incorrect mailing addresses, phone numbers, and email addresses from being captured, but other "mistakes" or "differences" can still occur, complicating data collection. When each department has its own system, those systems are likely to have varying data structures, schemas, labels, and formats, and merging or comparing data across them can be complex. For instance, one system might label the account holder's name as "account name," while another may use "client name" or just "name." Similarly, one database may parse addresses into components like street, city, state, and ZIP code, while another stores the full address as a single value in a field labeled "address." It's easy for humans to interpret these labels as the same, but a computer must be told exactly how to compare the data from two or more differing systems to identify duplicates and determine how to deal with them.
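As a rough illustration, the sketch below maps a few label variants onto a canonical schema before records are compared. The label names and mapping table are assumptions made up for the example, not a prescribed standard.

```python
# A minimal sketch: rename known label variants to canonical field names so
# records from different systems can be compared. Labels here are hypothetical.

LABEL_MAP = {
    "account name": "name",
    "client name": "name",
    "name": "name",
    "soc sec": "ssn",
    "ss#": "ssn",
    "address": "address",
}

def normalize_record(record: dict) -> dict:
    """Map each known label variant to its canonical name; keep unknown labels as-is."""
    return {LABEL_MAP.get(key.strip().lower(), key): value for key, value in record.items()}

system_a = {"Account Name": "Jane Doe", "Soc Sec": "123-45-6789"}
system_b = {"Client Name": "Jane Doe", "SS#": "123-45-6789"}

print(normalize_record(system_a) == normalize_record(system_b))  # True
```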
So what? What harm does duplication do?
A small amount of data duplication may cause only minor problems, but as data volumes grow, duplication not only adds to the cost of storing data, it also slows processing, adds work and frustration, and can adversely affect reports and the results of analyses. In the worst case, business decisions based on poor or duplicate data may have profound implications. For example, sending multiple identical messages to the same customer may cause irritation, damage the company's reputation, and even lose the customer.
The rapid growth of digital processes in the past few years has greatly exacerbated data duplication issues as the volume of data generated and stored has soared. Accurate data is needed to make good use of third-party data and to optimize advertising spending. However, duplication adds to the cost of purchasing third-party data for segmentation and marketing purposes and makes advertising buys less effective. And for training AI and ML models, duplicated data can affect the accuracy of the models generated.
How can machine learning help reduce data duplication?
As explained above, data duplication can occur for many reasons, so no single solution can reduce or eliminate it. Manual approaches generally rely on creating rules that can be run against the data, which can work when the causes behind the duplication are clear. For example, if one database holds a full first name, last name, date of birth, and address, while a second database only has a first initial, last name, date of birth, and address, then a rule can be written: if the last name, address, and date of birth match, and the first letter of the first name matches the initial, then the records are duplicates (note that twins with the same first initial would still cause an issue). But even in this simple case, describing and creating the rule is quite involved, and creating rules for complex cases soon becomes overwhelming, especially if the rules that address one issue clash with the rules that address another.
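As a sketch, the rule just described might look like the following. The field names and record layout are hypothetical, and a real implementation would normalize the data (case, whitespace, address formatting) first.

```python
# A minimal sketch of the hand-written rule described above: two records are
# flagged as duplicates when last name, date of birth, and address match and
# the first letter of the full first name matches the other record's initial.

def is_duplicate(full_rec: dict, initial_rec: dict) -> bool:
    return (
        full_rec["last_name"].lower() == initial_rec["last_name"].lower()
        and full_rec["dob"] == initial_rec["dob"]
        and full_rec["address"].lower() == initial_rec["address"].lower()
        and full_rec["first_name"][:1].lower() == initial_rec["first_initial"].lower()
    )

a = {"first_name": "Nigel", "last_name": "Smith", "dob": "1980-02-14",
     "address": "12 High St, Leeds"}
b = {"first_initial": "N", "last_name": "Smith", "dob": "1980-02-14",
     "address": "12 High St, Leeds"}

print(is_duplicate(a, b))  # True (twins sharing the initial would also match, as noted above)
```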
The above example assumes the data in each dataset is labeled similarly. When the labels differ, machine learning can identify the type of data in each field and tag it accordingly (smart tagging). For large and complex datasets, this approach can quickly identify which fields should be compared. For example, if account numbers, Social Security numbers, or credit card numbers sit in specific columns, they can be recognized from training data (despite differing column labels, e.g., Soc Sec vs. SS#) and used to determine which pairs of fields should be compared. Of course, it is essential to understand the uniqueness and completeness of the data in each field before commencing this exercise.
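A minimal sketch of this idea appears below. Simple regular expressions stand in for a model trained on labeled example columns, and the patterns, tags, and threshold are illustrative assumptions rather than the method of any particular tool.

```python
import re

# A minimal sketch of "smart tagging": infer what a column holds from its
# values rather than its label. Regexes here stand in for a trained classifier.

PATTERNS = {
    "ssn": re.compile(r"^\d{3}-?\d{2}-?\d{4}$"),
    "credit_card": re.compile(r"^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def tag_column(values: list, threshold: float = 0.9) -> str:
    """Return the tag whose pattern matches at least `threshold` of the values."""
    if not values:
        return "unknown"
    for tag, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(str(v).strip())) for v in values)
        if hits / len(values) >= threshold:
            return tag
    return "unknown"

# Columns labeled "Soc Sec" in one system and "SS#" in another both get tagged "ssn".
print(tag_column(["123-45-6789", "987-65-4321", "111-22-3333"]))  # ssn
```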
Once the data fields or columns have been tagged consistently, it's possible to start comparing entries. Typically, this is done with fuzzy matching or a combination of fuzzy matching and a rules-based approach. Fuzzy matching determines how similar two different entries are. For example, Nigel and Nigal would be perceived as very similar names but could still belong to two different people. That's where the rules come into play: if the first names are different but very similar and the rest of the data in the records matches, then there's a high likelihood that the two records are indeed for the same person. At this point, such matches can be brought to the attention of the database manager or a data steward for final resolution, at least for smaller, less complex datasets.
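The sketch below combines a fuzzy similarity score with the kind of rule described above, using Python's standard-library difflib. The 0.8 threshold and the field names are illustrative assumptions.

```python
from difflib import SequenceMatcher

# A minimal sketch of fuzzy matching plus a rule: exact match on last name,
# date of birth, and address, with a fuzzy match on first name.

def similarity(a: str, b: str) -> float:
    """Similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same_person(rec_a: dict, rec_b: dict) -> bool:
    exact = (rec_a["last_name"].lower() == rec_b["last_name"].lower()
             and rec_a["dob"] == rec_b["dob"]
             and rec_a["address"].lower() == rec_b["address"].lower())
    return exact and similarity(rec_a["first_name"], rec_b["first_name"]) >= 0.8

a = {"first_name": "Nigel", "last_name": "Jones", "dob": "1975-07-01",
     "address": "4 Elm Ave, York"}
b = {"first_name": "Nigal", "last_name": "Jones", "dob": "1975-07-01",
     "address": "4 Elm Ave, York"}

print(similarity("Nigel", "Nigal"))   # 0.8
print(likely_same_person(a, b))       # True; flag for a data steward to review
```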
For larger datasets and more complex situations, human intervention is less practical. At that point, machine learning can help by generating suggested rules. The model is trained by presenting a subset of candidate matches to a human operator and learning which rules produce accurate matches. Once trained, it can automate the creation of rules, taking the results of fuzzy matching into account and developing a far more comprehensive approach.
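As a rough sketch of this learning step, the example below fits a logistic regression (via scikit-learn) on record pairs a human has labeled as match or non-match, using per-field fuzzy similarity scores as features. The data, features, and model choice are illustrative assumptions; production deduplication tools use richer features and active learning.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

# A minimal sketch: learn a matching rule from human-labeled record pairs
# instead of hand-writing it, using fuzzy similarity per field as features.

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(rec_a: dict, rec_b: dict) -> list:
    return [sim(rec_a[f], rec_b[f]) for f in ("first_name", "last_name", "address")]

# Human-reviewed training pairs: 1 = same person, 0 = different people.
labeled_pairs = [
    (({"first_name": "Nigel", "last_name": "Jones", "address": "4 Elm Ave"},
      {"first_name": "Nigal", "last_name": "Jones", "address": "4 Elm Ave"}), 1),
    (({"first_name": "Anna", "last_name": "Lee", "address": "9 Oak Rd"},
      {"first_name": "Hannah", "last_name": "Li", "address": "22 Birch Ln"}), 0),
    (({"first_name": "Sam", "last_name": "Patel", "address": "7 Mill St"},
      {"first_name": "Samuel", "last_name": "Patel", "address": "7 Mill St"}), 1),
    (({"first_name": "Sam", "last_name": "Patel", "address": "7 Mill St"},
      {"first_name": "Sara", "last_name": "Park", "address": "31 King St"}), 0),
]

X = [pair_features(a, b) for (a, b), _ in labeled_pairs]
y = [label for _, label in labeled_pairs]

model = LogisticRegression().fit(X, y)

candidate = ({"first_name": "Nigel", "last_name": "Patel", "address": "7 Mill St"},
             {"first_name": "N.", "last_name": "Patel", "address": "7 Mill St"})
print(model.predict_proba([pair_features(*candidate)])[0, 1])  # estimated match probability
```

In practice, low-confidence predictions from a model like this are the ones routed back to a human reviewer, which is how the training set grows over time.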
AI and machine learning techniques for deduplicating complex datasets across multiple data silos will continue to evolve. Nevertheless, it will remain essential to maintain human oversight to ensure that the rules created make sense and that data engineers understand how the generated algorithms are adapting and modifying the data. In addition, black-box approaches may create opportunities for unsanctioned data manipulation and fraud, particularly for financial datasets.