Perspective on Improving Matching Performance of Large SAP MDM Repositories
Some immovable requirements result in replicating the SAP ECC data model in SAP MDM. The result is a large repository with over 250 fields. Once this repository is loaded with over a million records and you start matching or searching from SAP Enterprise Portal, you will encounter several performance challenges. A slow MDM system will never be accepted by business users, so it is essential to identify and mitigate these risks proactively. Clients should be aware that replicating the data model in SAP MDM may be the easiest of the available options, compared with building a solution on eSOA, but it carries a real-time performance risk. For accurate matching results we use the Token Equals feature, but this leads to very long processing times, especially when matching a million records with over 250 fields.
The following SAP document is a good starting point for improving performance: "How to Optimize an MDM Matching Process".
Perspective:
One way to speed up matching and searching is to use two separate repositories: one dedicated to matching and searching tasks, and the other serving as the parent repository with all fields. The dedicated matching repository should hold only the crucial fields, such as the matching fields and the primary and foreign keys. The Portal then connects to this smaller repository for matching. Once the results are displayed on SAP Enterprise Portal, the user can choose to add, delete or update records, and the resulting action is carried out against the main repository. A minimal sketch of this split is shown below.
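To make the idea concrete, here is a minimal Python sketch of the two-repository split, written outside SAP MDM purely for illustration. The field names (RECORD_ID, SOURCE_SYSTEM_KEY, FIRST_NAME and so on) are assumptions, not a prescribed data model.

```python
# Illustrative sketch of the two-repository split (field names are hypothetical):
# the dedicated matching repository keeps only the keys and matching fields,
# while the main repository keeps the full ~250-field record.

MATCHING_REPO_FIELDS = [
    "RECORD_ID",          # primary key, shared with the main repository
    "SOURCE_SYSTEM_KEY",  # foreign key back to the originating system
    "FIRST_NAME",         # matching fields
    "LAST_NAME",
    "CITY",
    "POSTAL_CODE",
]

def to_matching_record(full_record):
    """Project a full main-repository record onto the slim matching
    repository, keeping only the fields that matching actually needs."""
    return {field: full_record.get(field) for field in MATCHING_REPO_FIELDS}
```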
Keeping a smaller dedicated repository for matching also reduces loading time. Note that you cannot use the Slave feature in the Console for this, because a slave repository must have the same fields as its parent. As per the SAP document, another good practice for improving speed is to use calculated fields in the smaller repository. These calculated fields hold trimmed values of the matching criteria; for example, First Name can be trimmed to its first three characters in one calculated field, its first two characters in another, and so on. Matching on these fields with the Equals feature can be extremely fast, although in our analysis the results were not as accurate as with Token Equals. The sketch below illustrates the idea.
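The following is a minimal Python sketch of the calculated-field idea, again outside MDM and only to show the mechanics. The field names FIRST_NAME_3 and FIRST_NAME_2 are hypothetical stand-ins for the calculated fields.

```python
# Sketch: store trimmed prefixes of a matching field and compare them with
# plain equality, instead of tokenizing the full value for every comparison.

from collections import defaultdict

def add_calculated_fields(record):
    """Derive trimmed prefix fields from FIRST_NAME, as the calculated
    fields in the dedicated matching repository would hold."""
    name = record["FIRST_NAME"].strip().upper()
    record["FIRST_NAME_3"] = name[:3]
    record["FIRST_NAME_2"] = name[:2]
    return record

def build_index(records, field):
    """Group record IDs by one calculated field, so an 'Equals' match
    becomes a dictionary lookup instead of a full scan."""
    index = defaultdict(list)
    for rec in records:
        index[rec[field]].append(rec["ID"])
    return index

records = [
    {"ID": 1, "FIRST_NAME": "Jonathan"},
    {"ID": 2, "FIRST_NAME": "Jon"},
    {"ID": 3, "FIRST_NAME": "Maria"},
]
records = [add_calculated_fields(r) for r in records]

index = build_index(records, "FIRST_NAME_3")
candidate = add_calculated_fields({"ID": 99, "FIRST_NAME": "Jonas"})

# Equals on the trimmed field finds IDs 1 and 2 instantly; a Token Equals
# pass over the full values is more precise but far slower at scale.
print(index[candidate["FIRST_NAME_3"]])   # -> [1, 2]
```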
Handling the dilemma of high accuracy with long delays versus faster results with average accuracy was a good learning experience for us. A number of what-if scenarios need to be run, recording the options used and the time taken for each. These options can include different calculated fields, different matching strategies and scores, and the choice of Equals or Token Equals for each matching field. Analyzing these runs gives insight into the right approach, supported by quantified data. With this study of matching behavior, you should be able to identify the approach that gives accurate results in the shortest time. A simple harness for such a comparison is sketched below.
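One way to run these what-if scenarios in a structured fashion is to time each candidate configuration against the same sample and compare it with a manually verified duplicate list. The sketch below assumes each strategy is wrapped in a callable; all names are hypothetical.

```python
# Illustrative what-if harness (not an MDM API): time each candidate matching
# configuration and record precision/recall against verified duplicates.

import time

def run_what_if(strategies, sample_records, known_duplicates):
    """strategies: dict of name -> callable that returns a set of
    (id_a, id_b) pairs flagged as duplicates within sample_records."""
    results = []
    for name, match_fn in strategies.items():
        start = time.perf_counter()
        flagged = match_fn(sample_records)
        elapsed = time.perf_counter() - start
        hits = len(flagged & known_duplicates)
        precision = hits / len(flagged) if flagged else 0.0
        recall = hits / len(known_duplicates) if known_duplicates else 0.0
        results.append((name, elapsed, precision, recall))
    # Sort by elapsed time so the speed versus accuracy trade-off is visible.
    return sorted(results, key=lambda r: r[1])
```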
A good practice during Blueprint workshops is to present all the results along with the choice of matching strategy, scores and threshold limits. If there is disagreement among client stakeholders on the best matching criteria, statistical techniques such as average ranking or Spearman's rank correlation can be used to reconcile their preferences (see the sketch below). Since each project is unique, it is difficult to generalize the approach. For Business Partners, for example, using calculated fields for both First Name and Last Name increases accuracy more than using calculated fields for First Name alone.
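As a small illustration of these techniques, the sketch below averages two stakeholders' rankings of candidate strategies and measures how well the stakeholders agree using Spearman's rank correlation. The strategy names and rankings are made up for the example.

```python
# Sketch: reconcile stakeholder rankings of candidate matching strategies.
# Average rank picks the compromise option; Spearman's rho shows agreement.

from scipy.stats import spearmanr

strategies = ["Equals on trimmed fields", "Token Equals", "Mixed strategy"]
# Rank 1 = most preferred, one ranking per stakeholder (illustrative values).
ranks_business = [1, 3, 2]
ranks_data_team = [2, 1, 3]

avg_rank = [(s, (b + d) / 2) for s, b, d in
            zip(strategies, ranks_business, ranks_data_team)]
avg_rank.sort(key=lambda item: item[1])
print("Average ranking:", avg_rank)

rho, p_value = spearmanr(ranks_business, ranks_data_team)
print(f"Agreement between stakeholders (Spearman rho): {rho:.2f}")
```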
Insight into how business users work helps in deciding between Token Equals and Equals on calculated fields. You could use Equals on calculated fields for everyday matching by business users, purely for the speed of the results, and run Token Equals matching periodically, say weekly or fortnightly, so that the Data Administrator can identify possible duplicates. This dual approach involves some redundant activity but ensures healthy data.
Data analysis using random sampling gives insight into how the master data is spread across categories such as Country, Organization and so on. Depending on the classification pattern, you can filter records by Country, Region or Category (in Retail, for example, Apparel, Food and so on). Filtering the candidate set makes matching noticeably faster; a small profiling sketch follows.
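The sketch below shows one way to profile a random sample by a classification field and then restrict matching to a single partition. The field names (COUNTRY) and the assumption that records are already loaded as a list of dictionaries are illustrative only.

```python
# Sketch: profile a random sample to see how records spread over Country or
# Category, then restrict matching to one partition at a time.

import random
from collections import Counter

def profile_sample(records, field, sample_size=10_000):
    """Count values of one field over a random sample of records."""
    sample = random.sample(records, min(sample_size, len(records)))
    return Counter(rec[field] for rec in sample)

def filter_partition(records, country):
    """Restrict the matching candidate set to a single country."""
    return [rec for rec in records if rec["COUNTRY"] == country]

# Usage (assuming 'records' is a list of dicts already loaded):
# counts = profile_sample(records, "COUNTRY")
# de_only = filter_partition(records, "DE")
# Matching within 'de_only' compares far fewer candidate pairs.
```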
The best practice remains to keep global master data in SAP MDM and leave transactional fields in their respective systems such as SAP ECC. This enables a standardized data model and attributes for global use, instead of replicating the legacy or SAP ECC data model in SAP MDM.
Navendu Shirali is a Consultant in the SAP MDM Center of Excellence at Infosys. His areas of work include building new solutions, converting opportunities and master data consulting.