Why SAP data consistency and anonymisation is not a clear-cut process

Written by James Watson | 21 October 2021

Data privacy gets a lot of focus, and organisations need to consider how to comply with regulations that protect an individual’s right to data privacy. In Europe, GDPR was brought in to set a new standard for regulating data privacy, and many other jurisdictions around the globe have updated their legislation to incorporate data privacy, for example the POPI Act in South Africa, LGPD in Brazil and CCPA in California.

If you are reading this blog, it’s likely that you are already reviewing your organisation’s data security needs and may have started implementing some solutions to address some of your concerns. Today, I would like to share some of the experience I have gathered by working for EPI-USE Labs on multiple data privacy projects, specifically in the context of SAP® systems.

Typically, there are three clear voices when workshopping data privacy requirements:

For test managers and product owners: the clear priority is to be able to test processes on ‘real’ data (something which has all the vagaries of production, including the things that are not 100% correct).
For data protection teams: the priority is that no Personally Identifiable Information (PII) can exist outside of production.
For IT and infrastructure teams: the focus is on speed and efficiency for delivering the copied and scrambled data with minimal disruption to the non-production landscape.

These three conflicting views present a specific challenge in delivering a secure and useable development operations environment. In this post, I want to explore the difficulties in providing the aligned, production-like data which test managers wish to see.

Here are two data scenarios to consider:

Scenario 1

Consider a simple example: an employee who works in the procurement department and has tax numbers in Spain. Employees are a common focus area when you think of Personally Identifiable Information. To break it down and show some of the standard SAP processes that store the sensitive data, here is a high-level diagram that shows the data relationships in the employee object.

In the standard SAP data model, there can be hundreds of sensitive data fields. On top of that come the customisations you have made over the years, where additional copies of the data have been stored. All of these data items have to be populated by a person at the outset and, as such, carry the risk of human input error. Over time, it is likely that corrections, cleanse programs and manual fixes by the business have left the data thoroughly out of sync.

The power of the system lies in being integrated and connected, but that same integration creates more complexity when you look to mask or scramble the data for data privacy compliance.

Scenario 2

This example was requested by a client who needed to mask the VAT tax number in Spain and Portugal. In this scenario the business rule is very simple: the field STCEG in KNA1 should be the concatenated value of the country and field STCD1 for that client:

KUNNR	LAND1	STCD1 - KNA1	STCD2 - KNA1	STCEG - KNA1
0000000001	ES	A12345678	-	ESA12345678
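
The business rule above can be expressed as a small check. This is only a minimal sketch in Python for illustration: the field names come from the table above, while the `CustomerTaxRecord` class and the helper functions are hypothetical names, not part of SAP or of any EPI-USE Labs product.

```python
from dataclasses import dataclass


@dataclass
class CustomerTaxRecord:
    """Hypothetical container for the KNA1 fields shown in the table."""
    kunnr: str   # customer number
    land1: str   # country key, e.g. "ES"
    stcd1: str   # tax number 1
    stceg: str   # VAT registration number


def expected_stceg(record: CustomerTaxRecord) -> str:
    # The client's rule: country key concatenated with STCD1.
    return record.land1 + record.stcd1


def follows_rule(record: CustomerTaxRecord) -> bool:
    return record.stceg == expected_stceg(record)


sample = CustomerTaxRecord("0000000001", "ES", "A12345678", "ESA12345678")
print(follows_rule(sample))  # True for the sample row in the table
```

A record where STCEG was entered without the country prefix, or left blank, would fail this check, which is exactly the kind of variation the analysis below looks for.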

Here is an analysis to help us understand the variation in the data. The analysis looks at the length of the tax numbers and at how consistently the data follows this rule, prior to any data anonymisation activity. From this analysis, I found:

six different tax number lengths within the country of Spain alone
10% of data which didn’t match the originals
a further 20% of examples where a field was blank
25% where the fields followed an alignment rule, but not the one in the customer requirement
more than 1,000 different scenarios of consistency when additional countries were overlaid and cross-system integration with CRM was considered.
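
A profiling pass of this kind could be sketched roughly as follows. The rows are made-up sample data, and the category names (`matches_rule`, `blank_field`, `other_alignment`, `mismatch`) are my own labels that only loosely mirror the findings above; a real analysis would read from KNA1 and would need many more categories.

```python
from collections import Counter


def classify(land1: str, stcd1: str, stceg: str) -> str:
    """Assign one (LAND1, STCD1, STCEG) row to a coarse consistency category."""
    if not stcd1 or not stceg:
        return "blank_field"
    if stceg == land1 + stcd1:
        return "matches_rule"      # the client's expected mapping
    if stcd1 in stceg:
        return "other_alignment"   # the fields are related, but by a different rule
    return "mismatch"


# Hypothetical sample rows: (LAND1, STCD1, STCEG)
rows = [
    ("ES", "A12345678", "ESA12345678"),
    ("ES", "A12345678", "A12345678"),
    ("ES", "B1234567", ""),
    ("PT", "123456789", "PT999999999"),
    ("PT", "123456789", "PT123456789"),
]

# How many rows fall into each consistency category.
categories = Counter(classify(*row) for row in rows)

# How many distinct tax number lengths occur per country.
lengths = Counter((land1, len(stcd1)) for land1, stcd1, _ in rows if stcd1)

print(categories)
print(lengths)
```

Even this toy version shows how quickly the number of combinations grows once lengths, countries and categories are crossed with each other.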

In the end, just 35% of the data fitted into the requirement that the client had given as the ‘ideal world’ or most traditional mapping.

So what is the impact of these?

Going back to my original three priorities (delivering real production data, masking it to address security concerns, and doing so in an efficient manner), one needs to know how the data is linked and understand how data quality plays a role in the system.

In the first example, you see that just catering for the obvious employee infotype data will leave a huge amount of related sensitive data without anonymisation.
In the second example, where we only had to set the two tax fields, the logic would have to cater for 1,163 scenarios. The run-time for comparing and making these decisions would be far greater than the allowable downtime for a system refresh and scrambling.
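
To make the second point concrete, here is a minimal sketch of masking that keeps the two tax fields aligned for the single 'ideal world' scenario only. It is an illustration under my own assumptions, not the method used by any particular product: the keyed-hash substitution, the `SECRET_KEY` value and the function names are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-real-use"  # hypothetical key, for illustration only


def mask_value(value: str) -> str:
    """Deterministically replace letters and digits, keeping length and character class.

    Letters are replaced with uppercase letters; other characters are kept as-is.
    """
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    masked = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if ch.isdigit():
            masked.append(str(b % 10))
        elif ch.isalpha():
            masked.append(chr(ord("A") + b % 26))
        else:
            masked.append(ch)
    return "".join(masked)


def mask_record(land1: str, stcd1: str, stceg: str) -> tuple[str, str]:
    """Mask STCD1 and STCEG, preserving the country + STCD1 rule where it held."""
    masked_stcd1 = mask_value(stcd1)
    if stceg == land1 + stcd1:
        # Scenario 1 of many: the rule held before masking, so keep it afterwards.
        masked_stceg = land1 + masked_stcd1
    else:
        # Every other scenario is masked independently here; in practice each
        # one would need its own decision logic.
        masked_stceg = mask_value(stceg)
    return masked_stcd1, masked_stceg


print(mask_record("ES", "A12345678", "ESA12345678"))
```

The single `if` branch covers only the records that follow the expected rule. Each additional scenario means another comparison per record, which is where the run-time cost described above comes from.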

At EPI-USE Labs, we work with clients to analyse the data in their SAP system and to predict where issues can arise in the data. We also provide potential solutions to give each process owner the best possible outcome.

Please take a look at this webinar, where I discuss data analysis in SAP for the purpose of security in non-production environments. During the webinar, I also demonstrate some of the software we use during a Data Privacy Workshop and Analysis.
