Test Data Management

Informatica has just released the latest version of its Test Data Management products, which are part of the Information Lifecycle Management (ILM) suite that it acquired when it bought Applimation in February 2009. This is the first major release since the acquisition to integrate with the PowerCenter platform.

There are, in fact, two Test Data Management products: Data Subset and Data Masking. I will deal with the second one of these first.

Data masking is the process of hiding sensitive data, and data can be sensitive for various reasons. Most obviously, it might be subject to data protection regulations: for example, credit card or social security numbers. However, this is not limited to compliance: Chanel, for example, probably wants to protect the formula for Chanel No 5; so data masking potentially also applies to intellectual property that the organisation would like to remain secret.

Now, bear in mind that we are talking about test data management here. We are not simply hiding data: we are masking it in a way that allows developers to test new applications, or a new version of a database, against real (or what looks like real) data. So simply putting an x in place of each digit in a credit card number will not be sufficient if the application under test has to recognise those numbers.

Instead you will need a credit card number that looks real but isn’t. And you can’t just use a standard algorithm to jumble up the digits or to generate new numbers from the existing ones, because such schemes can be cracked: you actually need some pretty sophisticated masking software that guards against people who might try to hack and then reverse-engineer your masked data.
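To make "looks real but isn’t" concrete, here is a minimal Python sketch. It is my own illustration and not how Informatica's Data Masking works: it assumes a secret key, derives replacement digits from an HMAC of the original number, and appends a Luhn check digit so that the result still passes the checksum that card-handling applications typically apply. The function names and the key are hypothetical.

```python
import hashlib
import hmac


def luhn_check_digit(payload: str) -> str:
    """Return the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """True if the digit string passes the Luhn checksum."""
    return number.isdigit() and luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(card: str, key: bytes) -> str:
    """Replace the digits of `card` with keyed pseudo-random digits.

    Separators (dashes, spaces) stay where they were, so the masked
    value has the same format as the original.
    """
    digits = "".join(c for c in card if c.isdigit())
    n = len(digits)
    # Keyed hash: without the key, the mapping cannot simply be re-run.
    mac = hmac.new(key, digits.encode(), hashlib.sha256).hexdigest()
    body = str(int(mac, 16) % 10 ** (n - 1)).zfill(n - 1)
    masked_digits = iter(body + luhn_check_digit(body))
    return "".join(next(masked_digits) if c.isdigit() else c for c in card)


if __name__ == "__main__":
    secret = b"example-key-do-not-use"  # hypothetical key, for illustration only
    print(mask_card_number("4111-1111-1111-1111", secret))
```

The same input and key always give the same masked value, which keeps joins between test tables consistent. A sketch like this ignores issues such as collisions between masked values and preserving issuer prefixes, which is part of why dedicated masking software is needed.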

The second point is that you need to know where the data to be masked is. This is one of the areas where Informatica’s acquisition of Applimation makes sense, because you can use Informatica’s data profiling tools to discover the relevant patterns (xxxx-xxxx-xxxx-xxxx, say) for you automatically.
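As a rough illustration of what pattern discovery means, the Python sketch below scans sampled column values and flags the columns in which most values match the xxxx-xxxx-xxxx-xxxx shape. It is an assumption-laden toy, not Informatica's profiling logic: the table layout, the `find_candidate_columns` name and the 80% threshold are all hypothetical.

```python
import re

# Four groups of four digits separated by dashes, as in the pattern above.
CARD_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{4}")


def find_candidate_columns(table, threshold=0.8):
    """Return names of columns whose non-empty values mostly match the pattern.

    `table` maps a column name to a list of sampled values.
    """
    candidates = []
    for name, values in table.items():
        non_empty = [str(v) for v in values if v not in (None, "")]
        if not non_empty:
            continue
        matches = sum(1 for v in non_empty if CARD_PATTERN.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            candidates.append(name)
    return candidates


if __name__ == "__main__":
    sample = {
        "customer_name": ["Ann Lee", "Raj Patel", "Maria Diaz"],
        "payment_ref": ["4111-1111-1111-1111", "5500-0000-0000-0004", ""],
        "notes": ["called 2009-11-02", None, "card ends 1111"],
    }
    print(find_candidate_columns(sample))
```

Note that the column is found by its contents rather than its name, which is the point: sensitive data often sits in columns whose names give no hint of it.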

Thirdly, while Data Masking is part of the Test Data Management suite, bear in mind that it is also applicable in other environments, and not just archival either: data migration, for example, or securing SWIFT messages.

The other part of this announcement is Data Subset, which is specifically about test data management. There are a few ways to obtain test data, but perhaps the most obvious is to take a copy of part or all of your production database. Taking all of it is typically too much: it requires more hardware and therefore costs more. But just copying a chunk of it more or less at random is not very useful either, because you can easily miss the outliers, and it is the outliers that you most need to test against.

A more efficient option is to use a tool like Data Subset. This allows you to define formal rules and policies for extracting a copy of a subset of your production database for testing purposes. The advantage, of course, is that you can ensure that the test dataset is representative of the production database as a whole (including outliers) so that the testing is thorough.
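The idea of rule-driven subsetting can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the Data Subset product: rows are plain dictionaries, the "rules" are predicates that mark rows which must always be kept (the outliers), and everything else is sampled at a fixed rate. The names `subset`, `must_include` and `sample_fraction` are hypothetical.

```python
import random


def subset(rows, must_include, sample_fraction, seed=0):
    """Pick a test subset: every row matching a rule, plus a random sample.

    rows            -- iterable of dict-like records
    must_include    -- list of predicates; a row matching any one is always kept
    sample_fraction -- probability of keeping a row that matches no rule
    seed            -- fixed seed so the same subset can be regenerated
    """
    rng = random.Random(seed)
    chosen = []
    for row in rows:
        if any(rule(row) for rule in must_include):
            chosen.append(row)  # outliers are never left to chance
        elif rng.random() < sample_fraction:
            chosen.append(row)
    return chosen


if __name__ == "__main__":
    orders = [{"id": i, "amount": 50 + i} for i in range(100)]
    orders += [{"id": 100, "amount": 250000}, {"id": 101, "amount": -5}]
    rules = [
        lambda r: r["amount"] > 10000,  # unusually large orders
        lambda r: r["amount"] < 0,      # refunds or bad data
    ]
    test_rows = subset(orders, rules, sample_fraction=0.1)
    print(len(test_rows), "rows selected out of", len(orders))
```

A purely random 10% copy of these orders would most likely miss both unusual rows; the rules guarantee they are present. The sketch works on a single table and does nothing to keep related tables consistent with one another.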

Needless to say, Data Subset integrates with Data Masking, and both integrate with PowerCenter: you can define a rule as a PowerCenter mapplet. There are also validation facilities that support the data steward’s role.

Finally, the company has announced that it will be introducing accelerators that will give a kick start to test data management for environments such as Oracle’s and SAP’s. These should be available early next year.

To conclude: I know that test data management, masking and the other technologies discussed here are not sexy. But that doesn’t mean that they aren’t important. Data masking, in particular, should be considered mandatory, especially if any of your development is outsourced and even more especially if any or all of it is overseas. Otherwise you could be subject to substantial fines.

Philip Howard is Research Director (Data Management) at Bloor Research. Data management refers to the management, movement, governance and storage of data and involves diverse technologies that include (but are not limited to) databases and data warehousing, data integration (including ETL, data migration and data federation), data quality, master data management, metadata management and log and event management. Philip also tracks spreadsheet management and complex event processing.