Sensitive Data Masking: A Developer's Guide

Data masking is an essential technique for protecting sensitive information in development and testing environments. With the LGPD (Lei Geral de Proteção de Dados, Brazil's general data protection law) in effect, knowing how to replace real data with fictitious but valid data is an essential skill for Brazilian developers.

What is data masking?

Data masking is the process of replacing real sensitive data with fictitious data that keeps the same structural characteristics. A real CPF such as '123.456.789-09' is replaced by a different, algorithmically generated CPF that keeps the same format and still passes check-digit validation.
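To make "check-digit validation" concrete, here is a minimal sketch of the widely documented CPF rule: two check digits, each computed as a weighted sum modulo 11. The function names are illustrative and do not come from any particular library.

```python
def cpf_check_digits(base: str) -> str:
    """Compute the two CPF check digits for a 9-digit base."""
    digits = [int(c) for c in base]
    for _ in range(2):
        # Weights run from len(digits) + 1 down to 2 (10..2, then 11..2).
        weights = range(len(digits) + 1, 1, -1)
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return f"{digits[9]}{digits[10]}"


def is_valid_cpf(cpf: str) -> bool:
    """Accept formatted or unformatted input; check length and both digits."""
    digits = "".join(c for c in cpf if c.isdigit())
    # Sequences such as 111.111.111-11 satisfy the arithmetic but are
    # conventionally rejected by validators.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    return cpf_check_digits(digits[:9]) == digits[9:]


print(is_valid_cpf("111.444.777-35"))  # True
print(is_valid_cpf("111.444.777-00"))  # False
```

A masked value has to pass this same check, otherwise any application code that validates CPFs will reject the test data.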

Unlike encryption, which makes data unreadable without the key, masking produces data that looks real and passes system validations but no longer identifies the person behind the original record. This allows developers and testers to work with realistic data without the risk of exposing personal information.

When is masking necessary?

LGPD requires that personal data be protected at all stages of processing, including development and testing. Staging, QA, and development environments frequently receive copies of production data. Without proper masking, those copies can put the organization in violation of the law.

Beyond legal compliance, masking prevents accidental leaks. Developers working with real data may inadvertently expose it in logs, screenshots, Git repositories, or debugging tools. When the data is properly masked, anything that leaks through these channels no longer points to a real person.

Masking techniques for Brazilian documents

For documents like CPF and CNPJ, the most effective technique is substitution with valid fictitious data. Instead of simply scrambling digits (which can generate invalid documents), use generators that respect the check digit algorithm.
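As a rough illustration of such a generator, the sketch below produces fictitious CPF and CNPJ numbers with correct check digits. It assumes the standard numeric modulo-11 rules for both documents. The function names and the fixed branch suffix 0001 are choices made for this example.

```python
import random

CPF_WEIGHTS = [11, 10, 9, 8, 7, 6, 5, 4, 3, 2]
CNPJ_WEIGHTS = [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]


def mod11_digit(digits, weights):
    """Check digit shared by CPF and CNPJ: weighted sum modulo 11."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def generate_cpf(rng):
    """Return a formatted fictitious CPF whose check digits are valid."""
    while True:
        digits = [rng.randint(0, 9) for _ in range(9)]
        if len(set(digits)) > 1:  # validators usually reject repeated digits
            break
    digits.append(mod11_digit(digits, CPF_WEIGHTS[1:]))  # weights 10..2
    digits.append(mod11_digit(digits, CPF_WEIGHTS))      # weights 11..2
    s = "".join(map(str, digits))
    return f"{s[:3]}.{s[3:6]}.{s[6:9]}-{s[9:]}"


def generate_cnpj(rng):
    """Return a formatted fictitious CNPJ (branch 0001) with valid check digits."""
    digits = [rng.randint(0, 9) for _ in range(8)] + [0, 0, 0, 1]
    digits.append(mod11_digit(digits, CNPJ_WEIGHTS[1:]))
    digits.append(mod11_digit(digits, CNPJ_WEIGHTS))
    s = "".join(map(str, digits))
    return f"{s[:2]}.{s[2:5]}.{s[5:8]}/{s[8:12]}-{s[12:]}"


rng = random.Random(2024)  # seeded only so that runs are reproducible
print(generate_cpf(rng))
print(generate_cnpj(rng))
```

A randomly generated number can coincide with a document that exists in the real world, so these values should be treated as test data only and never used outside lower environments.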

For other fields like names and addresses, techniques such as shuffling (mixing between records) and substitution from a lookup table are effective. The important thing is maintaining referential consistency: if a CPF appears in multiple tables, it should be replaced by the same fictitious CPF in all of them.
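One way to get referential consistency without storing a lookup table is to derive the fictitious CPF deterministically from the real one with a keyed hash. The sketch below assumes an HMAC-SHA256 key that is kept out of the lower environments. The function name, key, and sample rows are hypothetical.

```python
import hashlib
import hmac


def mask_cpf(real_cpf: str, key: bytes) -> str:
    """Map a real CPF to the same fictitious, valid CPF every time."""
    # Normalize so '111.444.777-35' and '11144477735' mask identically.
    normalized = "".join(c for c in real_cpf if c.isdigit())
    digest = hmac.new(key, normalized.encode("ascii"), hashlib.sha256).digest()
    # Derive a 9-digit base from the digest, then append the check digits.
    base = int.from_bytes(digest[:8], "big") % 10**9
    digits = [int(c) for c in f"{base:09d}"]
    for first_weight in (10, 11):
        weights = range(first_weight, 1, -1)
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    s = "".join(map(str, digits))
    return f"{s[:3]}.{s[3:6]}.{s[6:9]}-{s[9:]}"


key = b"hypothetical-key-kept-outside-the-repo"
customers = [{"id": 1, "cpf": "111.444.777-35"}]
orders = [{"order_id": 90, "customer_cpf": "11144477735"}]

masked_customers = [{**c, "cpf": mask_cpf(c["cpf"], key)} for c in customers]
masked_orders = [
    {**o, "customer_cpf": mask_cpf(o["customer_cpf"], key)} for o in orders
]

# Both tables carry the same fictitious CPF, so joins keep working.
assert masked_customers[0]["cpf"] == masked_orders[0]["customer_cpf"]
```

With only 10^9 possible bases, two different real CPFs can occasionally map to the same fictitious one in large datasets. Where the column has a uniqueness constraint, a persisted mapping table that checks for duplicates is the safer choice.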

Implementing masking in the data pipeline

Masking should be automated and integrated into the data pipeline. When copying data from production to lower environments, a masking script should run automatically, replacing all sensitive fields before any developer has access.
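A masking step of this kind could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for a freshly restored production copy. The table names, column names, and the helper functions are assumptions made for the example, and only the CPF columns are handled.

```python
import random
import sqlite3


def fake_cpf(rng):
    """Fictitious CPF (digits only) with valid check digits."""
    digits = [rng.randint(0, 9) for _ in range(9)]
    for first_weight in (10, 11):
        weights = range(first_weight, 1, -1)
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return "".join(map(str, digits))


def mask_column(conn, table, column, mapping, rng):
    """Overwrite `column` in place, reusing `mapping` across tables.

    `table` and `column` are interpolated into SQL, so they must be trusted
    constants from the pipeline configuration, never user input.
    """
    rows = conn.execute(f"SELECT rowid, {column} FROM {table}").fetchall()
    updates = []
    for rowid, real in rows:
        if real is None:  # leave NULLs untouched
            continue
        if real not in mapping:
            mapping[real] = fake_cpf(rng)
        updates.append((mapping[real], rowid))
    # Updating by rowid avoids re-masking a value that was already replaced.
    conn.executemany(f"UPDATE {table} SET {column} = ? WHERE rowid = ?", updates)
    conn.commit()


# Hypothetical schema standing in for the restored copy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, cpf TEXT)")
conn.execute("CREATE TABLE orders (total REAL, customer_cpf TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ana', '11144477735')")
conn.execute("INSERT INTO orders VALUES (59.9, '11144477735')")

mapping = {}  # real value -> fictitious value, shared by every table
rng = random.Random()
for table, column in [("customers", "cpf"), ("orders", "customer_cpf")]:
    mask_column(conn, table, column, mapping, rng)
```

In a real pipeline this step would run right after the restore and before access is granted, and the in-memory mapping would be discarded afterwards so the masked copy cannot be traced back to the originals.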

Generators for CPF, CNPJ, credit card numbers, and other documents are fundamental pieces of this pipeline. They ensure that masked data still satisfies the application's business rules, which avoids cascading test failures.
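For credit card numbers, the usual validity rule is the Luhn checksum. The following sketch generates test numbers that pass it. The default prefix "4" and the 16-digit length are assumptions for the example, and the function names are illustrative.

```python
import random


def luhn_checksum(number: str) -> int:
    """Luhn sum modulo 10; 0 means the full number passes."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10


def generate_card_number(rng, prefix="4", length=16):
    """Random digits after `prefix`, followed by the Luhn check digit."""
    body_len = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randint(0, 9)) for _ in range(body_len))
    # Compute the sum with a placeholder 0, then pick the digit that
    # brings the total to a multiple of 10.
    check = (10 - luhn_checksum(body + "0")) % 10
    return body + str(check)


rng = random.Random()
number = generate_card_number(rng)
assert luhn_checksum(number) == 0
```

Passing the Luhn check only makes the number structurally valid for form and field validation. It says nothing about whether a payment provider would accept it.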

Masking vs. synthetic generation

While masking replaces real data, synthetic generation creates completely new data from scratch. Both approaches have their place: masking tends to preserve the statistical distribution of the original data, while synthetic generation offers greater control over test scenarios.
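The control that synthetic generation offers can be illustrated with a small sketch that builds records from scratch and deliberately breaks the check digit on a chosen number of them, for example to exercise validation error paths. The record layout and function names are hypothetical.

```python
import random


def with_check_digits(base):
    """Append the two CPF check digits to a list of 9 digits."""
    digits = list(base)
    for first_weight in (10, 11):
        weights = range(first_weight, 1, -1)
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return digits


def synthetic_customers(rng, count, invalid_count):
    """Create `count` records; the first `invalid_count` get a wrong check digit."""
    records = []
    for i in range(count):
        digits = with_check_digits([rng.randint(0, 9) for _ in range(9)])
        expect_valid = i >= invalid_count
        if not expect_valid:
            # Shift the last check digit so validation is guaranteed to fail.
            digits[-1] = (digits[-1] + 1) % 10
        records.append(
            {
                "id": i + 1,
                "cpf": "".join(map(str, digits)),
                "expect_valid": expect_valid,
            }
        )
    return records


records = synthetic_customers(random.Random(7), count=10, invalid_count=3)
print(sum(1 for r in records if not r["expect_valid"]))  # 3
```

Because nothing here is derived from production, the mix of valid and invalid cases is chosen by the test author rather than inherited from whatever the real data happens to contain.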

The ideal approach combines both techniques: use masking when you need to maintain relationships and volumes similar to production, and synthetic generation when you need specific scenarios or data with controlled characteristics. Tools like help4.dev facilitate synthetic generation of valid Brazilian documents.