In this next example, we examine a typical analytics or data science use case and the various zones through which data might flow, eventually being stored and perhaps copied further into data scientists' personal sandboxes. The data here will traverse multiple zones and sub-zones to support this objective, so the right security and privacy controls must be carefully evaluated. Recall from our earlier discussion in this section about the graying of the lines between production and non-production environments: how your organization defines these zones of ingestion and discovery, and whether you see them as more production-like or not, will affect both the types and amounts of security mechanisms applied and the degree of data privatization that is appropriate for the destination use zones. These decisions, and the articulated privacy policies supporting them, must factor in the organization's risk tolerance.

As depicted in this first image on the left, some organizations might be tempted to use a model similar to the test data management use case, where the ingestion zone effectively becomes the safe de-identification zone. But this decision needs to be based on a full understanding of the complete array of data use cases that may need to be supported out of each of the included zones. For example, the ingestion zone may need to support other downstream use cases that require less, or perhaps no, data privatization.

So let's look at a slightly different perspective and approach. We start with a modified topology model that treats the ingestion zone as a restricted, production-like zone. We determine this topology design is necessary to support aggregation of data that is likely to be moving into the ingestion zone from a variety of production systems of record and, potentially, some sensitive external data sources. All of this data must be appropriately cataloged, understood from a personal and sensitive data perspective, and protected as if it were in any production zone. The security controls applied to this zone will therefore be very similar to the classic security techniques used in production, and it is depicted here with the same environment boundary border as production. The numbers and types of authorized users able to access this zone, however, will likely be very different from production and more similar to what we saw for the safe zone in the test data management use case. Additionally, data-specific privatization may not yet be viable in this zone, in order to preserve the highest levels of utility that support data aggregation and any downstream requirements for non-perturbed data. For example, some of the information aggregated in the ingestion zone may eventually be moved to traditional warehouses or other guided analytics zones where minimal data utility loss is the norm. One exception where some privatization might be applied in the ingestion zone is external data, where some degree of tokenization may be prudent to limit or protect against as-yet-unknown personal data consent management issues that might come with using this content. Think of it as a safety-first approach.
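As a minimal sketch of that safety-first tokenization idea, assuming a keyed, deterministic tokenization scheme (the key handling, field names, and token format here are illustrative assumptions, not a prescribed design), the treatment of an external record landing in the ingestion zone might look like this:

```python
import hmac
import hashlib

# Illustrative only: in practice the key would come from a vault/KMS,
# never be hard-coded, and tokenization would likely use a dedicated product.
TOKENIZATION_KEY = b"replace-with-key-from-your-kms"

def tokenize(value: str, field: str) -> str:
    """Replace a direct identifier with a deterministic, non-reversible token.

    Deterministic tokens preserve join and aggregation utility across
    datasets while the raw identifier never lands in the zone.
    """
    digest = hmac.new(TOKENIZATION_KEY, f"{field}:{value}".encode(), hashlib.sha256)
    return "tok_" + digest.hexdigest()[:16]

# Hypothetical external record, tokenized before it is persisted.
external_record = {"email": "jane.doe@example.com", "segment": "B2C", "spend": 1042.50}
external_record["email"] = tokenize(external_record["email"], "email")
print(external_record)  # {'email': 'tok_...', 'segment': 'B2C', 'spend': 1042.5}
```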
As with our test data management use case, our security and privacy policies should capture and prescribe, for this set of zones and this use case, the outline we just highlighted for architecting these zones with security-by-design features. This would again include landed and in-flight data encryption controls, centralized authentication and authorization management, and strong monitoring of user access and use. Additionally, with more zones involved, privacy policies will need to describe and prescribe approaches for the different classes of authorized users and the types of access, including access to landed data areas, that require protection. For example, how will data scientists' personal copies of discovered data, or the parts of the data they are allowed to load into their sandboxes, be protected with the correct combination of security and privacy controls? Or will policy prohibit creating any further landed copies of this data at all? These questions must be decided, articulated in policy, and technically enforced.

Additionally, in this use case it is likely we will need to do more analysis and apply the correct amounts of authorized application integration processing, including integrations with resources external to the company, such as writing data through an external integration channel into an internal datastore. This data movement and treatment processing must also be well-defined, with each processing identity authorized via ticketing mechanisms such as those used in a Kerberos-enabled environment.

Another difference in this use case approach is the premise that we do not have an intermediate safe de-identified data zone where data is landed and managed. Instead, we bake the additionally required data privatization mechanisms, appropriate for the use case, into the processing that moves data from ingestion through to discovery. Even if this approach requires some temporary processing storage, it seeks to reduce as much as possible any footprint of sensitive data making its way into the discovery zone. This can get complex if there are many sub-discovery zones requiring varying degrees of privatization, but in this example we will focus on a base perspective that can serve as a starting point.

On the privatization side, again as in the test data management use case, we look to the data classifications of the data included in the use case and topology for direction on which data is personal or sensitive and requires zone-specific de-identification. We look to our privacy policies for the behaviors required to guide the de-identification processing, and we configure, implement, and execute that processing for the appropriate use case zones. Commonly, for base discovery zones, the safest approaches include synonymizing or tokenizing all direct identifier information and then applying some degree of anonymization or advanced masking techniques to indirect identifier data elements (a minimal sketch of this kind of in-flight de-identification follows below). Defining the prescribed privatization policies usually requires validation with the analysts and data scientists, balancing the organization's need for utility and accuracy of this data to support meaningful analysis and modeling against the levels of risk the organization is willing to assume for landed data in a discovery-type zone.
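Here is a minimal, hypothetical sketch of that in-flight de-identification step, applied as records move from ingestion to discovery. The field names, bucketing rules, and token format are illustrative assumptions; in practice they would be driven by the catalog classifications and privacy policies just described:

```python
import hmac
import hashlib

KEY = b"key-from-kms"  # illustrative; real keys come from a vault/KMS

def tokenize_direct(value: str) -> str:
    # Direct identifiers (name, email, account number) become keyed tokens.
    return "tok_" + hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def generalize_age(age: int) -> str:
    # Indirect identifier: bucket exact age into 10-year bands.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def generalize_zip(zip_code: str) -> str:
    # Indirect identifier: keep only the 3-digit ZIP prefix.
    return zip_code[:3] + "XX"

def deidentify(record: dict) -> dict:
    # Applied in flight; no intermediate de-identified copy is landed.
    return {
        "customer_id": tokenize_direct(record["customer_id"]),
        "age_band": generalize_age(record["age"]),
        "zip_region": generalize_zip(record["zip"]),
        "purchases": record["purchases"],  # non-identifying analytic payload
    }

record = {"customer_id": "C-10442", "age": 37, "zip": "10027", "purchases": 14}
print(deidentify(record))
# {'customer_id': 'tok_...', 'age_band': '30-39', 'zip_region': '100XX', 'purchases': 14}
```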
Clearly, this balance between the degree of privatization and the calculated risk will also factor in all the other comprehensive security controls put in place for this zone and its set of sub-zones. It will then be very common to include some level of personal data re-identification risk analysis in this zone, especially since it is likely that external datasets will be included. This analysis requires regular reprocessing and revisiting to make sure the risk assessment configurations keep pace with any changes in the data content, shapes, or classifications. We will discuss more on risk assessment types in the privacy mechanisms section of this series; a simple illustration of this kind of check appears at the end of this section.

Finally, for the copies of data from the larger discovery content that might be moved into individual data scientists' sandboxes, assuming privacy policies allow this, the content must be scrutinized and controlled to protect against reintroducing personal or sensitive data leakage. Some approaches we have seen include controlling the query outputs: for example, controlling or blocking iterative or incremental querying by data scientists as they explore the larger discovery datasets to gain a greater understanding of the data. As we discussed previously, the nature of this iterative analysis can unintentionally, or maybe even intentionally, enable data subject re-identification by zeroing in on outliers or singularities that may not yet have been privatized across the full data population. The organization's data privacy teams must evaluate the probability of these scenarios occurring and decide what they are and are not willing to accept from a personal data re-identification risk, or TOMs (technical and organizational measures), perspective.

It is worth noting some other approaches IBM has been developing to address these kinds of risks. One is building AutoAI techniques for data scientists to use. These can accelerate time to model building and eliminate some of the typical grunt work of data prep and feature engineering, a granular data access exercise most data scientists are accustomed to doing, while reducing the direct interactions with the data that, as we have noted, are often the high-risk types of access that can lead to data subject re-identification. There are also some IBM-focused research initiatives around leveraging and improving differential privacy approaches, which can add noise to query results without disturbing the agreed levels of utility and accuracy needed for the ultimate model-building purpose (a minimal sketch also follows at the end of this section). The net takeaway is that the correct types and amounts of privatization you need to apply, driven by your privacy policies, are ultimately decided by the Data Governance and Risk Management teams in concert with the Data Privacy and Data Science officers.

Lastly, for the ingestion and discovery zones and any respective sub-zones or sandboxes, appropriate data retention must be applied per the organization's data retention policy guidance. This is a key data management and governance best practice in traditional information lifecycle management circles, and it has been gaining greater visibility in the more recent data privacy regulations, which look for tangible organizational and technical measures that can evidence personal data minimization and consented use and processing.
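As promised, here is a simple, hypothetical illustration of a re-identification risk check in the style of k-anonymity, scanning for small groups of records that share quasi-identifier values. The column names and threshold are illustrative assumptions; production risk assessment tooling is far more sophisticated:

```python
from collections import Counter

# Records sharing a rare combination of quasi-identifiers are
# re-identification risks, especially once external datasets are joined in.
QUASI_IDENTIFIERS = ("age_band", "zip_region", "gender")  # illustrative columns
K_THRESHOLD = 5  # illustrative policy minimum group size

def risky_groups(records: list[dict], k: int = K_THRESHOLD) -> dict:
    """Return quasi-identifier combinations shared by fewer than k records."""
    groups = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
    return {key: size for key, size in groups.items() if size < k}

records = [
    {"age_band": "30-39", "zip_region": "100XX", "gender": "F"},
    {"age_band": "30-39", "zip_region": "100XX", "gender": "F"},
    {"age_band": "70-79", "zip_region": "994XX", "gender": "M"},  # singleton: risky
]
print(risky_groups(records))
# Both groups fall below k=5 and would be flagged for further privatization.
```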
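And as a rough illustration of the differential privacy direction mentioned above (this is the classic Laplace mechanism from the literature, not IBM's research implementation), calibrated noise can be added to an aggregate query answer from a sandbox:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Classic Laplace mechanism: a counting query has sensitivity 1, and a
    smaller epsilon means stronger privacy, at the cost of more noise."""
    return true_count + laplace_noise(sensitivity / epsilon)

# Example: a sandbox counting query answered with epsilon = 0.5.
print(noisy_count(true_count=412, epsilon=0.5))
```

The design point here matches the narration: epsilon is the knob the privacy and data science teams would negotiate to balance the agreed levels of utility against re-identification risk.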