After the Final Report for an evaluation is published, per the terms of MCC’s Standard Evaluator SOW, contractors are allowed a period of exclusivity not exceeding six (6) months during which only they will have access to the data package. This exclusivity period facilitates contractors’ completion of academic articles and other analysis prior to allowing new researchers access to the data. However, MCC aims for the full data package (all rounds of data) to be accessible no later than six months following publication of the Final Report, and any extension of the exclusivity period beyond six months requires approval from the MCC PM and COR. In any event, contractors should complete data preparation, review, and clearance before the end of the contract period of performance.
8.1. Data Package
MCC anticipates that data from each evaluation – or similar data activity – may fall into one of the following categories:
- Public-use data. This is data that has been de-identified or does not require de-identification and may be shared publicly without posing a risk of harm to the data providers. For independent evaluations, this data will be available for direct download from the MCC Evaluation Catalog.
- Restricted-access data. This is data that may contain identifiers (direct and/or indirect) requiring that any sharing of the data be subject to conditions that MCC determines in its discretion are appropriate to protect the data provider’s confidentiality. MCC currently does not share data on a restricted-access basis.
- No access data. This is data that cannot be sufficiently de-identified so as to be made accessible through either public-use or restricted-access. When preparing a No-Access data file, the Contractor and PM should work together to determine which Data Package Requirements should be submitted. In some cases, a full Data Package may still be submitted to fully document the No-Access decision. In other cases, it may be sufficient to notify MCC that the data is No-Access and provide the Transparency Statement.
When ready to prepare data for public-use and/or restricted-access, contractors should expect to prepare and submit the following package to MCC:
|DRB Data Package Worksheet||Word (Annex 5)||Contractors will complete this worksheet to document the actions taken to de-identify and prepare the data for public and/or restricted-access use.|
|Data – Clearly labeled as (i) Public Use, (ii) Restricted Access, and/or (iii) No Access (if justified)||Stata 13 (or other format agreed with MCC)||This should be the complete data file(s) – including the full dataset as collected (required) and any constructed analysis variables (optional – it is assumed analysis code will produce these). The ability to de-identify the data as per informed consent promises will inform whether or not this data is public use, restricted access, or no access.|
|Data Codebook – Public Use and/or Restricted Access only||Stata codebook output to review data – the codebook should include a label book as well as basic summary statistics including frequency and distribution information.|
|Analysis Code||Stata do file||This is the analysis code to produce the variables and analysis reported in the analysis report(s).|
|Transparency Statement||Searchable PDF||Contractors should prepare a Transparency Statement which states the extent to which data (public use and/or restricted access) can enable computational reproducibility of results presented in report(s).|
If necessary, this package should also include any updates to the Metadata for the MCC Evaluation Catalog.
8.2. Data De-Identification
To adhere to promises of confidentiality made during the informed consent process and to mitigate risks to data providers for providing PII and/or sensitive data in the data package, data that is prepared for public-use must be de-identified. For restricted-access use data, the level of data de-identification may vary depending on promises of confidentiality made. Prior to conducting data de-identification actions, contractors should:
- Consider risk factors and probability of re-identification as presented below in Table 6.
- Maintain a balance between applying data perturbation-based methods and techniques to de-identify data and ensuring the quality, usability, and relevance of the data. In many cases, significant de-identification efforts may result in data that is less useful and/or relevant, even for computational reproducibility of original study analysis.
- Carefully consider combinations of variables, even when individual variables do not pose a re-identification risk. For example, age, gender, or marital status alone may not pose re-identification risk, but when combined these variables may be sufficient to identify the data provider, resulting in a re-identification risk.
|Risk Factor for re-identification||Lower probability||Higher probability|
|Sample representation: Are outliers in the data outliers in the general population?||When the sample is a small percentage of the general population, visible and known characteristics that are outliers in the sample may not pose a re-identification risk because there are other similar individuals/households/businesses/etc. in the sample frame||When the sample is a large percentage of the general population, visible and known characteristics that are outliers in the sample may pose a stronger re-identification risk because there are few to no other similar individuals/households/businesses/etc. in the sample frame|
|Linkage documentation: What documentation about the sample exists outside the research data but can link to it?||If little to no documentation exists about the study sample, then linkage documentation may not pose a re-identification risk||If documentation exists about the study sample, then linkage documentation may pose a re-identification risk (examples: loan information obtained on study sample mirrors loan information at bank)|
|Timing and population characteristics: How closely does the data reflect current and future state for the sample population?||If significant time has passed and the study population is transient or nomadic, there is lower re-identification risk||If the data was recently collected and the study population is more permanent, there is higher re-identification risk|
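The point made above about combinations of variables can be checked mechanically before any de-identification action is chosen. The following is a minimal sketch in Python, not an MCC-prescribed method: the function name `small_groups`, the sample records, and the minimum group size of 2 are all hypothetical and chosen only for illustration. It counts how many records share each value of a variable, or each combination of values across several variables, and reports the groups that fall below the chosen size.

```python
from collections import Counter

def small_groups(records, variables, min_group_size):
    """Return value combinations shared by fewer than min_group_size records.

    Hypothetical helper for illustration; the threshold is an assumption
    that should be set based on the data and the study population.
    """
    counts = Counter(tuple(r[v] for v in variables) for r in records)
    return {combo: n for combo, n in counts.items() if n < min_group_size}

# Made-up records, not taken from any real survey.
records = [
    {"age": 34, "gender": "F", "marital_status": "married"},
    {"age": 34, "gender": "F", "marital_status": "married"},
    {"age": 71, "gender": "M", "marital_status": "widowed"},
    {"age": 71, "gender": "M", "marital_status": "widowed"},
    {"age": 34, "gender": "M", "marital_status": "married"},
    {"age": 34, "gender": "M", "marital_status": "married"},
    {"age": 71, "gender": "F", "marital_status": "widowed"},
]
quasi_identifiers = ["age", "gender", "marital_status"]

# Each variable checked on its own: no value is held by fewer than 2 records.
alone = {v: small_groups(records, [v], 2) for v in quasi_identifiers}

# All three variables checked together: one combination is unique.
combined = small_groups(records, quasi_identifiers, 2)
```

In this made-up sample, no single variable isolates a respondent, but the combination of age 71, female, and widowed appears only once, which is the kind of case that would warrant a closer look against the risk factors in the table above.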
Once the above has been considered, contractors may consider the following high-level data perturbation techniques for data de-identification:
- Removal of all direct identifiers. Removal of direct identifiers may not be as simple as removing the specific variables where known direct identifiers were recorded by the survey team. For example, the written response within “Other” responses may include direct identifiers.
- Geographic units. Contractors should consider the highest geographic level that should remain identifiable for specific analytic purposes and de-identify all lower geographic units. Similar to the discussion above on sample representation, the higher the geographic unit that is de-identified, the lower the risk for re-identification at individual, household, and other sample unit levels and often less data perturbation is necessary on a variable-by-variable basis.
- Top and Bottom Coding. When specific continuous variables are visible and/or known characteristics about the data provider (e.g. visible asset holdings, age, years of education), outliers may need to be considered for top and bottom coding. There is no specific rule (top and/or bottom 2%, 5%, etc.) because the decision on where to cut outliers should be made based on the data and what is known about the study sample population. To retain data values and avoid lost data, contractors can replace outlier values with the median once a threshold is identified.
- Re-categorization. When specific categorical variables are visible and/or known characteristics about the data provider (e.g. ethnicity, religion, language spoken, education level), minority groups may need to be considered for re-categorization. To retain the value of the data, it’s preferable to re-categorize into meaningful groups, combining categories, rather than collapsing into an unknown “Other” category. However, this is dependent on context, data, and risk.
- Removal of indirect identifiers. When specific variables cannot be retained given potential re-identification risk, the variable(s) should be removed from public-use datasets (and clearly documented as removed).
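Two of the techniques listed above, top and bottom coding and re-categorization, can be illustrated with a short Python sketch. This is an illustration under stated assumptions rather than a required implementation: the function names, the thresholds of 18 and 80, the category mapping, and the sample values are hypothetical, and the median is assumed to be taken over all values of the variable, which the guidance does not specify.

```python
from statistics import median

def top_bottom_code(values, lower, upper):
    """Replace values outside [lower, upper] with the median.

    Assumption: the median is computed over all values, outliers included.
    The thresholds must be chosen from the data and the study context.
    """
    mid = median(values)
    return [v if lower <= v <= upper else mid for v in values]

def recategorize(values, mapping):
    """Combine detailed categories into broader, still meaningful groups.

    Categories absent from the mapping are left unchanged.
    """
    return [mapping.get(v, v) for v in values]

# Made-up values for illustration only.
ages = [19, 25, 31, 38, 42, 47, 96]
coded_ages = top_bottom_code(ages, lower=18, upper=80)

education = ["primary", "secondary", "bachelor", "doctorate"]
education_groups = {
    "bachelor": "tertiary",
    "master": "tertiary",
    "doctorate": "tertiary",
}
coded_education = recategorize(education, education_groups)
```

Here the single age above the upper threshold is replaced with the median (38), and the rare degree categories are merged into a "tertiary" group instead of an uninformative "Other". Any such recoding would still need to be recorded in the DRB Data Package Worksheet.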
8.3. Qualitative Data
As of February 2020, MCC typically does not expect qualitative data to be prepared for public-use given unknowns regarding de-identification and usability of qualitative data. However, if contractors determine it is feasible and appropriate to prepare the data for restricted-access, they should work with MCC to determine whether and how to proceed with data preparation for sharing.
All relevant documentation should still be shared for public dissemination as per Table 3.
8.4. Transparency Statement
Contractors should prepare a Transparency Statement which states the extent to which data (public use and/or restricted access) can or cannot reproduce the results presented in the evaluation report. This will be discussed with the DRB and then finalized based on the final approved data file(s). Contractors may reference Annex 6 as a template.
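One way to gather evidence for the statement is to rerun the analysis code on the de-identified data and compare each estimate with the value published in the report. The Python sketch below is a hypothetical illustration only: the function name `compare_results`, the result labels, the tolerance, and the numbers are assumptions and are not part of the Annex 6 template.

```python
import math

def compare_results(reported, reproduced, rel_tol=1e-6):
    """Classify each reported estimate by whether the shared data reproduces it.

    reported / reproduced: dicts mapping a result label to a numeric estimate.
    rel_tol is an assumed tolerance, not an MCC requirement.
    """
    status = {}
    for name, value in reported.items():
        if name not in reproduced:
            # For example, a required variable was removed from the file.
            status[name] = "cannot be reproduced"
        elif math.isclose(value, reproduced[name], rel_tol=rel_tol):
            status[name] = "reproduced"
        else:
            # For example, top coding changed the estimate.
            status[name] = "differs"
    return status

# Made-up numbers for illustration only.
reported = {"income_effect": 0.125, "mean_age": 41.3, "village_effect": 0.08}
reproduced = {"income_effect": 0.125, "mean_age": 40.9}
summary = compare_results(reported, reproduced)
```

A summary of this kind could then be described in prose in the Transparency Statement, noting which results the public-use and/or restricted-access data can and cannot reproduce.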