The Weill Cornell Medicine Data Core is a secure computing environment for researchers, managed by staff in ITS Operations and the Data Core Team in the Samuel J. Wood Library. The Data Curation Service, a subset of the Data Core Team, enables scientists to engage with electronic data while adhering to project-specific guidelines for data security and research integrity, including data governed by the Health Insurance Portability and Accountability Act (HIPAA), the institutional review board (IRB), and the Federal Information Security Management Act (FISMA).
Members of the Data Curation Service are responsible for secure data export. Data used in the Data Core often originate from affiliates (Data Providers) external to the institution, including government agencies. The use of these data is often governed by a data use agreement (DUA), a contractual document used for the transfer of non-public data. When required by the agreement with a Data Provider, Data Curation Service team members review files to prevent unauthorized disclosure of protected health information (PHI), in a process referred to in this document as disclosure proofing. Data Core staff then download the files via a secure file transfer method, such as WinSock File Transfer Protocol (WS-FTP), and deliver them to the authorized user via a method deemed secure and appropriate to the content of the file.
In accordance with the above-stated security requirement, the following policy is designed to guide the process of disclosure proofing user-created data files for export from the Data Core.
Entities Affected by this Policy
This policy applies to faculty, staff, and students working in the Data Core whose source data are:
a. provided by an entity other than the researcher, and
b. not publicly available.
Who Should Read this Policy
All individuals accessing or storing data in the Data Core, as well as all individuals sending, receiving, or transmitting any data to or from the Data Core.
Reason for Policy
Despite the myriad benefits of data sharing, protecting the privacy of human subjects remains an imperative consideration. While de-identification of directly identifying information, such as the 18 identifiers enumerated in HIPAA regulations, was once considered sufficient to protect participant privacy, recent literature indicates otherwise. See the “Additional Resources” section below for more information.
Definitions
These definitions apply to terms as they are used in this policy.
Disclosure proofing is defined in this document as the process by which the risk of a breach of participant privacy is maximally reduced. Associated terms include disclosure risk limitation and differential privacy. Specifically, for the Data Core, disclosure proofing includes ensuring that data exported from a Data Core project do not disclose information that should be retained in the Core according to the project’s associated DUA.
Note: The act of disclosure proofing does not necessarily involve transforming data; rather, it is restricted to assessing whether transformation is needed. When transformation is deemed necessary, the recommended changes are communicated back to the requestor.
Disclosure risk limitation treats disclosure as the public identification of individual reporting units and of information about them (1).
Differential privacy is a rigorous mathematical definition of privacy. In the simplest setting, consider an algorithm that analyzes a dataset and computes statistics about it (such as the data’s mean, variance, median, or mode). Such an algorithm is said to be differentially private if, by looking at the output, one cannot tell whether any individual’s data were included in the original dataset. In other words, a differentially private algorithm guarantees that its behavior hardly changes when a single individual joins or leaves the dataset: anything the algorithm might output on a database containing some individual’s information is almost as likely to have come from a database without that individual’s information. Notably, this guarantee holds for any individual and any dataset; regardless of how eccentric any single individual’s details are, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds. This provides a formal guarantee that individual-level information about participants in the database is not disclosed (2). The risk of re-identification can be quantified (3,4), and methods involving decision rules have been proposed for estimating the uniqueness of entries in clinical data sets (5).
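As an illustrative sketch only (not part of any Data Core procedure), the simplest differentially private mechanism described above can be demonstrated in a few lines of Python; the function name and parameters here are hypothetical:

```python
import math
import random

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism (a sketch).

    Each value is clamped to [lower, upper] so that any one individual's
    contribution is bounded; the sensitivity of the mean is then
    (upper - lower) / n, and Laplace noise scaled by sensitivity / epsilon
    masks the presence or absence of any single record.
    """
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / n
    scale = (upper - lower) / (n * epsilon)
    # The difference of two Exp(1) draws is a standard Laplace sample
    e1 = -math.log(1.0 - random.random())
    e2 = -math.log(1.0 - random.random())
    return true_mean + scale * (e1 - e2)

# Smaller epsilon means stronger privacy and noisier output
ages = [34, 41, 29, 88, 52, 47]
print(dp_mean(ages, lower=0, upper=90, epsilon=1.0))
```

With epsilon = 1.0 and only six records, the added noise is large relative to the true mean; real deployments tune epsilon against the dataset size and the acceptable accuracy loss.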
1.01 User decides to share or export data
- The user, who may be the PI, a student, a researcher, or another authorized representative of the project, submits a request for data export via the Data Core’s Export Request Form in Qualtrics, a commercial survey tool managed by WCM ITS, or via email. If the request does not contain sufficient information for curation, then the Data Curation Service will ask the Principal Investigator (PI) for clarifications.
- The Qualtrics form automatically generates a ticket in ServiceNow, the WCM ITS service management system, and sends an email to the Data Core listserv. If some other request path is used, all activity should still be recorded or referenced from a single ServiceNow incident ticket, whose number becomes the export’s unique identifier for record-keeping purposes.
1.02 The Data Curation Service reviews relevant data governance documentation, including DUAs and IRB approval notifications, to determine whether data governance allows the PI to have direct access to all data.
If so, the requested files are released to the requester. If not, the process continues to the next step.
1.03 The Data Curation Service reviews the data’s corresponding DUA and:
- verifies data sharing of user-generated data is not prohibited, and
- identifies explicit restrictions or specifications for sharing or exporting data.
a. All explicit restrictions or specifications are documented in the note-taking log for the associated export and are included as requirements for approval of the export.
b. If no restrictions or specifications are explicitly stated, the data review process described below is followed.
1.04 Data Review process
The Data Core Team creates a ServiceNow ticket requesting transfer of the specified files for export, to be copied into the Data Core Team’s unique project environment to which only the team has access.
- The Data Curation Service determines whether the files constitute programming code (e.g., SAS code) or data that require review for HIPAA compliance.
- If the files are programming code files, one member of the Data Curation Service reviews them to ensure that no data are embedded in the code.
- To ensure an accurate assessment, two designated members of the Data Curation Service review data for high-risk variables and entries. Data properties to be evaluated include but are not restricted to the following:
a. Direct identifiers (such as social security number, medical record number, etc.)
b. Indirect identifiers (such as uncommon disease or medical center)
c. Geographic identifiers
d. Age of patient when data were collected
e. Outlying data
- Adherence to DUA-specified restrictions or specifications is reviewed.
- De-identification is validated by one or both of the following approaches, in accordance with HIPAA’s Privacy Rule (6):
a. Safe Harbor Rule: HIPAA’s 18 directly identifying variables (listed below) are suppressed.
i. Names
ii. All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
iii. All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
iv. Telephone numbers
v. Vehicle identifiers and serial numbers, including license plate numbers
vi. Fax numbers
vii. Device identifiers and serial numbers
viii. Email addresses
ix. Web Universal Resource Locators (URLs)
x. Social security numbers
xi. Internet Protocol (IP) addresses
xii. Medical record numbers
xiii. Biometric identifiers, including finger and voice prints
xiv. Health plan beneficiary numbers
xv. Full-face photographs and any comparable images
xvi. Account numbers
xvii. Certificate/license numbers
xviii. Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section
b. Expert Determination: Covered entities may also use statistical methods to establish de-identification instead of removing all 18 identifiers. The covered entity may obtain certification by "a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable" that there is a "very small" risk that the information could be used by the recipient to identify the individual who is the subject of the information, alone or in combination with other reasonably available information. The person certifying statistical de-identification must document the methods used as well as the result of the analysis that justifies the determination. A covered entity is required to keep such certification, in written or electronic format, for at least six years from the date of its creation or the date when it was last in effect, whichever is later (7).
- Data-specific treatment may be recommended by an expert member of the Data Curation Service. This involves review of the unique properties of the data and comprehensive analysis of potential risks to participant privacy. Recommended treatments to be applied to the data (6,8) may include:
a. Removal — Eliminating the variable from the dataset entirely.
b. Top-coding — Restricting the upper range of a variable.
c. Collapsing and/or combining variables — Combining values of a single variable or merging data recorded in two or more variables into a new summary variable.
d. Sampling — Rather than providing all of the original data, releasing a random sample of sufficient size to yield reasonable inferences.
e. Swapping — Matching unique cases on the indirect identifier, then exchanging the values of key variables between the cases. This retains the analytic utility and covariate structure of the dataset while protecting subject confidentiality. Swapping is a service that archives may offer to limit disclosure risk. For a more in-depth discussion of this technique, see O’Rourke (2006) (9).
f. Disturbing or adding noise — Adding random variation or stochastic error to the variable. This retains the statistical properties between the variable and its covariates, while preventing someone from using the variable as a means for linking records.
g. Suppression — These methods remove or eliminate certain features about the data prior to dissemination. Suppression of an entire feature may be performed if a substantial quantity of records is considered as too risky (e.g., removal of the ZIP Code feature). Suppression may also be performed on individual records, deleting records entirely if they are deemed too risky to share.
h. Generalization, or abbreviation, of the information — These methods transform data into more abstract representations. For instance, a five-digit ZIP Code may be generalized to a four-digit ZIP Code, which in turn may be generalized to a three-digit ZIP Code, and onward so as to disclose data with lesser degrees of granularity. Similarly, the age of a patient may be generalized from one- to five-year age groups.
i. Perturbation — Specific values are replaced with equally specific, but different, values. For instance, a patient’s age may be reported as a random value within a 5-year window of the actual age. In practice, perturbation is performed to maintain statistical properties about the original data, such as mean or variance.
j. Removal or treatment of state information — The Safe Harbor Rule requires geographic identifiers to represent populations greater than 20,000, but state information will sometimes be removed or otherwise treated prior to data disclosure as well (10).
k. Data element encryption — Encrypting a data element within the dataset to match a coded set that must be decrypted by those using the dataset. This is a suggested alternative to providing state information.
- Rationale is provided for selection of recommended data treatments or absence thereof.
- The following individuals sign off on the dataset as having been sufficiently disclosure proofed:
1. Someone with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable – these individuals generally have been certified in data de-identification techniques, and
2. The PI
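The Safe Harbor treatments above are applied by reviewers rather than by software, but two of them (top-coding ages at 90, and generalizing ZIP codes to three digits with low-population prefixes changed to 000) can be sketched in Python for illustration. The RESTRICTED_PREFIXES set below is a hypothetical stand-in, not the actual Census-derived list:

```python
# Illustrative stand-in for the Census-derived list of three-digit ZIP
# areas whose combined population is 20,000 or fewer; the real list must
# be taken from current publicly available Census Bureau data.
RESTRICTED_PREFIXES = {"036", "059", "102", "203"}

def top_code_age(age, cap=90):
    """Collapse all ages over 89 into a single '90+' category (Safe Harbor)."""
    return f"{cap}+" if age >= cap else str(age)

def generalize_zip(zip5):
    """Keep only the first three ZIP digits; change restricted prefixes to 000."""
    prefix = zip5[:3]
    return "000" if prefix in RESTRICTED_PREFIXES else prefix

record = {"age": 94, "zip": "10065"}
treated = {"age": top_code_age(record["age"]), "zip": generalize_zip(record["zip"])}
print(treated)  # {'age': '90+', 'zip': '100'}
```

A real review would also consider the remaining identifiers and the DUA-specific restrictions; this sketch only shows the mechanical form two of the treatments can take.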
1.05 Expiration of expert certification
A. When data are determined to be satisfactorily disclosure proofed, the expert who evaluated the data will also determine whether an expiration date should be applied to the certification and, if so, the date on which expiration should occur.
Documentation
- Informal note-taking logs for each export request.
- Database containing record of all data exports.
- Dataset-specific sign-off with rationale for the chosen treatments or absence thereof (see the attached “Data Export Review” form).
In accordance with HHS recommendations (6), this policy is reviewed and updated annually by the Weill Cornell Medicine Data Curation Service to ensure that data privacy is maintained when data are de-identified and disclosed, and that identity disclosure cannot be accomplished by triangulating data from various publicly available datasets.
1. Inter-university Consortium for Political and Social Research. Deductive Disclosure Risk [Internet]. [cited 2017 Oct 10]. Available from: https://www.icpsr.umich.edu/icpsrweb/content/DSDR/disclosure.html
2. Dwork C. Differential privacy: A survey of results. In: Agrawal M, Du D, Duan Z, Li A, editors. Theory and applications of models of computation. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 1–19.
3. El Emam K, Arbuckle L. Anonymizing Health Data. Sebastopol, CA: O’Reilly Media, Inc.; 2013.
4. El Emam K. Guide to the De-Identification of Personal Health Information. Boca Raton, FL: CRC Press; 2013.
5. Dankar FK, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Med Inform Decis Mak. 2012 Jul 9;12:66.
6. U.S. Department of Health & Human Services. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule [Internet]. [cited 2017 Aug 6]. Available from: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/
7. U.S. Department of Health & Human Services. How Can Covered Entities Use and Disclose Protected Health Information for Research and Comply with the Privacy Rule? [Internet]. [cited 2017 Oct 9]. Available from: https://privacyruleandresearch.nih.gov/pr_08.asp
8. Guide to Social Science Data Preparation and Archiving. Ann Arbor, MI: Institute for Social Research, University of Michigan; 2012.
9. O’Rourke JM, Roehrig S, Heeringa SG, Reed BG, Birdsall WC, Overcashier M, et al. Solving problems of disclosure risk while retaining key analytic uses of publicly released microdata. J Empir Res Hum Res Ethics. 2006 Sep;1(3):63–84.
10. Bohannon J. Genetics: Genealogy databases enable naming of anonymous DNA donors. Science [Internet]. 2013 Jan 18. Available from: https://insights.ovid.com/science/scie/2013/01/180/genetics-genealogy-databases-enable-naming/14/00007529