Posted on September 09, 2020
We often get the question: how safe is it to transfer patients’ genomic data to the cloud? By this, most people mean the transfer of their NGS-based patient data to external web-based services (e.g. Illumina Basespace) that run on the public cloud of third-party providers (e.g. Amazon AWS).
Of course, the security of patient genetic data should be a top priority. New sequencing technologies create large data volumes for each patient, and small labs and hospitals often assume that they cannot cope with the required increase in computational resources, leaving the cloud as the only option. After all, cloud-based solutions are well suited to cases where varying amounts of data must be processed flexibly. But it is also true that a patient’s genetic testing data has a different privacy value compared to, for example, videos, music, images or other large data. Genetic testing data contains highly personal information, including health, behavioural and biometric information, and is an even more unique marker than a fingerprint. A DNA data breach is much worse than a credit card breach.
With that in mind, let us address the following misconceptions and myths we often hear.
A good practice in clinical diagnostics is the use of anonymized patient identifiers instead of real names. In the age of genomics, however, this is no longer sufficient. Everybody working in genetics should internalize this: there is no effective way to anonymize genetic data! It is entirely possible to identify a person from any sufficiently large genetic data set, because the data is intrinsically identifying.
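To give a feeling for what “sufficiently large” can mean, here is a rough back-of-the-envelope sketch. It rests on strong simplifying assumptions that are ours, not measured values: the SNPs are statistically independent, genotypes follow Hardy-Weinberg proportions, and every SNP has the same minor allele frequency. The function names are hypothetical.

```python
import math

def match_probability_per_snp(minor_allele_freq: float) -> float:
    """Chance that two unrelated people share the same genotype at one SNP."""
    q = minor_allele_freq
    p = 1.0 - q
    # Hardy-Weinberg genotype proportions (an assumption of this sketch)
    genotype_freqs = (p * p, 2 * p * q, q * q)
    return sum(f * f for f in genotype_freqs)

def snps_needed(population: float, minor_allele_freq: float = 0.5) -> int:
    """Number of independent SNPs at which the chance of a random match
    drops to 1/population or below."""
    per_snp = match_probability_per_snp(minor_allele_freq)
    return math.ceil(math.log(population) / -math.log(per_snp))

# Idealized case: 8 billion people, every SNP with allele frequency 0.5
print(snps_needed(8e9))  # prints 24
```

Under these idealized assumptions a few dozen well-chosen positions are already enough to single out one person, whereas an exome or genome contains vastly more variants. Real data is messier (linked variants, relatives, rarer alleles), so the exact number should not be taken literally.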
For this very reason, genetic data is considered personal data (often called Personally Identifiable Information, PII) under Europe’s privacy law, the GDPR. Other examples of personal data include fingerprints, real names, identification numbers, location data and online identifiers (e.g. IP addresses). In addition, because genetic data can reveal further information about a person, such as ethnicity, it is explicitly part of the “special categories of personal data” according to Article 9 of the GDPR. That means that the GDPR imposes tight regulations on the handling of genetic data itself. These include stringent security requirements, restrictions on data access and the non-trivial task of obtaining legally watertight consent. Any sloppiness with, or even leakage of, genetic data therefore comes not only with a loss of reputation but also with potentially large GDPR-related fines and compensation claims.
It might be true that some vendor X has a great IT security team and is therefore able to secure their servers very well. Yet your data is still likely to be less secure. Consider that the total risk of data leakage is not just the risk at the weakest party; it is the combined risk across all involved parties. Some IT specialists describe the cloud as “just someone else’s computer”. In practice, several parties (e.g. the web service provider, the infrastructure provider, third-party service providers) potentially have access to the computing and storage devices that contain patient data. Obviously, the more parties have access to the data, the larger the “attack surface”, that is, the potential for data leaks. This total risk should be minimized.
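The idea of combined risk can be made concrete with a small calculation. The sketch below assumes that breaches at the different parties are independent events, and the yearly probabilities are made-up numbers chosen purely for illustration, not estimates for any real vendor.

```python
from math import prod

def combined_breach_risk(party_risks: dict[str, float]) -> float:
    """Probability that at least one party is breached, assuming independence."""
    return 1.0 - prod(1.0 - p for p in party_risks.values())

# Hypothetical yearly breach probabilities, for illustration only
parties = {
    "own lab": 0.02,
    "web service vendor": 0.02,
    "cloud infrastructure provider": 0.01,
    "payment processor": 0.03,
}

total = combined_breach_risk(parties)
print(f"largest single risk: {max(parties.values()):.3f}")  # 0.030
print(f"combined risk:       {total:.3f}")                  # 0.078
```

Even when every single party looks reasonably safe on its own, the combined risk is larger than the risk of any one of them, and it grows with each party that is added to the chain.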
A leak can occur at any involved party, be it your lab, a service vendor, additional cloud vendors etc. Consequently, you should be aware of, and audit the security of, all involved parties. There are vendors that you might not even think of at first, such as the payment processing vendor that is linked to a 2019 breach of the records of more than 19 million American patients.
When a leak occurs, it is usually difficult to determine who is really at fault, as can be seen in the case of AWS and Capital One. The legal consequences in such a case are difficult to assess, and insurance companies will only compensate when it is clearly established who made the mistake. But no matter who is at fault, the reputation of the diagnostic labs involved will be ruined.
As a final point, if a malicious actor were able to obtain access to your lab’s computer infrastructure, they would likely be able to use it to access the information available via external providers or services as well. So the total risk should always include local lab security.
For the data to be analysed by external vendors, they must have complete access to the raw, unencrypted data. Nobody can do meaningful data analyses with encrypted data. To display patient reports to you, those reports must be stored in the vendor’s database. When you use a VPN to work with the external service over the internet, the data is only encrypted on the way from your computer to the service provider. In fact, a similar kind of “transport encryption” is active by default on nearly every website you visit, using TLS, the successor to SSL (see the lock symbol in your browser bar). This means that any additional communication security between your web browser and the service provider is mainly window dressing, with no real security benefit.
There are many certifications that can be applied in the scope of genetic data processing. Many of them do not specifically address IT security, or do so only at a very basic level. Certification guidelines often cannot keep up with the speed of development, for example the massive increase in genomic data that is now possible with the newest generation of sequencing machines. As a case in point, it is known that “HIPAA does not consider genome sequences as identifying information that has to be removed under the Safe Harbor Method for deidentification”. With today’s data volumes, genomic data cannot be considered deidentifiable anymore. You are responsible for establishing appropriate security levels and performing due diligence rather than relying on possibly unrelated or obsolete certifications.
The major benefit of public clouds is that, within minutes, you can have dozens of computers working for you. This is a highly appreciated feature, in particular for research that processes large data volumes in short analysis “peaks”. In our experience, most routine labs are able to handle large data volumes and sequencing throughput themselves. We have seen labs with dozens of sequencers that process their data in-house easily, and probably save significantly compared to cloud computing. Often, internal computer infrastructure for workstations, LIMS and medical records servers is maintained anyway. Also, you would perhaps be surprised at what a single server with optimized software can compute: 30 exomes in a day are doable.
Fingerprints are, like genetic data, personally identifying information. Nowadays many phones include a fingerprint sensor, making access to your mobile phone both secure and convenient. For this feature, your phone must be able to store and analyse fingerprints. How does it secure your fingerprint from potentially malicious apps installed on your phone? Does it maybe even send your fingerprint to the cloud for backup?
It turns out that phone developers took these concerns seriously and worked out a secure way to deal with fingerprints. Android phones are required to use a special isolated area on the phone, called a Trusted Execution Environment (TEE), to store and analyse fingerprint data. This is often a separate CPU with its own memory and its own operating system, completely isolated from the rest of the phone’s system. When you register a fingerprint on your phone, the scan from the sensor is sent to the TEE, which creates validation data and encrypted fingerprint data. Because of this encryption, it is not possible to make sense of the data even if some app managed to get hold of it. Hence there is no way that the fingerprint ever leaves the phone. All validations are done in the TEE, and only cryptographic proofs of the validation are delivered.
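The principle of “validate inside, deliver only a proof” can be sketched in a few lines. This is a toy model and not how Android actually implements it: `ToyEnclave` and its methods are hypothetical names, the exact-match comparison stands in for the fuzzy matching that real fingerprint sensors need, and in this sketch the enclave itself checks the proof.

```python
import hashlib
import hmac
import secrets
from typing import Optional

class ToyEnclave:
    """Hypothetical stand-in for a TEE: template and key stay private."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)              # never leaves the enclave
        self._template_digest: Optional[bytes] = None    # stored form of the enrolled scan

    def enroll(self, scan: bytes) -> None:
        # Simplification: keep only a keyed digest of the scan
        self._template_digest = hmac.new(self._key, scan, hashlib.sha256).digest()

    def verify(self, scan: bytes, challenge: bytes) -> Optional[bytes]:
        """Return a proof bound to the challenge if the scan matches, else None."""
        if self._template_digest is None:
            return None
        candidate = hmac.new(self._key, scan, hashlib.sha256).digest()
        if not hmac.compare_digest(candidate, self._template_digest):
            return None
        return hmac.new(self._key, b"match:" + challenge, hashlib.sha256).digest()

    def check_proof(self, challenge: bytes, proof: bytes) -> bool:
        expected = hmac.new(self._key, b"match:" + challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, proof)

enclave = ToyEnclave()
enclave.enroll(b"alice-fingerprint-scan")

challenge = secrets.token_bytes(16)   # fresh per request, prevents replaying old proofs
proof = enclave.verify(b"alice-fingerprint-scan", challenge)
print(proof is not None and enclave.check_proof(challenge, proof))  # True
print(enclave.verify(b"someone-else", challenge))                   # None
```

The app outside the enclave only ever sees the proof, a short value that says “the scan matched” for this one request. Neither the fingerprint nor the key can be reconstructed from it.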
Many misconceptions circulate in the area of genomic data security, in particular in the field of cloud services. Cloud computing brings convenience: customers don’t need to install software and can outsource IT services and infrastructure. But there are cases where risks, data privacy regulations and practical considerations outweigh this convenience. The sensitive information intrinsic to genetic data is often overlooked in the genomics age. If you would hesitate to transfer customer fingerprints, patient names and medical reports to the cloud, then you should make no exception for modern genetic testing data.