Crack ‘AI for health’ puzzle

Thursday, 26 September 2019 | Abhinav Verma, Vivek Eluri

Unlocking Government datasets is necessary for training and testing Artificial Intelligence systems in Indian healthcare from a regulatory standpoint

Back in 2006, British mathematician Clive Humby famously said, “data is the new oil,” in the context of Tesco’s Clubcard loyalty programme, a basic marketing gimmick. Little did he know that five years on his words would take on a life of their own with the onset of Industrial Revolution 4.0, which is largely driven by Artificial Intelligence (AI), Big Data and the Internet of Things (IoT). As the world stands on the brink of AI-driven disruption, data is all the more critical.

Simply put, an AI programme is computer code that can learn from real-world situations and adjust itself to perform more efficiently. To do so, it must be fed large datasets from which it can learn, find patterns and draw conclusions. Its accuracy then improves over time as more data is fed in. As AI is developed to solve increasingly complex problems, especially in healthcare, where it may be asked to help diagnose a patient, impeccable accuracy is imperative. This in turn requires developers to have access to large amounts of representative, good-quality data.

Recently, NITI Aayog’s proposal to develop an institutional AI framework that identified health as a core area was approved by the Finance Ministry. In the coming weeks, the Union Health Ministry is also expected to release the final National Digital Health Blueprint (NDHB). These events are set to move conversations around AI in healthcare to centre stage, but the question of relevant data for the development of AI systems for public health is yet to be answered comprehensively. Two dimensions of this question have to be considered: what data is available and how it can be accessed.

While clinical data in the form of longitudinal Electronic Health Records (EHR) will be critical for the future development of this industry, mass adoption of interoperable EHR systems is moving slowly in India. In the interim, disease registries, service and logistics tracking data and repositories of radiology scans might all serve as starting points to develop useful AI applications. These applications have to be trained on large datasets of the kind that are usually available with the Government through its national programmes and public health facilities. Because such datasets are drawn from across the population, training on them also helps make applications broadly applicable and less error-prone. However, these datasets remain inaccessible to academia and the private sector alike.

Government as a gatekeeper: Compared to data in other sectors, health data is highly sensitive and requires the Government to protect it by limiting access. However, as the open data movement takes hold the world over, India has also been making attempts at unlocking its wealth. The National Data Sharing and Accessibility Policy (NDSAP) was released to manage the release of Government data, but it has not been implemented equally across Government departments. The data platform set up thereunder, data.gov.in, still lacks critical, up-to-date datasets that are granular and clean enough to deliver value. Most departments still haven’t uploaded their mandatory minimum of five datasets and those that have, have uploaded only top-level metadata. The Government has multiple concerns about unlocking data, some legitimate but easily solvable and others deeply embedded, which require a shift in culture. In the absence of a comprehensive data privacy legislation, even officials who understand the potential of open data shy away from taking initiatives to unlock it. This is complicated further by the recent global scandals relating to commercial exploitation of personal data by tech giants, which have made officials wary of companies dealing in data. Unlocking data has another unintended consequence: it brings to the fore concerns regarding the quality of the data collected, especially when compared with data collected in parallel through different programmes. This thwarts even efforts at inter-ministerial data sharing, let alone making datasets publicly available.

One issue with most datasets that have been digitised is machine readability. Scans of handwritten physical documents are illegible to machines without proper annotation, such as labelling key headers. The task of making even some of the Government data machine readable is cumbersome, but the benefits, in terms of cost savings in resource planning and efficiencies in clinical care and public health at large, make a compelling case for it.

Concerns around commercial innovations on citizen data: For the select few within the Government who view data as an asset with commercial value, the natural instinct is to protect what is primarily a public good from private sector exploitation. Whether data collected by the Government should be made available for free is a polarising question. On one hand, it can be argued that the data generated through citizens should be freely released to companies so that they can innovate and in turn provide better services. However, this school of thought doesn’t account for the fact that while companies make commercial gains from citizens’ data, the people are given financial or other benefits neither at the point of data sharing nor at the point of receipt of the final service. Thus, engagement with patients and incentivisation are necessary so that they can be convinced to share their data using comprehensive consent managers and opt-out options. This adds to the Government’s cost burden of creating datasets for sharing.

In fact, in making its data usable, the Government would incur immense costs in annotating and cleaning the data. Open data as a principle is not opposed to levying a fee for use of Government datasets. Certain base datasets can be released free of cost while premium sets can be made subject to payment. However, distinctions must be made between non-commercial and commercial use in terms of user fees. Models that charge commercial entities in order to adequately subsidise non-profit access to datasets can be explored. Outside of user charges, innovative models such as an obligatory public service clause can also be introduced in data-sharing agreements for innovators using public data, which can ensure public benefit at large.

Moving swiftly to find this piece of the jigsaw: Earlier this year, AI consultancy Oxford Insights’ AI Readiness Index pointed out that while India sits at a comfortable 19th rank, it severely lags behind Asian competitors like China and Singapore in open data and data availability. Advancing India’s global leadership will require the Government to act swiftly to develop data-sharing collaborations and protocols. Smaller datasets and non-commercial collaborations might be a safe first step.

Globally, medical colleges and universities are leading the way in making smaller datasets publicly accessible in order to spur innovation. The Stanford ML Group’s open datasets like CheXpert, for instance, allow participants access to over 14,000 standardised high-resolution chest X-ray scans to build and test their AI algorithms. However, such open datasets will continue to have limited use, and Government (or Government-approved) datasets are necessary for training and testing AI systems from a regulatory standpoint. The Government is well aware of this necessity and therefore, NITI Aayog has tied up with the Department of Biotechnology to set up an imaging ‘biobank’ for cancer. How much of the data will be meaningful and open for all remains to be seen, but initiatives like these are crucial.

Traditionally, data is seen as something to be protected and kept hidden instead of a resource that could bring about economic and social value when optimally utilised. Once data is seen as a public asset, sharing will kick-start automatically. Environments such as data sandboxes or small data lakes that allow innovators to train their AI must be explored. In March 2019, CMS, the US federal agency overseeing the Medicare and Medicaid programmes, launched the AI Health Outcomes Challenge, which invited AI solutions for predetermined problems using certain Medicare claims datasets. Such confidence-boosting steps show bureaucrats the value that emerging technology applications can bring.

Unlocking data also means exploring watertight ways of preserving the privacy, security and traceability of data. Advances in technologies such as distributed ledger technology (of which blockchain is a popular form) hold answers to problems such as ensuring single use when datasets are shared. Technologies such as homomorphic encryption may enable training on a dataset without having to expose the sensitive parts of the dataset. Debates around tactical details of data sharing will go on, but one thing is clear: there is no value in keeping data locked. It is imperative that we now move beyond mere conversation and actually propel action to unlock meaningful data.
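
For readers curious about what homomorphic encryption looks like in practice, the idea can be sketched with a toy version of the Paillier scheme, which is additively homomorphic. This is only an illustration under loud assumptions: the primes are far too small to be secure, the hospital counts are made up, and the function names (keygen, encrypt, decrypt, add_encrypted) are hypothetical. It shows the simplest case, an analyst totalling encrypted figures without ever seeing them; training a model on encrypted data would need considerably more capable schemes.

```python
import math
import secrets


def keygen(p, q):
    """Build a toy Paillier key pair from two distinct primes (NOT secure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for generator g = n + 1
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_squared = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in 1..n-1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext integer using the private key."""
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, ciphertexts):
    """Multiply ciphertexts; the product decrypts to the sum of the plaintexts."""
    total = 1
    for c in ciphertexts:
        total = (total * c) % (n * n)
    return total


if __name__ == "__main__":
    # Hypothetical monthly case counts reported by three hospitals.
    public_key, private_key = keygen(1009, 1013)
    counts = [120, 75, 310]
    encrypted_counts = [encrypt(public_key, c) for c in counts]

    # The analyst combines ciphertexts without access to any single count.
    encrypted_total = add_encrypted(public_key, encrypted_counts)

    # Only the key holder can read the aggregate.
    print(decrypt(public_key, private_key, encrypted_total))
```

In this sketch, the party doing the arithmetic never holds the private key, so individual figures stay hidden while the aggregate remains usable. A real deployment would rely on a vetted cryptographic library and much larger keys.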

(Verma is a lawyer and policy consultant and Eluri has led digital transformation projects in the pharmaceutical industry. Both are with the International Innovation Corps, University of Chicago.)