Retrieving and sharing bioassay data can be a difficult and time-consuming process for researchers due to the large volume and complexity of the data generated, lack of industry standards and limitations of current data models. As part of efforts to address some of these issues and enable greater data sharing, the Pistoia Alliance recently launched the second phase of its DataFAIRy project.
Technology Networks had the pleasure of speaking with Dr. Vladimir Makarov, project manager of the Pistoia Alliance Artificial Intelligence and Machine Learning Centre of Excellence, to learn more about the DataFAIRy project, its aims and importance.
Anna MacDonald (AM): Can you tell us about the origins and aims of the DataFAIRy project?
Dr. Vladimir Makarov (VM): Managing and utilizing unstructured data has long been a challenge in scientific R&D. As the volume and variety of scientific data continue to grow, the problem is becoming more complex.
For example, bioassay protocols – the “how-to” information found in the “Methods” sections of publications, which constitutes the assay metadata – are a type of data that until recently existed only as unstructured text. This is a major barrier to innovation. It is hard to find, evaluate and validate assay protocols. Our interviews show that scientists spend a long time – up to 12 weeks per assay – selecting and planning new experiments. Assay protocols become obsolete. Reproducibility of research suffers. Assay metadata is also a popular data type for post hoc data mining. Not having it readily available in a FAIR (Findable, Accessible, Interoperable, Reusable) format makes this research hard as well.
This is why the Pistoia Alliance developed the DataFAIRy project. We use a “human-in-the-loop” approach where the output from an automated natural language processing (NLP) engine is vetted by human experts. Results are then used for the continuous improvement of the NLP software. The annotated bioassay protocols are deposited into PubChem, a major public data resource, where they are freely available. Access to this information by the scientific community will help speed up R&D so that products such as new drugs can be brought to market faster.
AM: What are the FAIR guiding principles and why are they so important?
VM: A lot of valuable data is currently siloed in different formats and locations. This makes it extremely difficult and time-consuming to retrieve and share – rendering it essentially unusable. The FAIR principles set out to overcome this. First developed in 2016, FAIR standards offer organizations guidance on how to record and store the information they generate, so that it retains its value.
Specifically, there is an emphasis on making data machine-readable. This means computers can automatically find and act on relevant data, reducing the burden on scientists. Making data FAIR also improves its quality, enabling better implementation of artificial intelligence (AI) and machine learning (ML) methods across the industry. FAIR data also helps to facilitate collaboration by making data interoperable. As the past year has shown us, collaboration and digitalization are essential for driving breakthroughs.
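To make "machine-readable" concrete, here is a minimal Python sketch of what structured assay metadata allows a computer to do. It is an illustration only: the field names (`target`, `assay_format`, `detection_method`), the record IDs and the `find_assays` helper are all hypothetical and are not taken from the DataFAIRy data model or from PubChem.

```python
import json

# Made-up example records; the field names are illustrative assumptions,
# not the DataFAIRy schema.
ASSAYS_JSON = """
[
  {"id": "assay-001", "target": "EGFR", "assay_format": "biochemical",
   "detection_method": "fluorescence"},
  {"id": "assay-002", "target": "EGFR", "assay_format": "cell-based",
   "detection_method": "luminescence"},
  {"id": "assay-003", "target": "BRAF", "assay_format": "cell-based",
   "detection_method": "luminescence"}
]
"""


def find_assays(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


records = json.loads(ASSAYS_JSON)
matches = find_assays(records, target="EGFR", assay_format="cell-based")
print([record["id"] for record in matches])
```

Once each protocol is a record with consistent fields, a query like this replaces reading through free-text "Methods" sections by hand; the same question asked of unstructured text would need expert review of each document.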
AM: How has bioassay protocol data traditionally been recorded? What problems are associated with this?
VM: Like many types of lab data, assay protocols exist in plain-text formats. Currently, there are more than 1.3 million biological assay protocols, including published papers and bench notes. Most of these data are partially annotated in public data banks, but the depth and quality of these annotations are not good enough for the data to be used in automated mining or applied to answer new business questions.
For the scientist, this is a problem because they must spend time manually sifting through vast libraries of old publications, rather than conducting new research. Finding information on the specific experimental conditions requires extensive, and therefore expensive, expert review. Errors in assay descriptions travel from one publication to another, making research hard to reproduce. Some experiments that are already known to fail are unintentionally repeated. Meta-analysis of the data already accumulated is also hard. In turn, these issues delay the invention of new drugs and ultimately affect patients.
AM: How will the DataFAIRy model automate data annotation?
VM: We first conducted an extensive analysis of the needs of a typical scientist in the pharmaceutical industry. Then we developed an ontology-based data model that would enable one to answer the typical data mining questions. We use this model in an NLP software application that allows us to conduct expert review of the extracted values, and that then learns from the expert reviews. This “human-in-the-loop” approach guarantees high quality of the output annotations. Currently, we are thinking of ways to scale up the annotation process by up to 100-fold. This may require new approaches to both ML and the human user interface design.
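The extract, review and learn cycle described here can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the DataFAIRy software: `ToyExtractor` is a hypothetical keyword lookup standing in for a real NLP engine, and the field names and terms are invented for the example.

```python
class ToyExtractor:
    """Hypothetical stand-in for an NLP engine: a phrase lookup that
    grows as experts correct it. Real systems use trained models."""

    def __init__(self):
        # Maps a lowercase phrase to (field name, controlled term).
        self.lexicon = {"luciferase": ("detection_method", "luminescence")}

    def extract(self, text):
        """Propose annotations for every known phrase found in the text."""
        lowered = text.lower()
        found = {}
        for phrase, (field_name, term) in self.lexicon.items():
            if phrase in lowered:
                found[field_name] = term
        return found

    def learn(self, phrase, field_name, term):
        """Store an expert-supplied mapping for future extractions."""
        self.lexicon[phrase.lower()] = (field_name, term)


def annotate(extractor, text, expert_feedback):
    """Run one human-in-the-loop pass.

    expert_feedback is a list of (phrase, field name, term) tuples that
    the reviewer adds or corrects after seeing the machine's proposal.
    """
    approved = extractor.extract(text)  # machine proposal
    for phrase, field_name, term in expert_feedback:
        approved[field_name] = term     # expert overrides or adds
        extractor.learn(phrase, field_name, term)  # feed back to the engine
    return approved


extractor = ToyExtractor()
first = annotate(
    extractor,
    "HEK293 cells were assayed with a luciferase reporter.",
    [("HEK293", "cell_line", "HEK293")],
)
# The engine now recognises the cell line without expert input.
second = extractor.extract("Compounds were tested in HEK293 cells.")
print(first)
print(second)
```

The point of the design is that each expert correction is captured once and reused, so the share of annotations needing manual input should fall as more protocols are processed.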
AM: What benefits will the model bring to scientists? How will the project aid the successful adoption of AI in the life sciences?
VM: The main benefits are improved research quality and reproducibility, easier publication of research methods and easier selection and validation of assay protocols for future research. We also see potential to streamline regulatory submissions. Together, this means that the productivity of scientists will rise. This is essential as the cost of developing new drugs continues to rise.
AM: What are the next steps for the project?
VM: We have two main objectives for the next phase of the project. Firstly, to scale the annotation process up by 10- to 100-fold, moving from hundreds to tens of thousands of assay protocols. Secondly, to develop a standard based on our assay protocol data model, and to promote it in the industry and in the scientific community at large. We would like to involve reagent vendors, publishers and leading academics in this process and are keen to hear from them. In the long term, our work would enable greater data sharing between organizations and help scientists cope with the growing volume and complexity of data being generated.
Dr. Vladimir Makarov was speaking to Anna MacDonald, Science Writer for Technology Networks.