HIVE Collaboratory Leaders Report Progress on HuBMAP Infrastructure
HuBMAP Open SciTech Webinar, March 28, 2022
As the HuBMAP Consortium completes its fourth year of funding and shifts into its production phase, the leadership of the HuBMAP Infrastructure and Engagement (HIVE) Collaboratory gathered to provide an update on its progress. Their presentations marked the beginning of the annual Open SciTech Webinar, involving collaborators both within and outside HuBMAP.
Flexible Hybrid Cloud Infrastructure for Seamless Management of HuBMAP Resources Project
Phil Blood of the Pittsburgh Supercomputing Center, who with Jonathan Silverstein of the University of Pittsburgh is Co-Principal Investigator (PI) of the Flexible Hybrid Cloud Infrastructure for Seamless Management of HuBMAP Resources Project in HIVE, presented on the project’s development of a hybrid cloud/HPC microservices infrastructure for HuBMAP users. To date the group has published 1,167 HuBMAP datasets, including 784 primary datasets as well as derived sets, an increase of 136 between March 4 and March 25.
To enable scientists to maximally leverage HuBMAP data and “reduce time to science,” Blood said, the team has focused on building a flexible analysis ecosystem. By co-localizing data, tools and computational resources in a seamless way, the aim is to support highly customizable workspaces to bring together open resources and other tools with minimal upfront re-training. The key goals for the ecosystem will be to support diverse software, storage and compute resources in a way that is easily deployable at other sites. This will include a fully public cloud-based deployment in parallel with an on-premises deployment, as well as a hybrid deployment allowing users to leverage the strengths of each.
Silverstein presented the group’s progress with two new user tools. The HuBMAP Command Line Transfers Tool is designed to help researchers download data in bulk more quickly and effectively. Using the HuBMAP ID of a dataset and the directory or file desired from that dataset, the user will be able to create a manifest file whose lines specify hundreds of individual files or directories for transfer through Globus. The second tool he described, the Antibody Validation Report (AVR) Repository, provides a single application for members of the consortium to upload AVRs for antibodies used by HuBMAP members, including header and metadata. The tool will be offered open-access to the research community.
The ecosystem is currently under development; the team has successfully launched a JupyterLab session. The front-end portal for interacting with this service is being designed in collaboration with the Harvard Tools Component. Next steps will include deploying and integrating the service with HIVE resources.
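As a rough illustration of the manifest idea described above, the following Python sketch builds a manifest with one entry per line. It assumes each line simply pairs a HuBMAP dataset ID with a path inside that dataset, separated by a space; the exact format expected by the transfer tool is not given in the presentation, and the IDs, paths and the function name build_manifest are hypothetical.

```python
def build_manifest(entries):
    """Return manifest text with one '<dataset ID> <path>' line per entry.

    The line layout is an assumption made for illustration only.
    """
    lines = []
    for dataset_id, path in entries:
        dataset_id = dataset_id.strip()
        path = path.strip()
        if not dataset_id or not path:
            raise ValueError("each entry needs a dataset ID and a path")
        lines.append(f"{dataset_id} {path}")
    return "\n".join(lines) + "\n"


# Hypothetical dataset IDs and paths, not real HuBMAP records.
entries = [
    ("HBM000.AAAA.000", "/"),               # a whole dataset
    ("HBM000.BBBB.111", "/metadata.json"),  # a single file
    ("HBM000.CCCC.222", "/extras/"),        # one directory
]

manifest_text = build_manifest(entries)
print(manifest_text, end="")
# To save the manifest, write manifest_text to a file, for example with
# pathlib.Path("manifest.txt").write_text(manifest_text).
```

Generating the manifest from a list in this way makes it straightforward to scale from a handful of entries to the hundreds of files or directories mentioned in the talk.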
Human Reference Atlas
Katy Börner, PI of the HuBMAP Mapping Component-Indiana University (MC-IU), presented on the progress made by a collaborative team, including MC-IU, Stanford, Harvard and the European Molecular Biology Laboratory, with the Human Reference Atlas (HRA).
A central HuBMAP goal, the HRA defines the three-dimensional space and shape of biomedically important anatomical structures and cell types, including the biomarkers used to characterize them. A key element of the HRA is that it defines how new datasets can be mapped to it spatially or semantically, providing an open, continuously evolving resource with authoritative, computable data. To date the HRA has incorporated data from 16 HuBMAP consortia studying over 10,000 anatomical structures in 30 organs.
The group is currently collaborating on using Linked Open Data/Semantic Web Standards to ensure consistent ontology development and reasoning in the datasets. The next generation of this effort adds spatial data in support of spatial search and reasoning, tracks donor specimen information and captures links to experimental data and evidence as reflected in 295 unique publications to date.
HuBMAP HIVE Tools Component, Harvard Medical School
Nils Gehlenborg of Harvard Medical School, PI of the HuBMAP HIVE Tools Component there, updated attendees on the group’s progress with portal development. Currently live, the changes now allow for metadata exploration within the portal. The new tools will allow users to download data spreadsheets as well as use a number of visualization tools to analyze and understand the data in terms of donor age, sex, blood type, organ system, medical details, etc. Users will also be able to upload their own datasets to analyze with HuBMAP data.
A key idea, Gehlenborg said, is that users are able to create downloadable Jupyter notebooks that save the data and analysis for further work as well as sharing, either through the portal or the downloaded notebook. The download also pulls in metadata for the selected datasets, allowing full analysis in the downloaded notebook. In the near future, the collaborators will offer analysis within the portal as well. Thanks to the computing resources offered by HuBMAP, this will enable larger-scale analyses than are possible on a user’s own computer. Also soon to be available are notebook templates to aid in specific types of analysis, as well as a UI that improves display of data on smaller device screens.
Gehlenborg also provided an update on ongoing visualization tools developed by the Tools Component. He focused on Vitessce, developed by the group to provide users with modular visualization components that can run as a standalone Web application, be embedded in another Web application or be used as a Python or R package. The team is now developing a UI for workspace management, which may be released this quarter.
Other work in progress includes complete implementation of workspaces in the HuBMAP Data Portal, finalizing implementation of advanced metadata search and support for molecular and cellular queries. The latter function is now in production. Further-future releases may include AI-based visual exploration tools.
HIVE Tools Component, CMU
Matthew Ruffalo, Interim Contact PI of the HIVE Tools Component at Carnegie Mellon University, updated attendees on that component’s progress. The CMU Tools Component designs, develops and implements robust methods to analyze, process and model high-throughput single-cell omics and imaging data. To date the group has implemented pipelines for all major HuBMAP data types, including single-cell RNA sequencing (scRNA-seq) data, single-cell transposase-accessible chromatin sequencing (scATAC-seq) data and CO-Detection by indEXing (CODEX) data from multiple platforms or tissues, as well as downstream analysis of single-cell imaging data from multiple platforms. Leveraging their expertise in advanced computational methods including deep learning and graphical models, imaging analysis and combinatorial optimization, in this year the group has accomplished:
Azimuth: Reference-Based Mapping of the Human Body at Single-Cell Resolution
Rahul Satija of New York University, PI of the HIVE Comprehensive Reference Map Construction, Geolocation and Data Integration Mapping Component, presented his group’s progress with the Azimuth mapping tool. Using human genome mapping as a conceptual framework, the team is using Azimuth to map data to single-cell references, enabling integration, reuse and comparison of datasets across labs and consortia. By projecting multiple datasets from HuBMAP into a consistent ontology, the collaborators intend to improve user ability to interpret datasets in an open-source, fully automated, scalable and accessible way.
To date the group has mapped 170 million cells to 10,000 datasets in eight currently available reference organs. An additional four organs are in beta testing. Azimuth has been implemented as part of the HuBMAP RNA data ingestion pipelines. The tool offers fully automated and computationally efficient mapping, improving users’ ability to annotate rare subpopulations and encouraging data standardization and comparison.
Challenges that the group intends to address include the modality limitations of Azimuth, which currently supports only scRNA-seq data. The team will place a high priority on expanding their tools to other HuBMAP imaging modalities. A promising avenue to accomplishing this, Satija said, is Bridge Integration via Dictionary Learning. Because different modalities often measure different biological properties, a direct mapping of cell types between two modalities (Satija used the example of scRNA versus scATAC) is difficult. Their Bridge Integration uses the small subset of cell types mapped in both scRNA and scATAC as a bridge, allowing the known scRNA reference to be mapped onto the scATAC query data in a way that allows unsupervised analysis that would previously have been impossible.
The group’s data suggest that Bridge Integration is possible with as few as 50 cells in the bridging dataset, opening up the method to a potentially large number of modalities. Bridge Integration has demonstrated clear feasibility on roughly 1 million cells, with the potential to scale beyond 100 million. The researchers hope that the method will enable community-wide integrative analysis of all available mapping data.
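The bridging idea can be illustrated with a toy Python sketch. This is not the group’s dictionary-learning method: it swaps in a plain nearest-neighbor lookup to show how cells measured in both modalities can connect an scATAC query to an scRNA reference. All profiles, labels and function names here are hypothetical.

```python
import math


def nearest(query, candidates):
    """Index of the candidate vector closest to query (Euclidean distance)."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(query, candidates[i]))


def bridge_label(atac_query, bridge_atac, bridge_rna, ref_rna, ref_labels):
    """Label an ATAC-only cell by hopping through a bridge cell.

    Step 1: find the bridge cell whose ATAC profile is closest to the query.
    Step 2: take that same bridge cell's RNA profile and find the closest
            reference cell, then return the reference cell's label.
    """
    b = nearest(atac_query, bridge_atac)
    r = nearest(bridge_rna[b], ref_rna)
    return ref_labels[r]


# Hypothetical toy data: two annotated reference cells (RNA only).
ref_rna = [[5.0, 0.1], [0.2, 4.8]]
ref_labels = ["type_A", "type_B"]

# Two bridge cells, each measured in BOTH modalities (same index = same cell).
bridge_rna = [[4.6, 0.3], [0.4, 5.1]]
bridge_atac = [[1.0, 0.0, 0.2], [0.1, 0.9, 1.0]]

# Two query cells measured with ATAC only.
atac_queries = [[0.9, 0.1, 0.3], [0.0, 1.0, 0.8]]

labels = [bridge_label(q, bridge_atac, bridge_rna, ref_rna, ref_labels)
          for q in atac_queries]
print(labels)
```

The point of the sketch is only that the query and the reference never need to share features: the bridge cells, which carry both kinds of measurement, provide the translation between them.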