The Hero group focuses on building foundational theory and methodology for data science and engineering. Data science is the methodological underpinning for data collection, data management, data analysis, and data visualization. Lying at the intersection of mathematics, statistics, computer science, information science, and engineering, data science has a wide range of applications in areas including: public health and personalized medicine, brain sciences, environmental and earth sciences, astronomy, materials science, genomics and proteomics, computational social science, business analytics, computational finance, information forensics, and national defense. The Hero group is developing theory and algorithms for data collection, analysis, and visualization that use statistical machine learning and distributed optimization. These are being applied to network data analysis, personalized health, multi-modality information fusion, data-driven physical simulation, materials science, dynamic social media, and database indexing and retrieval. Several thrusts are being pursued:

- Development of tools to extract useful information from high dimensional datasets with many variables and few samples (large p, small n). A major focus here is on the mathematics of “big data” that can establish fundamental limits, helping data analysts to “right size” their sample for reliable extraction of information. Areas of interest include: correlation mining in high dimension, i.e., inference of correlations between the behaviors of multiple agents from limited statistical samples, and dimensionality reduction, i.e., finding low dimensional projections of the data that preserve the information relevant to the analyst.
- Data representation, analysis, and fusion on non-linear, non-Euclidean structures. Examples of such data include: data that come in the form of a probability distribution or histogram (lies on a hypersphere with the Hellinger metric); data that are defined on graphs or networks (combinatorial non-commutative structures); data on spheres with point symmetry group structure, e.g., quaternion representations of orientation or pose.
- Resource constrained information-driven adaptive data collection. We are interested in sequential data collection strategies that utilize feedback to successively select among a number of available data sources in such a way as to minimize energy, maximize information gains, or minimize delay to decision. A principal objective has been to develop good proxies for the reward or risk associated with collecting data for a particular task (detection, estimation, classification, tracking). We are developing strategies for model-free empirical estimation of surrogate measures including Fisher information, Rényi entropy, mutual information, and Kullback-Leibler divergence. In addition, we are quantifying the loss of plan-ahead sensing performance due to the use of such proxies.
- Geometric embedding of combinatorial optimization. One of the major roadblocks to making scientific progress in solving grand challenge problems is the curse of dimensionality. This problem is especially acute in combinatorial optimization, where the behavior of the objective function under permutations and combinations has no obvious geometric structure. Remarkably, smooth geometric structure emerges as one allows the domain dimension to grow in many Euclidean combinatorial optimization problems, including shortest path through a similarity graph and multiobjective pattern matching. This geometric embedding can lead to approximate solution of the combinatorial problem via solution of a simpler variational continuous optimization problem. Further progress in this field could lead to general combinatorial solvers that utilize the considerable machinery available in scientific computing, e.g., general ordinary differential equation (ODE) and partial differential equation (PDE) solvers. Grand challenge problems that could benefit from this research include: monitoring pandemics (path analysis on epidemic proximity graphs); energy and transportation (optimal routing); and adaptive drug design (computing Pareto frontiers), to name just a few.
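
As a small illustration of the non-Euclidean representation mentioned above, the following sketch shows why histogram data can be treated as points on a hypersphere: the element-wise square root of a normalized histogram has unit Euclidean norm, and the Hellinger distance is a scaled Euclidean distance between such points. The function names `sqrt_embedding` and `hellinger_distance` are hypothetical and chosen only for this example; this is a minimal sketch, not the group's implementation.

```python
import math


def sqrt_embedding(hist):
    """Normalize a non-negative histogram and take element-wise square roots.

    The result has unit Euclidean norm, i.e., it is a point on the unit hypersphere.
    (Hypothetical helper name, used only for this illustration.)
    """
    if any(h < 0 for h in hist):
        raise ValueError("histogram entries must be non-negative")
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [math.sqrt(h / total) for h in hist]


def hellinger_distance(p, q):
    """Hellinger distance between two histograms defined on the same bins.

    Computed as the Euclidean distance between the square-root embeddings,
    divided by sqrt(2), so that the value lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    u, v = sqrt_embedding(p), sqrt_embedding(q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))) / math.sqrt(2)


# Usage: the embedded histogram lies on the unit sphere.
u = sqrt_embedding([2, 3, 5])
norm_u = math.sqrt(sum(a * a for a in u))

# Histograms with disjoint support are at the maximum distance;
# histograms that differ only by scale are at distance zero.
d_disjoint = hellinger_distance([1, 0], [0, 1])
d_same = hellinger_distance([1, 2, 3], [2, 4, 6])
```

Because the embedding is an ordinary unit vector, standard geometric tools on the sphere (for example, angles between points) can then be applied to collections of histograms.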

These areas arise in the context of several sponsored projects in the Hero lab including the following:

- Mathematical approaches to representing spatio-temporal data, including astronomical data, network data, biomedical diagnostics and predictive health (M-Cubed, DARPA-PHD, NSF, NIH-P01). We are developing methods for high throughput analysis of biomarker data. One project, funded by DARPA (ended in 2014), aimed to predict health and disease propagation (epidemics) over a close-knit human population based on a combination of genetic, metabolic, and social network data. Another project, funded by NIH, is developing image registration methods that are capable of compensating for patient motion and multimodality distortions. Another project, funded by NSF, is developing machine learning methods that can handle data that come in the form of distributions. An example is flow cytometry data, where each cell in a blood sample is assayed and assigned a multidimensional label, including antibody, protein binding, and morphology labels. Another example, funded by M-Cubed in collaboration with faculty in the Depts of Astronomy and Physics, is astronomical data where the measurements are spatial point process realizations (stars) or spatial and spatio-temporal image patch realizations (solar metrology). In each of these areas we are developing approaches based on high dimensional data analysis, adaptive sampling (when to take a measurement or assay and where to take it from), large scale statistical inference, and multimodality data/information integration. Another project (DOE) is funding us to develop analysis tools for nuclear non-proliferation treaty verification using on-site and remote data collection strategies to monitor declared facilities and detect undeclared facilities.
- Subspace processing for imaging and information fusion (NIH-P01, ARO-MURI, AFRL-UES, AFRL-ATR). Subspace models are models that are sparse in a basis spanning a low dimensional subspace of the data. Such models allow fusion of multi-modality data without overfitting and accomplish denoising of high dimensional datasets. Several methods have been pursued here, including dictionary learning, non-negative factor analysis, and measure-transformation generalizations of PCA, ICA, and CCA that allow non-linear components to be captured in the original coordinates (unlike kernel versions of PCA, ICA, and CCA). These methods have been applied to different projects including spatio-temporal gene expression analysis, predicting health and disease, materials science, video imaging, radar imaging, and social networks.
- Network measurements and analytics (ARO-Social, NSF, AFOSR). We are developing distributed models and methods for analysis of high dimensional spatio-temporal network data. One focus, previously funded by NSF, is on Internet data analysis, including flows (TCP, UDP, etc.), application level (email, HTTP), and transport (end-to-end delays, packet losses), to detect anomalies and reconstruct the topology of the network. Another focus, funded by ARO, is on emergent behavior analysis on social networks, where we are developing methods to aggregate different layers of social network information, including behavioral and relational edge attributes. Under an AFOSR grant, we are considering the problem of reliably estimating structural properties of graphical models in sample-starved network data collection regimes. These areas are being pursued in the context of applications such as network tomography, topology estimation for distributed activity detection, target tracking, and event classification. Modalities that have been investigated are: Internet traffic data, email data, fMRI brain activation, 12-lead EKG monitoring and diagnostics, and gene regulation networks.
- Database indexing and retrieval (ARO-Databases, ended in 2014). In this ARO funded project, the objective is to develop methods based on sparsity and dimensionality reduction for searching large multimedia databases of images and videos. This involves the development of scalable methods of feature selection, similarity matching, and spatio-temporal modeling that can improve precision and recall performance. See webpage (.html) for a short bibliography on this topic. Current areas of focus are event detection and correlation in videos, pose estimation and 3D shape retrieval, multimodality retrieval using information theoretic measures, and multiple criteria image search. Application areas that we are currently considering include automated recommendation systems and human-in-the-loop indexing and retrieval.
- Value-centered information-driven sensing (ARO-MURI). Under this ARO funded project we are investigating ways to improve the value of information delivered to the end-user in sensor networks that are limited by finite bandwidth, temporal dynamics, latency, and other factors. We are developing new models that account for these network limitations and that come with performance guarantees, so that more timely and relevant information is delivered to the operator of the sensor network. This project involves sensor modalities including radar, sonar, video, LIDAR, and atmospheric sensors. One of our main foci is developing an information theory that accounts for a human in the loop. One of the principal applications being considered is centralized and decentralized cooperative target search, in which a pair of human and machine sensors is queried to localize an image, classify a scene, or estimate the position of a weak target.
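
For readers unfamiliar with the precision and recall measures mentioned under database indexing and retrieval, the sketch below computes both for a single query. The function name `precision_recall`, the item identifiers, and the convention of returning 0.0 when a set is empty are assumptions made for this example only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    Assumed convention: a measure is reported as 0.0 when its denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Usage with made-up item identifiers: 2 of the 4 retrieved items are relevant,
# and 2 of the 3 relevant items were found.
precision, recall = precision_recall(
    retrieved=["img1", "img2", "img3", "img4"],
    relevant=["img2", "img4", "img7"],
)
```

In this toy query the precision is 2/4 and the recall is 2/3; a retrieval method improves on both when it returns fewer irrelevant items while missing fewer relevant ones.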

### Patents

- G. Ginsberg, J. Lucas, C. Woods, L. Carin, A. Zaas, and A. Hero, “Methods of identifying infectious disease and assays for identifying infectious disease,” US Patent 8,821,876. Filed May 22 2010. Issued Sept 2 2014.
- A.O. Hero, K. Carter, R. Raich, and W. Finn, “Method and apparatus for clustering and visualization of multicolor cytometry data,” US Patent 7,853,432. Filed Oct 7 2007. Issued Dec 14 2010.
- A. Hero, H. Neemuchwala, and P. Carson, “Method for determining alignment of images in high dimensional feature space,” US Patent 7,653,264. Filed Mar 4 2005. Issued Jan 26 2010.
- W.J. Williams, E.J. Zalubas, J.C. O’Neill, R.M. Nickel, and A. Hero, “Method and system for extracting features in a pattern recognition system,” US Patent 6,178,261. Filed Aug 5 1997. Issued Jan 23 2001.