Utilizing mathematical approaches to analyze large-scale data

Whether measuring stars on the other side of the universe, the fitness and health status of individuals, or the behavior of users on social network or Internet commerce sites, data generation and analysis are the modern version of the microscope or telescope, as they permit us to obtain a remarkably fine way to view the physical, biological, or social world. To deliver on that promise, however, we need not only scalable algorithms to analyze these data, but also novel statistical methods to control inference properties, e.g., false positive behavior, of these scalable algorithms. Developing unified mathematical approaches, general enough to work across domains but specific enough to provide useful qualitative guidance in particular applications, promises to have impact in these and other areas. Dr. Michael Mahoney, Associate Adjunct Professor at University of California, Berkeley and Senior Research Scientist of International Computer Science Institute, focuses on the algorithmic and statistical foundations of modern large-scale data analysis. In addition to a healthy dose of mathematics, from theoretical computer science and statistics to more traditional applied mathematics and numerical analysis, this involves implementations of matrix and graph algorithms, as well as the application and use of these methods to many different areas such as Internet data analysis and social network analysis.

Traditional academic research happens with most professors who are trained in one area and work exclusively in that area. Due to his eclectic interests and broad set of experiences, however, spanning a range of industries as well as multiple diverse academic areas and application domains, Dr. Mahoney performs the truly interdisciplinary research that is needed to create the algorithmic and statistical foundations of modern large-scale data analysis. His research has been described as neither computer science, nor statistics, nor applied mathematics, nor any particular application domain, but instead where all those areas are trying to go. For example, Dr. Mahoney’s work on scalable randomized linear algebra algorithms can help analyze DNA models in population genetics to determine the histories of human migration and in medicine to make predictions for personalized medicine. In Astronomy, Dr. Mahoney’s work in implementing and applying scalable matrix and graph algorithms is useful in developing better tools for performing computations for some of the largest non-Internet data like hyperspectral imagery from telescopes. Such a project also finds its use in medical imaging and cloud analytics, as it seeks to enable individual scientists to perform state-of-the-art analytics on large corpora of mass spectrometry images stored in the cloud. Not to mention, graphs and networks are a common way to model data such as friendship links on a social network. Dr. Mahoney’s recent work focuses on characterizing the spread of information in a much more refined way, and he has been a leader in developing principled tools to characterize the structural properties of large social and information networks.

Dr. Michael Mahoney and his team have three major areas of focus:

  • Randomized Linear Algebra: Linear algebra underlies much of statistical data analysis, and this work involves using random sampling and random projection methods to solve very large matrix-based problems, both in worst-case theory, in RAM computations, and in parallel/distributed environments. Whether they are feature vectors or graphs, matrices have been a common way to model data that is more sophisticated than flat table models. While there is a lot of work historically on matrix computations motivated by numerical simulations, the problems that are encountered in those applications are very different from the problems that are encountered in life sciences, physical sciences, as well as Internet and other large-scale data applications. Randomized linear algebra algorithms are useful in developing better algorithms for fundamental matrix problems, in many of these large-scale machine learning and data analysis applications.
  • Geometric network analysis: This involves using scalable approximation algorithms with a geometric or statistical flavor to analyze the structure and dynamics of large informatics graphs, e.g., large social and information networks. This can be used, for example, to determine whether there exist good clusters and, if they do exist, whether they can be used to obtain better algorithms for clustering, recommendations, classification, and prediction. A canonical example of its usage is social networks, where different individuals have different agendas. On one hand, you may have social marketers trying to encourage viral propagation on social network applications; on the other hand, a huge amount of people make recommendations based on these social graphs, or use social graphs to control virus propagation. A lot of this boils down to structural properties of a graph, as graphs have a number of properties that can characterize small-scale structures, large-scale structures, and intermediate-scale structures. By studying how the various local properties of the graph interact with global properties, Dr. Mahoney and his team will be able to get more precise statistical theory that can capture the properties of broad classes of data.
  • Implicit regularization and implicit optimization methods: This involves characterizing implicit regularization properties underlying scalable approximation algorithms applied to real data. It is an effort to bridge the gap between the "oil and water" approaches that computer science and statistics bring to their analysis of data. The goal of these methods is to characterize what computations people actually perform and what people are optimizing.


Dr. Michael Mahoney has always been interested in scientific and mathematical questions and in asking questions than other people don’t ask. He had the good fortune of having a summer internship at the National Institutes of Health as an undergraduate, which led him to focus his pre-dissertation and dissertation research on experimental/computation genetics as well as computational biophysics and statistical mechanics. The centerpiece of Dr. Mahoney’s dissertation involved the development of TIP5P, an interaction site model for liquid water that was as good as the best previously-existing models but that quantitatively reproduced the most unusual property of liquid water, namely its anomalous density as a function of temperature. This involved an enormous amount of Monte Carlo and molecular dynamics computation, as is common in computational science and scientific computing. One of the things that struck him was that in each of these areas there was a growth phase, where technological developments in measurement capabilities and an inconsistency with previous theory meant that a new field needed to be created. Upon completing his dissertation in 2000, Dr. Mahoney decided that the next large growth area was going to be in the large-scale data analysis space. He then switched areas to work in theoretical computer science and the theory of algorithms, and migrated through statistical data analysis and Internet analytics at Yale, Yahoo, and Stanford to work on the theory and applications of modern large-scale data analysis at ICSI and University of California, Berkeley.

Other than mathematics and science, Dr. Mahoney particularly enjoyed history while growing up. Another fond memory of growing up was his summer activities. During the summer, he would often go out with his cousins on boats and observe the ocean and the fish; and he used to go scuba diving, although he does not get to that as much as when he was younger. Motivated by curiosity, Dr. Mahoney likes to go places not many have gone, and desires to constantly explore new possibilities that can pave the road for profound breakthroughs.

For more information, visit http://www.stat.berkeley.edu/~mmahoney/


Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments


Think Locally, Act Locally: The Detection of Small, Medium-Sized, and Large Communities in Large Networks


Randomized Algorithms for Matrices and Data


CUR Matrix Decompositions for Improved Data Analysis


Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters