Creating tools for large-scale data analysis and social discovery

We know that big data is already changing our world, from health to storage to communications, but what is needed to accelerate discovery?  Dr. John Canny's group at the University of California, Berkeley is approaching the problem by developing a new way for computers to process and analyze big data.

His team is developing a new tool called BIDMach, which applies machine learning to very large datasets.  BIDMach is unique in several ways and often outperforms other available tools by several orders of magnitude.  The tool can scale to thousands of nodes with 'near-linear' speedup. For math-oriented users it is intuitive, like writing math as code; for non-math users it is interactive and works at a high level, with visualization tools that make the 'modeling process' itself interactive and help develop insights and intuition about the data. Finally, it focuses on single-machine acceleration first and cluster scale-up second. This approach is important because it departs from the current assumption that cluster computing is required for big data problems, and focuses instead on single-machine performance first, leveraging both CPU and GPU hardware. Evidence has shown that the gains from this single-machine approach are typically larger than the gains from clustering.

 

  • Big data analysis has become a key part of our world, from business to science to health. We need better tools to help us understand the data and gain insights that can be put to use. These tools will enable new discoveries in cases where the computation required for big data is itself a barrier to discovery. BIDMach allows researchers to do with one machine what would have required hundreds of machines in the recent past. This will lead to quicker analysis and insights, and will accelerate discovery. It also means that projects with limited funds and a need for data analysis can succeed where they could not before because of the cost of computing resources.
     
  • Potential areas of growth may be seen in information, services and tools for health empowerment and outcomes. By using technology to collect and analyze large amounts of data, we can start to identify distinct areas of immediate need, provide tools for potential improvement, and use the data to demonstrate clear results.
     
  • An immediate benefit of better big data performance is a significant reduction in the energy that big data analysis typically requires. BIDMach indicates a reduction of two to three orders of magnitude in energy use for large data analysis problems. As cloud computing currently consumes about 2% of the US energy budget, advances in the speed and efficiency of data analysis will yield significant energy savings.

Funding will allow Dr. Canny and his team to explore many directions that they believe will have a very large impact, but on which they have been unable to make progress because of limited resources. These include:

1. New, hardware-accelerated tools for genetic analysis. Dr. Canny and his team have built proof-of-concept code for tasks such as genetic sequencing with a two-order-of-magnitude improvement in performance, but have been unable to see this through to a full system.

2. Scaling deep neural network (DNN) training to billions of images. They have proof-of-concept benchmarks indicating very large speedups, but integrating with a DNN toolkit is a major task.

3. Natural language tools. Dr. Canny and his team recently developed the fastest natural language parser, and there are many other opportunities for hardware-leveraged NLP. In particular, integrating DNNs and recurrent neural networks into BIDMach would build on this work.

Dr. John Canny, a former researcher in computer vision and robotics, applies his expertise to social data science in behavior change, health care, education and computational science.  His 'data mining activities' include work for MarkLogic, BigTribe, Yahoo, eBay and Quantcast, with almost 40% of his time spent in industry-related endeavors since 2003.

 

His ideas about the behavioral data landscape have shaped the backbone of BIDMach, starting from the concept that "behavioral data is big but not outrageously so, and can fit in a small box."
 

Big data analysis requires processing large amounts of data, but there is also an art to discovering structure and value in it, and this is where innovation happens.  Dr. John Canny's work continues to shape big data analysis by giving researchers the ability to visualize and analyze raw data, quickly test hypotheses with sketch models, test against samples of a big dataset, migrate easily to large scale, and get performance data on a scaled-up model.  This gives researchers the agility to ask many questions and test many hypotheses.

The BIDMach design ideas include:

  • Rooflining: designing to approach the limits of the hardware
  • Modular: minibatch architecture for more rapid model optimization
  • New Communication Primitives: cluster scale-up needed to support single-node speeds
  • New memory management strategies: avoid conventional memory allocation entirely
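The rooflining idea in the list above can be illustrated with a small calculation. This is a minimal sketch of the standard roofline model, not BIDMach code: the function names and the hardware figures (peak compute, memory bandwidth, kernel intensities) are hypothetical values chosen only for illustration. The model says a kernel cannot run faster than the lower of two ceilings, the machine's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity (floating-point operations per byte moved).

```python
def attainable_gflops(peak_gflops, mem_bandwidth_gbs, arithmetic_intensity):
    """Roofline bound in GFLOP/s: the lower of the compute ceiling and
    the memory ceiling (bandwidth * flops-per-byte)."""
    return min(peak_gflops, mem_bandwidth_gbs * arithmetic_intensity)


def roofline_efficiency(measured_gflops, peak_gflops, mem_bandwidth_gbs,
                        arithmetic_intensity):
    """Fraction of the roofline bound that a measured kernel achieves."""
    bound = attainable_gflops(peak_gflops, mem_bandwidth_gbs,
                              arithmetic_intensity)
    return measured_gflops / bound


# Hypothetical machine: these numbers are assumptions, not measurements.
PEAK_GFLOPS = 1000.0     # compute ceiling
BANDWIDTH_GBS = 200.0    # memory bandwidth ceiling

# A sparse kernel with low arithmetic intensity is memory-bound.
sparse_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 0.25)

# A dense kernel with high arithmetic intensity is compute-bound.
dense_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 20.0)

# How close a (hypothetical) measured sparse kernel is to its limit.
sparse_eff = roofline_efficiency(40.0, PEAK_GFLOPS, BANDWIDTH_GBS, 0.25)

print(sparse_bound, dense_bound, sparse_eff)
```

Under these assumed numbers the sparse kernel is limited by memory bandwidth rather than by compute, so designing toward the roofline means measuring each kernel against the ceiling that actually applies to it.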

Canny edge detector: an operator that uses a multi-stage algorithm to detect a wide range of edges in images

Machtey Award, 1987