Understanding and visualizing large text databases with fast machine learning

In an age of information overload, we depend on search engines like Google to scour billions of text documents and websites for the one video, document, or picture that suits our needs. Despite the power of search engines like Google and Yahoo, our queries often return cluttered results that are difficult to browse and frequently irrelevant. Dr. Laurent El Ghaoui of the University of California, Berkeley focuses on developing new and efficient tools, based on sparse optimization and machine learning, that lessen information overload by providing more sophisticated search results.

 

Dr. El Ghaoui's research tackles information excess and data discovery. The platform StatNews.org provides a service to researchers from the social sciences and humanities, allowing them to obtain a summarized image of a given topic from a large database of news, like that of the BBC, New York Times or Wall Street Journal. Users can view information in an interactive graph that shows key terms pertaining to the topic and their change in prevalence over time. StatNews is able to detect trends in language and media coverage of news, and to develop historical perspectives. It is also able to uncover how countries, diseases or regions are portrayed in international news. The technology is extremely scalable, as it can summarize tens of thousands of documents in a few seconds on a small laptop computer, and can be applied to any text, in any language. It uses a sophisticated algorithm that shows the context in which a query word like "Obama" is mentioned in the news and gives users a summarized version of how the topic has evolved over time. Try it here: http://statnews.org/
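To give a rough feel for what "key terms and their change in prevalence over time" can mean, here is a minimal Python sketch. It is not the StatNews algorithm, which relies on sparse optimization; it swaps in plain co-occurrence counting as a stand-in. The function name `term_prevalence_by_year`, the stopword list, and the sample headlines are all hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict
import re

# Tiny illustrative stopword list (an assumption, not from StatNews).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def term_prevalence_by_year(documents, query, top_k=3):
    """For each year, rank the terms that co-occur with `query`.

    documents: iterable of (year, text) pairs.
    Returns {year: [(term, share), ...]}, where share is the fraction of that
    year's query-matching documents that also contain the term.
    """
    query = query.lower()
    counts = defaultdict(Counter)  # year -> term -> number of matching docs
    matched = Counter()            # year -> number of docs containing the query
    for year, text in documents:
        tokens = set(tokenize(text))
        if query not in tokens:
            continue
        matched[year] += 1
        counts[year].update(tokens - {query})
    result = {}
    for year in sorted(counts):
        # Most frequent first; ties broken alphabetically for stable output.
        ranked = sorted(counts[year].items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        result[year] = [(term, n / matched[year]) for term, n in ranked]
    return result


# Hypothetical headlines, made up for this example.
docs = [
    (2008, "Obama wins the election"),
    (2008, "Obama campaign election rally"),
    (2009, "Obama signs health bill"),
    (2009, "Markets fall on bank fears"),
]
prevalence = term_prevalence_by_year(docs, "Obama", top_k=2)
for year, terms in prevalence.items():
    print(year, terms)
```

Simple counting like this tends to surface generic words on real corpora. A sparse method would instead select a small set of terms that best distinguish documents mentioning the query from those that do not.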

 

 

  • Dr. El Ghaoui hopes to make text database navigation easier, and to give the user the ability to dig deeper and find meaning faster. At the same time, the project attempts to tame the news overload problem and give users a longer-term perspective.
  • The platform focuses on news corpora and has recently attracted the interest of the British Broadcasting Corporation, which has provided a large database of its news articles.
     
  • The technology is extremely scalable (similar to Google's PageRank), as it can summarize tens of thousands of documents in a few seconds on a small laptop computer, and can be applied to any text, in any language.

 

 

Dr. El Ghaoui's research is developing illustrative applications of the technology in close collaboration with researchers from social sciences, medicine and public health. Current collaborations and projects include:

  • The Tobacco Litigation Library at the University of California, San Francisco (UCSF) provides a vast trove of internal documents (email, letters) emanating from "Big Tobacco" corporations. We are beginning a project to allow researchers across the world to mine the database more efficiently; our initial study involves the topic of electronic cigarettes.
     
  • Dr. Sarkar, from UCSF's Center for Vulnerable Populations, plans to study how diseases are portrayed in the news and to offer guidelines to the Centers for Disease Control and Prevention on how best to explain health issues (such as cervical cancer prevention) to the population.

Bio

Dr. El Ghaoui graduated from École Polytechnique (Palaiseau, France) in 1985, and obtained his Ph.D. in Aeronautics and Astronautics at Stanford University in March 1990. He was a faculty member at the École Nationale Supérieure de Techniques Avancées (Paris, France) from 1992 until 1999 and held part-time teaching appointments at École Polytechnique in the Applied Mathematics Department and at Université de Paris-I (La Sorbonne) in the Mathematics in Economy Program.
 

In 1998, he was awarded the Bronze Medal for Engineering Sciences from the Centre National de la Recherche Scientifique, France. Dr. El Ghaoui joined the University of California, Berkeley faculty in April 1999 as an Acting Associate Professor, and obtained tenure in May 2001. He was on leave from July 2003 to 2006 to work for SAC Capital Management, a hedge fund based in New York and Connecticut.
 

Dr. El Ghaoui is a full Professor with appointments in the departments of Electrical Engineering and Computer Sciences and Industrial Engineering and Operations Research. He also teaches at the Haas School of Business in the Master of Financial Engineering program.

Awards

Co-recipient of a SIAM Optimization Prize (2008)
