Learning to detect malicious activity on the Internet

Over 2.4 billion people access the Internet around the globe, and tens of millions use popular web services such as Google, Facebook, and Twitter on a daily basis. While the Internet continues to integrate with our everyday lives, few truly understand the perils that lurk only one click away. The Internet is a highly ambiguous environment, and many users are unable to distinguish a scam from a legitimate email, web page, or web service. To help address this problem, Dr. Saul's research at the University of California, San Diego, sheds light on the malicious and criminal activity that covertly transpires on the web, and it helps reveal and eliminate the underground "malicious actors" who exploit unsuspecting users in a multitude of ways.

  • Dr. Saul has explored how intelligent algorithms can automatically identify malicious content on the Web.  For example, he has built systems that identify malicious web pages based on clues from their textual content, structural tags, page links, visual appearance, and URLs.  
  • He has also studied the listings for abuse-related jobs that appear on freelancing Web sites such as Mechanical Turk and Freelancer.com. These jobs are outsourced to low-cost laborers who work for cents on the dollar to send spam or engage in other deceptive, unlawful practices.
  • In recent work, he has developed an automatic system for the large-scale monitoring of online storefronts for spam-advertised goods. The system was developed from an extensive crawl of black-market web sites that deal in illegal pharmaceuticals, replica luxury goods, and counterfeit software. The operational goal of the system is to identify the affiliate programs of online merchants behind these web sites; the system itself is part of a larger effort to improve the tracking and targeting of these affiliate programs.
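
As a rough illustration of the first item above, the idea of learning from URL clues can be sketched in a few lines of Python. This is not Dr. Saul's system: the feature names, the toy URLs and their labels, and the choice of a simple perceptron as the online learner are all assumptions made for the example.

```python
import re

def url_features(url):
    """Turn a URL into a sparse feature dict (hypothetical lexical features)."""
    feats = {"bias": 1.0}
    # One binary feature per alphanumeric token in the URL.
    for token in re.split(r"[^a-z0-9]+", url.lower()):
        if token:
            feats["tok=" + token] = 1.0
    # Two simple scaled counts: number of dots and overall length.
    feats["num_dots"] = url.count(".") / 10.0
    feats["length"] = len(url) / 100.0
    return feats

class OnlinePerceptron:
    """Linear classifier updated one example at a time (+1 malicious, -1 benign)."""

    def __init__(self):
        self.w = {}

    def score(self, feats):
        return sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def predict(self, url):
        return 1 if self.score(url_features(url)) > 0 else -1

    def update(self, url, label):
        """Adjust weights only when the current model errs; return True on a mistake."""
        feats = url_features(url)
        if label * self.score(feats) <= 0:
            for k, v in feats.items():
                self.w[k] = self.w.get(k, 0.0) + label * v
            return True
        return False

def train(model, examples, max_epochs=20):
    """Pass over the examples until an epoch produces no mistakes."""
    for _ in range(max_epochs):
        mistakes = sum(model.update(url, y) for url, y in examples)
        if mistakes == 0:
            break
    return model

# Made-up training data for illustration only (not real measurements).
TOY_EXAMPLES = [
    ("http://paypal.account-verify.example.net/login.php?id=123", 1),
    ("http://www.example.org/about", -1),
    ("http://free-prizes.example.net/claim/now", 1),
    ("http://docs.example.org/tutorial/intro", -1),
]

model = train(OnlinePerceptron(), TOY_EXAMPLES)
predictions = [model.predict(url) for url, _ in TOY_EXAMPLES]
```

Because the model is updated one URL at a time, the same loop could in principle run over a continuous stream of newly labeled URLs. A real system would draw on far richer clues, such as the page content, links, and visual appearance mentioned above.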

Threats to computer security affect everyone on the Internet. Dr. Saul's work has shown that advances in machine learning and pattern recognition provide the tools for faster detection and assessment of these threats. "Malicious actors" install harmful files, extract confidential information, and profit by unlawful means, all usually beginning with a simple click. Help Dr. Saul study these illicit activities so that we can develop better strategies to overcome them.


Lawrence Saul, Ph.D., is a Professor of Computer Science and Engineering at the Jacobs School of Engineering at the University of California, San Diego. His current research focuses on applications of machine learning to problems in computer systems and security. He also works in the areas of high-dimensional data analysis, probabilistic graphical modeling, kernel methods, distance metric learning, and speech and audio processing.

Dr. Saul earned his B.A. in physics from Harvard and his Ph.D. in physics from MIT in 1994. In 1999, after working three years in the speech center at AT&T Labs, he was recognized by the MIT-based journal Technology Review as one of 100 top innovators under the age of thirty-five. A recent bibliographic analysis by Essential Science Indicators indicated that his work had entered the top 1% of total citations earned in the field of Computer Science.

Dr. Saul was a founding member of the Editorial Board for the Journal of Machine Learning Research, and he served as Editor-In-Chief of the journal from 2008 to 2012.

Dr. Saul is currently a member of the Artificial Intelligence Group in the Department of Computer Science and Engineering at the University of California, San Diego. He joined the department in 2006. Before that, he was a member of the faculty in the Department of Computer and Information Science at the University of Pennsylvania.

Dr. Saul is best known for developing powerful new algorithms for revealing low-dimensional structure in high-dimensional data. His work on high-dimensional data analysis has had applications in many areas of science and engineering.


A Gaussian latent variable model for large margin classification of labeled and unlabeled data


A variational approximation for topic modeling of hierarchical corpora


Latent coincidence analysis: a hidden variable model for distance metric learning


Latent variable models for predicting file dependencies in large-scale software development


Distance Metric Learning for Large Margin Nearest Neighbor Classification


Identifying Suspicious URLs: An Application of Large-Scale Online Learning