Using computer algorithms to understand and process natural languages

Dr. Noah Smith's fascination with computers began at an early age, but his first attempt to write code dealing with language stemmed from his seventh-grade homework. Intrigued by the regularities and irregularities of French, Smith wrote a program that would conjugate the verbs for him. This early excitement about computer code, and about what makes human languages easy or difficult to learn, has grown into a successful career that merges Linguistics and Computer Science. Early research experiences as an undergraduate revealed the applications of this connection and gave Smith hands-on opportunities to explore and to think like a scientist.

Today, Smith's research program at Carnegie Mellon University develops algorithms that process language data to extract information and make useful inferences. Vast amounts of text are produced every day as a by-product of every human pursuit: science, finance, government, and social communication. The ability of computers to "read" this data and offer human-interpretable analysis of what is going on in the world depends critically on natural language processing.

  • Though we all use natural language every day, the seemingly simple task of decoding a piece of text to find the meaning its author intended is a challenging one. Smith approaches this problem from several angles: parsing sentences into simpler syntactic representations (not unlike the subject/predicate sentence diagramming many of us were required to do by hand in grade school; a minimal illustration follows this list), interpreting those sentences according to various semantic theories, and identifying the choices an author has made in framing the matters being discussed. This research builds on theoretical foundations in Computer Science, Linguistics, and Statistics, marrying theories of computation and language with a data-driven mentality.

  • Adding to the inherent challenges of natural language is the fact that the data are changing fast. More and more of our communication happens on the Internet through email, blogs, and social media sites. The rise of platforms like Twitter and Facebook offers an unprecedented view of how a huge part of the population really uses language: unedited, with unconventional spellings, infusions of dialect, colloquialisms, and blends of different languages and styles. Several of the algorithms Smith has contributed in the past few years apply robust machine learning to very large datasets and are specially designed to cope with the challenges of casual communication on the Internet. In addition to "wrangling" this wild data, Smith has performed statistical analyses to understand how language varies geographically and changes over time. A recent study revealed the importance of demographics, in addition to geographic closeness and city size, in the spread of new words between population centers in the United States.

  • In developing text processing algorithms, Smith works closely with collaborators in the social sciences. One example is a collaboration with political scientists to automate the measurement of ideological content of political speeches. The team found that presidential candidates' language in speeches shows clear movement toward the center (a more moderate political stance) after winning a primary election. Other projects have explored the political nature of censorship in Chinese social media, quantified the information content in financial disclosures before and after Sarbanes-Oxley reform, and predicted the success of bills in Congressional committees.
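
For readers curious what "parsing sentences into simpler syntactic representations" looks like in practice, the short Python sketch below uses the open-source spaCy library as a stand-in; it is an illustration under assumed tooling, not Smith's own research code, and the example sentence and model name are arbitrary choices.

    # Minimal dependency-parsing sketch using spaCy (chosen for illustration;
    # not Smith's research code).
    # Assumed setup: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline
    doc = nlp("Smith wrote a program that conjugates French verbs.")

    # Each token points to a syntactic head via a labeled relation,
    # generalizing the subject/predicate split of grade-school diagramming.
    for token in doc:
        print(f"{token.text:12} --{token.dep_:>6}--> {token.head.text}")

Running this shows, for example, that "Smith" is the nominal subject (nsubj) of "wrote" and "program" is its direct object: precisely the kind of structure that downstream semantic analysis builds on.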

Research developments like Smith's are also central to advances in recognizable technologies. Automatic translation between languages is becoming more useful and reliable for more languages than ever before. Computer programs that answer questions are becoming an everyday presence, not just in feats that awe the public, like Watson winning Jeopardy!, but in assistants that help with daily tasks, like Siri. In the next few years, anyone who faces the challenge of making sense of a heap of messy, unstructured text will have the opportunity to use new tools that improve understanding and save time. Smith and his team are active contributors to these technologies.

Bio

Before completing his graduate studies at Johns Hopkins University, Dr. Noah Smith received two bachelor's degrees, a B.S. in Computer Science and a B.A. in Linguistics.

Dr. Smith draws on this dual training not only to write computer code and create technology that decodes text into its intended meaning, but also to contribute to the analysis of those findings.

During his undergraduate study, Dr. Smith was a recipient of the Banneker/Key Scholarship. He was also awarded a Fannie and John Hertz Foundation Fellowship for his graduate studies.

Outside of the laboratory, Dr. Smith enjoys listening to and playing classical music, swimming, and dancing the Argentine tango with his wife, Dr. Karen Thickman.

In the News

What Twitter Says to Linguists

For researchers studying the use of language in today's networked world, social media is an invaluable tool

Do You Have a Twitter 'Accent'?

With over 225 million active users sharing over 500 million tweets a day, the social networking site is changing the way we communicate

Revealed: How China censors its social networks

The way the Chinese government censors and deletes politically sensitive terms online is revealed to work in real time and adapt quickly to emerging issues

Twitterology: A New Science?

Twitter is proving to be a "gold mine" for scholars in fields such as linguistics, sociology and psychology

Twitter a decent stand-in for public opinion polls

Sentiments expressed through daily tweets strongly correlate with well-established public opinion polls

Publications

Weakly-Supervised Bayesian Learning of a CCG Supertagger

We present a Bayesian formulation for weakly-supervised learning of a Combinatory Categorial Grammar (CCG) supertagger with an HMM

PDF

A Bayesian Mixed Effects Model of Literary Character

We consider the problem of automatically inferring latent character types in a collection of 15,099 English novels published between 1700 and 1899

PDF

A Sparse and Adaptive Prior for Time-Dependent Model Parameters

We consider the scenario where the parameters of a probabilistic model are expected to vary over time

PDF

CMU: Arc-Factored, Discriminative Semantic Dependency Parsing

We present an arc-factored statistical model for semantic dependency parsing

PDF

A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements

With the rapid development of web-based services, concerns about user privacy have heightened

PDF

Phrase Dependency Machine Translation with Quasi-Synchronous Tree-to-Tree Features

Recent research has shown clear improvement in translation quality by exploiting linguistic syntax for either the source or target language

PDF

Linguistic Structured Sparsity in Text Categorization

We introduce three linguistically motivated structured regularizers based on parse trees, topics, and hierarchical word clusters for text categorization.

PDF

Simplified Dependency Annotations with GFL-Web

We present GFL-Web, a web-based interface for syntactic dependency annotation with the lightweight FUDG/GFL formalism

PDF

Awards

2013 Five-year Retrospective Best Paper

Presented by the Workshop on Machine Translation

2011 National Science Foundation CAREER Award

Presented by the National Science Foundation

2010 SAS/International Institute of Forecasters Grant

Presented by the International Institute of Forecasters

2009 Best Conference Paper

Presented at the Joint Conference of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing

2008 Best Student Paper

Presented at the International Conference on Logic Programming

Fannie and John Hertz Graduate Student Fellowship

Presented by the Fannie and John Hertz Foundation, Awarded 2001-2006