Using computer algorithms to understand and process natural languages

Dr. Noah Smith's fascination with computers began at an early age, but his first attempt to write code dealing with language stemmed from his seventh-grade homework. Intrigued by the regularities and irregularities of French, Smith wrote a program that would conjugate verbs for him. This early excitement about computer code, and about what makes human languages easy or difficult to learn, has grown into a successful career that merges Linguistics and Computer Science. Early research experiences as an undergraduate revealed the applications of this connection, giving Smith hands-on opportunities to explore and to think like a scientist.

Today, Smith's research program at Carnegie Mellon University develops algorithms that process language data in order to extract information and make useful inferences. Vast amounts of text are produced every day as a by-product of nearly every human pursuit: science, finance, government, and social communication. The ability of computers to "read" these data and offer human-interpretable analyses of what is going on in the world depends critically on natural language processing.

  • Though we all use natural language every day, the seemingly simple task of decoding a piece of text to find the meaning its author intended is a challenging one. Smith considers this problem from different angles, such as parsing sentences into simpler syntactic representations (not unlike the subject/predicate sentence diagramming many of us were required to do by hand in grade school; a small illustration follows this list), interpreting those sentences based on various semantic theories, and identifying the choices an author has made in framing the matters being discussed. This research builds on theoretical foundations in Computer Science, Linguistics, and Statistics, marrying theories of computation and language with a data-driven mentality.

  • Adding to the inherent challenges of natural language is the fact that the data are changing fast. More and more of our communication happens on the Internet through emails, blogs, and social media sites. The rise of platforms like Twitter and Facebook offers an unprecedented view of how a huge part of the population really uses language: unedited, with unconventional spellings, infusions of dialect, colloquialisms, and blends of different languages and styles. Some of the algorithms Smith has contributed in the past few years are specially designed to cope with the challenges of casual communication on the Internet, using robust machine learning on very large datasets (a toy example of this kind of text "wrangling" also appears after this list). In addition to wrangling this wild data, Smith has performed statistical analyses to understand how language varies geographically and changes over time. A recent study revealed the importance of demographics, in addition to geographic closeness and city size, in the spread of new words between population centers in the United States.

  • In developing text processing algorithms, Smith works closely with collaborators in the social sciences. One example is a collaboration with political scientists to automate the measurement of the ideological content of political speeches. The team found that presidential candidates' language in speeches shows clear movement toward the center (a more moderate political stance) after winning a primary election. Other projects have explored the political nature of censorship on Chinese social media, quantified the information content of financial disclosures before and after the Sarbanes-Oxley reforms, and predicted the success of bills in Congressional committees.
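
To make the idea of parsing concrete, here is a minimal sketch using spaCy, an open-source natural language processing library (chosen purely for illustration; it is not one of Smith's own research systems). It prints a dependency parse, a modern relative of grade-school sentence diagramming, in which every word is linked to the word it modifies.

```python
# A minimal dependency-parsing sketch using the open-source spaCy library
# (an illustrative stand-in, not one of Smith's research tools).
# Setup assumed: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee approved the controversial bill after a long debate.")

# Each word is linked to its grammatical head, forming a tree over the
# sentence -- the machine-readable counterpart of a sentence diagram.
for token in doc:
    print(f"{token.text:<15} {token.dep_:<10} head: {token.head.text}")
```

Run on a standard English model, this reports, for example, that "committee" is the subject of "approved" and "bill" is its object: exactly the subject/predicate structure on which semantic interpretation can build.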

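And to give a flavor of the text "wrangling" involved, here is a toy tokenizer, hypothetical and far simpler than research-grade tools, that shows why off-the-shelf text processing stumbles on social media: @mentions, #hashtags, URLs, and emoticons must be kept intact as single units rather than shattered into punctuation.

```python
import re

# Toy patterns illustrating why ordinary tokenizers stumble on tweets:
# @mentions, #hashtags, URLs, and emoticons are single meaningful units.
TOKEN_PATTERN = re.compile(r"""
      @\w+                      # @mentions
    | \#\w+                     # hashtags
    | https?://\S+              # URLs
    | [:;=][-o*']?[)(\][dDpP/]  # common emoticons such as :) ;-( =D
    | \w+(?:['-]\w+)*           # words, incl. contractions and hyphens
    | [^\w\s]                   # any other stray symbol
""", re.VERBOSE)

def tokenize(tweet: str) -> list[str]:
    """Split raw tweet text into tokens, keeping @mentions, #hashtags,
    URLs, and emoticons intact instead of breaking them apart."""
    return TOKEN_PATTERN.findall(tweet)

print(tokenize("sooo gooood :) gotta read this http://t.co/abc #nlproc @ark"))
```

A conventional tokenizer would split "http://t.co/abc" at the slashes and ":)" into two meaningless symbols; handling such tokens robustly is a small but essential step before any machine learning is applied.
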
Research developments like Smith's are also central to advances in recognizable technologies. Automatic translation between languages is becoming more useful and reliable for more languages than ever before. Computer programs that answer questions are becoming an everyday presence, not just as feats that awe the public, like IBM's Watson winning Jeopardy!, but as aids for daily tasks, like Apple's Siri. In the next few years, anyone who faces the challenge of making sense of a heap of messy, unstructured text will have the opportunity to use new tools that improve understanding and save time. Smith and his team are active contributors to these technologies.

Before completing his graduate studies at Johns Hopkins University, Dr. Noah Smith received two bachelor's degrees, a B.S. in Computer Science and a B.A. in Linguistics.

Dr. Smith uses this dual training not only to write computer code and create technology that decodes text into its intended meaning, but also to contribute to the analysis of the resulting findings.

During his undergraduate study, Dr. Smith was a recipient of the Banneker/Key Scholarship. He was also awarded a Fannie and John Hertz Foundation Fellowship for his graduate studies.

Outside of the laboratory, Dr. Smith enjoys listening to and playing classical music, swimming, and dancing the Argentine tango with his wife, Dr. Karen Thickman.

For researchers studying the use of language in today's networked world, social media is an invaluable tool.

With over 225 million active users sharing over 500 million tweets a day, Twitter is changing the way we communicate.

Chinese government censorship of politically sensitive terms online has been shown to operate in real time and to adapt quickly to emerging issues.

Twitter is proving to be a "gold mine" for scholars in fields such as linguistics, sociology, and psychology.

Sentiments expressed through daily tweets strongly correlate with well-established public opinion polls.

2013 Five-year Retrospective Best Paper

Presented by the Workshop on Machine Translation

2011 National Science Foundation CAREER Award

Presented by the National Science Foundation

2010 SAS/International Institute of Forecasters Grant

Presented by the International Institute of Forecasters

2009 Best Conference Paper

Presented at the Joint Conference of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL-IJCNLP)

2008 Best Student Paper

Presented at the International Conference on Logic Programming (ICLP)

Fannie and John Hertz Graduate Student Fellowship

Presented by the Fannie and John Hertz Foundation, awarded 2001-2006