Advancing machine learning of natural language enables computers to understand language and the world through text, while improving their interaction with humans

Human knowledge is largely encoded in natural language. A longstanding goal of artificial intelligence has been to automate the understanding of natural language. However, creating an appropriate representation for the meaning of language is problematic. Approaches using complex semantic representations are difficult to scale because they must cover the broad range of expressions used in actual language. Dr. Chris Callison-Burch, the Aravind K. Joshi Term Assistant Professor in the Computer and Information Science Department at the University of Pennsylvania, is interested in computer understanding of human language, specifically computational linguistics and natural language processing. His innovative work enables computers to understand language, learn about the world through text, and improve their interaction with humans. Dr. Callison-Burch’s research has the potential to unlock the huge volume of human knowledge encoded in text form and make it analyzable by computers. His approach uses pairs of English phrases as the basic unit of representation, which are then automatically labeled with a small number of semantic relationships that allow a subset of automated reasoning to be applied. This design decision enables Dr. Callison-Burch to scale to open domains and implement data-driven algorithms for acquiring semantic knowledge about language. The results of his work can help researchers in multiple fields, such as sociology, medicine, public health, and political science.

Dr. Callison-Burch and his team of computer science students closely collaborate with professors and researchers in the fields of epidemiology, political science, and linguistics, to name a few. Their current research is largely focused on understanding the causes and effects of gun violence in the U.S. The U.S. Congress has blocked major organizations from collecting data about gun violence. This hinders epidemiological research that would allow academics and policy makers to understand the risk factors and possible remedies for gun deaths. Dr. Callison-Burch and his team, in collaboration with an epidemiology professor at the University of Pennsylvania, combine crowdsourcing and natural language processing for their public health studies of gun violence. They created a website that catalogues all incidents of gun violence reported online. The site uses machine learning, natural language processing, and volunteer labor from individuals to create a structured database of gun violence. They use an artificial intelligence algorithm to help them classify information about each incident (fatal or nonfatal, location, number of victims, demographic information of the shooter, etc.). This concrete database uses computer-aided understanding of language to facilitate the research of scientists who want to understand these gun violence incidents.

Current Research Includes:

Using Crowdsourcing to Explore New Areas of Natural Language Processing — Natural language processing is a heavily machine learning-oriented field, which depends on large amounts of data that have been categorized or annotated by humans. Dr. Callison-Burch and his team are creating new datasets using crowdsourcing platforms, such as Amazon Mechanical Turk. Instead of using experts to annotate data, which is an expensive procedure, he and his team are reframing the data creation process of machine learning as micro tasks that can be performed by anyone on the internet. Dr. Callison-Burch and his team use crowdsourcing in their gun violence project to create the structured database. The diverse, multilingual workforce that creates the data also enables them to train machines to learn new languages.
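As a rough illustration of how micro tasks can stand in for expert annotation, the sketch below combines several workers' labels for the same item by majority vote and reports how strongly they agree. The profile does not say how the team aggregates crowd judgments, so the voting scheme, the function name `aggregate_labels`, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def aggregate_labels(annotations):
    """Pick the most common label per item and report worker agreement.

    annotations maps an item id to the list of labels that different
    workers assigned to it (a hypothetical input format).
    Returns {item_id: (label, fraction_of_workers_who_chose_it)}.
    """
    results = {}
    for item_id, labels in annotations.items():
        if not labels:
            continue  # nobody labeled this item, so there is nothing to vote on
        # most_common(1) returns the top label; ties go to the label seen first
        label, votes = Counter(labels).most_common(1)[0]
        results[item_id] = (label, votes / len(labels))
    return results


# Hypothetical judgments from three workers on two news reports
crowd_annotations = {
    "report-1": ["fatal", "fatal", "nonfatal"],
    "report-2": ["nonfatal", "nonfatal", "nonfatal"],
}
aggregated = aggregate_labels(crowd_annotations)
```

Items with a low agreement fraction could be sent back for more judgments or passed to an expert, which keeps expert time for the hard cases only.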

Automate the Understanding of English Through Paraphrases — Because the same ideas can be expressed in a wide variety of ways, trying to make computers understand language through a fixed set of rules becomes problematic. Dr. Callison-Burch and his team are adapting the data, representations, and algorithms from statistical machine translation for natural language processing. Through paraphrasing, computers can automatically learn a wide variety of expressions, which helps with applications such as conversational agents (think Amazon’s Alexa or Apple’s Siri). His work can also help with extracting information from the web, which he does as part of his gun violence site. Further, Dr. Callison-Burch has an algorithm that automatically learns paraphrases, which are equivalent expressions, from data. He created a paraphrase database that has more than 169 million paraphrase transformation rules; typing in a word brings up its paraphrases.
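One way translation data can yield paraphrases is pivoting: two English phrases that translate to the same foreign phrase are treated as candidate paraphrases. The profile does not spell out the algorithm, so the sketch below is an assumption, as are the function name `pivot_paraphrases` and the toy probabilities. It scores a paraphrase e2 of e1 as the sum over foreign phrases f of p(f | e1) * p(e2 | f).

```python
from collections import defaultdict


def pivot_paraphrases(e_to_f, f_to_e):
    """Score English paraphrases by pivoting through foreign phrases.

    e_to_f[e1][f] is p(f | e1); f_to_e[f][e2] is p(e2 | f).
    Returns {e1: {e2: score}} with score = sum_f p(f | e1) * p(e2 | f).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, foreign in e_to_f.items():
        for f, p_f_given_e1 in foreign.items():
            for e2, p_e2_given_f in f_to_e.get(f, {}).items():
                if e2 != e1:  # a phrase is not counted as its own paraphrase
                    scores[e1][e2] += p_f_given_e1 * p_e2_given_f
    return {e1: dict(candidates) for e1, candidates in scores.items()}


# Toy English-German phrase tables (hypothetical numbers)
english_to_german = {
    "under control": {"unter kontrolle": 0.8, "im griff": 0.2},
}
german_to_english = {
    "unter kontrolle": {"under control": 0.6, "in check": 0.4},
    "im griff": {"under control": 0.5, "in hand": 0.5},
}
paraphrases = pivot_paraphrases(english_to_german, german_to_english)
```

With these toy numbers, "in check" scores higher than "in hand" as a paraphrase of "under control" because it shares the more probable German translation.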

Extend Machine Translation — Machine translation has been one of the biggest successes of artificial intelligence and machine learning over the past 20 years. It has gone from science fiction to fact, resulting in services like Google Translate. It works well, but has imperfections. For instance, it only accommodates about 70 languages, though around 3,000 languages exist. Dr. Callison-Burch and his team are working on having computers learn translations from monolingual texts in two languages, so that machine translation may be applied to a wider range of languages. Today, a machine translation system is taught how to translate by reading a large number of examples. For example, for French to English translations, the machine will look at a million sentences in French paired with their human translations into English. For the remaining untranslated languages, Dr. Callison-Burch and his team are looking into the next generation of algorithms for machine translation to see if those other languages can be learned from other types of data. If they have a large archive of articles in a certain language but no equivalent for English from the same period, Dr. Callison-Burch and his team rely on cues of translation equivalence, such as the frequency of a word across time, as well as visual similarity. Their ultimate goal is to learn translations for languages that are not currently covered in tools like Google Translate, so that they can extend machine translation to new languages.
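The frequency-over-time cue can be illustrated with a small sketch: a foreign word and its English translation tend to rise and fall in the news together, so candidate translations can be ranked by how similar their frequency curves are. This is only a sketch under stated assumptions: it assumes word counts for both languages are available over the same time buckets, it uses cosine similarity as the comparison, and the names `cosine` and `rank_candidates` and all counts are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_series, candidate_series):
    """Rank candidate translations by temporal similarity to the source word.

    source_series: counts of the foreign word per time bucket.
    candidate_series: {english_word: counts over the same time buckets}.
    Returns a list of (english_word, similarity), best first.
    """
    scored = [(word, cosine(source_series, series))
              for word, series in candidate_series.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical monthly counts: the foreign word spikes in month 3
foreign_word_counts = [0, 1, 9, 2, 0]
english_candidates = {
    "election": [5, 5, 0, 0, 6],
    "earthquake": [1, 2, 18, 3, 0],
}
ranking = rank_candidates(foreign_word_counts, english_candidates)
```

In this toy data the word whose usage spikes in the same month as the foreign word ranks first. In practice such a signal would be one of several cues combined, not a translation decision on its own.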

Dr. Chris Callison-Burch has always been fascinated by human language. Specifically, he has been interested in the innate ability of humans to communicate with one another, and the way language allows us to encode information and produce art. His passion for computer science and linguistics started when he was an undergraduate at Stanford University, where he earned an interdisciplinary degree combining Linguistics, Computer Science, Psychology and Philosophy. His ultimate goal is to allow computers to understand language with the same capacity that humans do.

Dr. Callison-Burch’s current approach to language understanding is inspired by his past research in machine translation. The advent of data-driven, statistical models has resulted in dramatically improved quality for machine translation. Commercial systems like Google Translate, and the state-of-the-art research software that he helped develop, use pairs of English and foreign phrases as their underlying representation. These phrase pairs are automatically acquired from a large volume of translated documents, and are treated as meaning-equivalent without having an explicit semantic representation. Vast quantities of bilingual training data enable a huge number of phrase pairs to be extracted and their associated probabilities to be estimated. He assembled the largest publicly available bilingual training data for statistical machine translation, consisting of 22 million sentence pairs with 1.5 billion French + English words. This encompasses a huge range of language use from scientific abstracts to movie dialog slang, and thus allows the system to translate a wide variety of input sentences.
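To make the idea of estimating probabilities for phrase pairs concrete, the sketch below uses relative frequency: the probability of an English phrase given a foreign phrase is the number of times the pair was extracted divided by the number of times the foreign phrase was extracted. The profile does not state which estimator his systems use, so this choice, the function name `estimate_translation_probs`, and the toy pairs are assumptions.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(phrase_pairs):
    """Estimate p(english | foreign) by relative frequency.

    phrase_pairs: list of (foreign_phrase, english_phrase) tuples, one per
    extracted occurrence (a list, because it is traversed twice).
    Returns {foreign_phrase: {english_phrase: probability}}.
    """
    pair_counts = Counter(phrase_pairs)
    foreign_counts = Counter(foreign for foreign, _ in phrase_pairs)
    probs = defaultdict(dict)
    for (foreign, english), count in pair_counts.items():
        probs[foreign][english] = count / foreign_counts[foreign]
    return dict(probs)


# Hypothetical phrase pairs extracted from aligned French-English sentences
extracted_pairs = [
    ("chat", "cat"),
    ("chat", "cat"),
    ("chat", "chat"),
    ("chien", "dog"),
]
translation_probs = estimate_translation_probs(extracted_pairs)
```

With more training data, each foreign phrase is observed more often, so these ratios become more reliable and cover more phrases, which is why the size of the bilingual corpus matters.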

In addition to his research, Dr. Callison-Burch also works to promote women in computer science. As a faculty member at Johns Hopkins University, he was the chair of the diversity committee for the Computer Science department. He helped to start a Women in Computer Science (WiCS) group. The goals of WiCS were to foster a sense of community and to improve retention of women in the undergraduate program, offering undergraduates mentorship from female graduate students and from faculty. He has mentored and continues to mentor PhD students, postdocs, and visiting scholars, many of whom are women. He hopes to continue improving the gender diversity of the computer science department at the University of Pennsylvania, and feels that the best way of doing so is to engage women in research. You can read his full research statement here.

Sloan Fellow, 2014