Data Science Tea: MS Industry Mentorship Student Presentations

Monday, 05/08/2017 4:00pm to 5:00pm
Computer Science Building Room 150/151

Please join us for MS Industry Mentorship Student Presentations at Data Science Tea!

There will be four presentations from the inaugural UMass Amherst Data Science MS Student Independent Study Project Program with Industry Mentors, organized by Professor Andrew McCallum.


      Presentations:

Metadata extraction from research publications
      Molly McMahon, Sheshera Mysore, Akul Siddalingaswamy, Aditya Narasimha Shastry
      Mentor: Meta/Chan Zuckerberg Initiative - Ofer Shai, Shankar Vembu

      Abstract: In this project, we'll examine the use of machine
      learning techniques to automate and improve the very first
      step of the manuscript processing pipeline: identifying and
      extracting elements such as the title, abstract, authors, and
      affiliations from the PDF file. Unlike published articles, PDF
      manuscripts have no consistent formatting and often contain
      headers added for editors and reviewers, line numbers, and
      other noisy elements. Cleanly labeling the metadata is crucial
      for proper downstream processing. We will leverage existing
      PDF parsing technology that provides detailed information
      about the text and formatting of the manuscript, and we will
      experiment with Logistic Regression, Feed-forward Neural
      Networks, and Bidirectional LSTMs.
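
Below is a minimal, hypothetical sketch of the kind of line-classification step described above, assuming line-level text and formatting features from a PDF parser. The feature names and sample data are illustrative, and the project also mentions feed-forward networks and BiLSTMs as alternatives to the logistic regression used here.

# A minimal sketch, not the project's actual pipeline: classify each text line
# extracted from a manuscript PDF into metadata fields (title, author, ...)
# using hand-built layout features and logistic regression. The feature names
# and the `parsed_lines` input format are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def line_features(line):
    """Turn one parsed PDF line (text plus formatting info) into features."""
    return {
        "font_size": line["font_size"],           # large fonts often mark titles
        "is_bold": line["bold"],
        "y_position": line["y"],                  # metadata clusters near the top
        "num_tokens": len(line["text"].split()),
        "has_email": "@" in line["text"],         # e-mails hint at author blocks
        "has_digit_prefix": line["text"][:3].strip().isdigit(),  # line numbers
    }

# In practice, parsed_lines and labels would come from a PDF parser and
# annotated manuscripts; these two rows are toy examples.
parsed_lines = [
    {"text": "Deep Parsing of Manuscripts", "font_size": 18, "bold": True, "y": 0.05},
    {"text": "jane.doe@example.edu", "font_size": 10, "bold": False, "y": 0.12},
]
labels = ["title", "author"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit([line_features(l) for l in parsed_lines], labels)
print(model.predict([line_features(parsed_lines[0])]))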

Concept/Theme Roll Up
      Tanvi Sahay (Presenter), Ramteja Tadishetti, Ankita Mehta, Shruti Jadon
      Mentor: Lexalytics Inc. - Paul Barba, Al Hough and Brian Pinette

      Abstract: The main ideas in any set of sentences can be
      represented as a set of key phrases that convey the theme(s)
      of those sentences. One downstream application of interest is
      to roll up similar themes together, so that a user can query
      all phrases belonging to a particular theme without sifting
      through information that is not of direct interest. This is
      particularly useful in the domain of hotel reviews, where a
      user may care more about the location of the hotel than the
      type of food it serves, and thus only wants to see reviews
      about the location. We try to solve this problem by preparing
      a distributed representation of the phrases (for which we have
      experimented with various methods) and clustering similar
      phrases together using this representation.
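
As a rough illustration of the roll-up idea (not Lexalytics' implementation), the sketch below clusters key phrases, using TF-IDF character n-gram vectors as a stand-in for the distributed phrase representations the team experimented with; the phrases are invented hotel-review examples.

# A minimal sketch of theme roll-up: represent each key phrase as a vector and
# group similar phrases so a user can query one theme (e.g. "location") at a
# time. TF-IDF character n-grams stand in for learned phrase embeddings here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

phrases = [
    "close to downtown", "great location", "walking distance to the beach",
    "tasty breakfast buffet", "excellent room service food", "delicious dinner menu",
]

# Character n-grams give partial overlap between related phrases even when
# they share few whole words.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)).fit_transform(phrases)
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(vectors.toarray())

for label in sorted(set(clusters)):
    theme = [p for p, c in zip(phrases, clusters) if c == label]
    print(f"theme {label}: {theme}")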


Multilingual Embeddings using ACS for Cross-Lingual NLP
      Nitin Kishore, Daniel Sam Pete Thiyagu, Shamya Karumbaiah
      Mentor: Oracle - Michael Wick and Pallika Kanani

      Abstract: Oracle is a multinational corporation that develops
      products and builds tools in many different languages. An
      important practical problem is to make natural language
      processing (NLP) tools (document classification, named entity
      recognition, etc.) available in every such language.
      Traditionally, an NLP practitioner would collect training data
      in every language for every task for every domain, but such
      data collection is expensive and time-consuming. Further, many
      resources available in a language such as English are not
      available in languages with fewer speakers. In this project, we
      want to explore a solution to multilingual NLP that does not
      require labeling so much data. In particular, we would like to
      harness unlabeled multilingual data to learn a common
      representation under which structure is shared across
      different languages. For example, in such a space, the vector
      for the English word "good" is close to the vector for the
      French word "bon." Then, by employing Artificial Code Switching
      (ACS) and using the multilingual representation as features, we
      can train a classifier in one language and have it generalize
      to other languages without much additional labeled data. To
      that end, we:

       (1) explore how to learn a good multilingual representation,
       (2) study how the number and class of languages affect the
             quality of the multilingual embedding space, and
       (3) study how well the multilingual representations allow us to
             transfer NLP models across different languages.
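
The sketch below illustrates the artificial code switching idea in its simplest form, assuming a tiny illustrative English-French lexicon: words in monolingual sentences are randomly swapped with dictionary translations, so that an embedding model trained on the mixed corpus would place translation pairs such as "good" and "bon" nearby. This is a toy illustration, not Oracle's system.

# A minimal sketch of artificial code switching (ACS): randomly replace words
# in monolingual training sentences with dictionary translations to produce a
# mixed corpus for embedding training. The lexicon and corpus are illustrative.
import random

EN_FR = {"good": "bon", "hotel": "hôtel", "food": "nourriture", "very": "très"}

def code_switch(sentence, lexicon, rate=0.5, rng=random):
    """Swap each word for its translation with probability `rate`."""
    return [
        lexicon[tok] if tok in lexicon and rng.random() < rate else tok
        for tok in sentence
    ]

corpus = [["the", "hotel", "food", "was", "very", "good"]]
random.seed(0)
mixed_corpus = [code_switch(s, EN_FR) for s in corpus for _ in range(5)]
for s in mixed_corpus:
    print(" ".join(s))

# An embedding model (e.g. word2vec or fastText) trained on `mixed_corpus`
# would then supply shared features for a classifier trained in one language
# and applied to another.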


Career Path Analysis With Topical Sequence Models
      Dan Saunders, Ananya Suraj (Presenters), Kartik Chhapia, Suraj Subraveti
      Mentor: Center for Data Science - Matt Rattigan

    
      Abstract: Previous efforts (Mimno and McCallum, 2008) have demonstrated the
      usefulness of topic models for understanding the dynamics of the
      job market. Using a corpus of resumes as training data, we can
      build a topic model which captures the important facets of a
      resume, where the "topics" are distributions over words
      typically found in job descriptions. We can use these topics to
      construct a "topical sequence" model to predict job
      transitions for individuals over time. The goal of this project
      is to build on the previous work in this area and expand its
      scope to better understand workforce characteristics more
      generally. Example questions we try to answer include: What is
      the next role for a particular person, given their resume? What
      types of roles have the most variability in terms of career
      path?
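
As a toy illustration of the topical-sequence idea (not the project's actual model), the sketch below fits a small topic model over invented job descriptions, maps each job to its dominant topic, and estimates a first-order transition matrix between topics to suggest a likely next step; all data and hyperparameters are assumptions for illustration.

# A minimal sketch of a "topical sequence" over resumes: LDA topics per job,
# then a Markov transition matrix between the dominant topics of consecutive
# jobs. The resumes below are invented examples.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each resume is a chronological list of job descriptions.
resumes = [
    ["data entry and spreadsheet reporting", "sql reporting and dashboards",
     "machine learning model development"],
    ["customer support and ticket triage", "sql reporting and dashboards",
     "product analytics and experiments"],
]

jobs = [job for resume in resumes for job in resume]
counts = CountVectorizer().fit_transform(jobs)

n_topics = 3
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
dominant = lda.fit_transform(counts).argmax(axis=1)  # dominant topic per job

# Rebuild per-resume topic sequences and count topic-to-topic transitions.
transitions = np.ones((n_topics, n_topics))  # add-one smoothing
i = 0
for resume in resumes:
    topics = dominant[i:i + len(resume)]
    i += len(resume)
    for a, b in zip(topics[:-1], topics[1:]):
        transitions[a, b] += 1
transitions /= transitions.sum(axis=1, keepdims=True)

current_topic = dominant[0]
print("most likely next topic:", transitions[current_topic].argmax())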