TTIC
Toyota Technological Institute at Chicago  

Guy Lebanon

Purdue University - Machine Learning Reading Group

Sequential Document Representations and Simplicial Curves

May 19, 2006 12:00pm

Abstract:

The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation discards all sequential information. We present a continuous and differentiable sequential document representation that goes beyond the bag of words assumption, and yet is efficient and effective. This representation employs smooth curves in the multinomial simplex to account for sequential information. In contrast to n-grams, the new representation is able to robustly model long-range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for the task of text classification.
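
To make the contrast concrete, here is a minimal illustrative sketch and not the speaker's implementation. It assumes a Gaussian kernel over relative word position with a hypothetical bandwidth `sigma`, and all function names are invented for this example. Each kernel-weighted local histogram is a point in the multinomial simplex, and sweeping the kernel center `mu` from 0 to 1 traces a smooth curve.

```python
import math

def bag_of_words(tokens, vocab):
    """Global normalized histogram: a single simplex point, word order is lost."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = [0.0] * len(vocab)
    for t in tokens:
        counts[index[t]] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def local_histogram(tokens, vocab, mu, sigma=0.2):
    """Histogram weighted by a Gaussian kernel centered at relative position mu.

    The Gaussian kernel and the bandwidth sigma are assumptions of this sketch.
    """
    index = {w: i for i, w in enumerate(vocab)}
    n = len(tokens)
    weights = [0.0] * len(vocab)
    for pos, t in enumerate(tokens):
        x = (pos + 0.5) / n  # relative position of the word in [0, 1]
        weights[index[t]] += math.exp(-0.5 * ((x - mu) / sigma) ** 2)
    total = sum(weights)  # strictly positive, so the result lies in the simplex
    return [w / total for w in weights]

def simplex_curve(tokens, vocab, num_points=5, sigma=0.2):
    """Sample the curve mu -> local_histogram(mu) at evenly spaced positions."""
    return [local_histogram(tokens, vocab, i / (num_points - 1), sigma)
            for i in range(num_points)]

# Two toy documents with identical word counts but opposite word order.
vocab = ["a", "b"]
doc1 = ["a", "a", "a", "b", "b", "b"]
doc2 = ["b", "b", "b", "a", "a", "a"]

curve1 = simplex_curve(doc1, vocab)
curve2 = simplex_curve(doc2, vocab)
```

In this toy case the two documents share the same bag of words histogram, while their sampled curves start at different points of the simplex, which is the kind of sequential information the abstract refers to.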

If you have questions, or would like to meet the speaker, please contact Ponda at 4-1994 or pondabarnes@tti-c.org. For information on future TTI-C talks or events, please go to the TTI-C Events page.


