HowTo-AzureML-create n-gram features using Feature Hashing for text data

Govind Kanshi
1 min read · Aug 11, 2014

AzureML has the well-known hashing trick from Vowpal Wabbit built in. It gives you cheap, low-overhead hashing of text features into a fixed number of columns; you specify the hashing bit size and the number of n-grams.

Steps

  1. Upload a text file or read one in. I just took a bunch of text from a news site and loaded it as a text file.
  2. Use the Feature Hashing module and specify the hashing bit size and the number of n-grams.
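To make the two parameters concrete, here is a minimal sketch of the hashing trick in plain Python. It is not the AzureML module or the Vowpal Wabbit implementation: `zlib.crc32` stands in for the real hash function, the function names are made up for illustration, and I am assuming that an n-gram setting of 2 means "unigrams plus bigrams".

```python
import zlib
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous sequences of 1..n tokens, joined by spaces."""
    out = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

def hash_features(text, bits=10, n=2):
    """Count the n-grams of `text` in 2**bits hash buckets (sparse dict)."""
    num_buckets = 1 << bits  # the bit size fixes the number of columns
    tokens = text.lower().split()
    counts = Counter()
    for gram in ngrams(tokens, n):
        # crc32 is an assumption here, used only because it ships with Python
        index = zlib.crc32(gram.encode("utf-8")) % num_buckets
        counts[index] += 1
    return dict(counts)

features = hash_features("I like Star trek movie", bits=10, n=2)
print(features)  # maps column index -> count
```

A larger bit size means more columns and fewer collisions; a larger n-gram setting means more features get hashed into the same fixed set of columns.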

The output is the set of hashed features, one per column. The columns are mostly sparse.
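Why sparse? With a bit size of 10 there are 1024 columns, and a short text touches only a handful of them. A rough sketch, hashing single words only and again using `zlib.crc32` as a stand-in hash (the helper name `hashed_row` is hypothetical):

```python
import zlib

def hashed_row(text, bits=10):
    """Build one dense row of 2**bits columns with hashed word counts."""
    row = [0] * (1 << bits)
    for word in text.lower().split():
        row[zlib.crc32(word.encode("utf-8")) % len(row)] += 1
    return row

row = hashed_row("I like Star trek movie")
nonzero = sum(1 for value in row if value != 0)
print(f"{nonzero} of {len(row)} columns are non-zero")
```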

What is an n-gram? N-grams are contiguous sequences of n items from a given sequence. Given “I like Star trek movie”, the 2-gram output would be: I like, like star, star trek … I have not yet found a way to “print” the values behind those n-gram columns in AzureML. Vowpal Wabbit has the “--invert_hash” option to print them out.
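The idea of mapping hashed columns back to readable n-grams can be sketched outside AzureML. The snippet below builds the 2-grams for the example sentence and keeps a reverse lookup from bucket index to the n-grams that landed there. It only illustrates the concept; it is not how Vowpal Wabbit implements `--invert_hash`, and both function names and the use of `zlib.crc32` are assumptions.

```python
import zlib
from collections import defaultdict

def bigrams(text):
    """Return the contiguous 2-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def inverted_hash(grams, bits=10):
    """Map each bucket index back to the n-grams hashed into it."""
    buckets = defaultdict(list)
    for gram in grams:
        index = zlib.crc32(gram.encode("utf-8")) % (1 << bits)
        buckets[index].append(gram)
    return dict(buckets)

grams = bigrams("I like Star trek movie")
print(grams)  # ['I like', 'like Star', 'Star trek', 'trek movie']

for index, names in sorted(inverted_hash(grams).items()):
    print(index, names)
```

If two n-grams collide, they show up in the same bucket list, which is exactly the information lost once only the hashed columns are kept.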

Background —

http://alex.smola.org/papers/2009/Weinbergeretal09.pdf

http://www.cse.wustl.edu/~kilian/papers/ceas2009-paper-11.pdf


Govind Kanshi

I help create reliable, pragmatic software solutions using dainty words like Cloud and Data. I work on the Azure Cosmos DB team.