HowTo-AzureML-create n-gram features using Feature Hashing for text data
AzureML has Vowpal Wabbit’s well-known hashing trick built in. It gives you low-cost hashing of text features, given a hashing bit size and the number of n-grams.
Steps
- Upload a text file or read one in. I took a bunch of text from a news site and loaded it as a text file.
- Use the Feature Hashing module and specify the hashing bit size and the n-grams setting.
The output is the hashed features as columns (visualized below). The result will mostly be sparse.
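To make the idea concrete outside AzureML, here is a minimal sketch of the hashing trick in plain Python. It is not the module's implementation: the function name `hash_features` and its parameters are made up for illustration, `zlib.crc32` stands in for whatever hash function AzureML or Vowpal Wabbit actually use, and it assumes the n-grams setting means "all n-grams up to that length".

```python
import zlib

def hash_features(text, bit_size=10, max_n=2):
    """Map the 1..max_n-grams of a text into a count vector of 2**bit_size slots.

    Hypothetical helper: crc32 is a stand-in hash, not the one AzureML uses.
    """
    tokens = text.lower().split()
    vector = [0] * (2 ** bit_size)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            # Hash the n-gram and fold it into the fixed number of columns.
            index = zlib.crc32(gram.encode("utf-8")) % len(vector)
            vector[index] += 1
    return vector

features = hash_features("I like Star trek movie", bit_size=10, max_n=2)
print(len(features))  # 1024 columns
print(sum(features))  # 9: 5 unigrams + 4 bigrams
```

With 1024 columns and only 9 n-grams, almost every column is zero, which is why the output is sparse. Different n-grams can also land in the same column, so the columns cannot be mapped back to text without extra bookkeeping.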
What is an n-gram? N-grams are contiguous sequences of n items from a given sequence. Given “I like Star trek movie”, the 2-gram output would be: I like, like Star, Star trek … I could not find a way to “print” the n-grams behind those feature columns in AzureML yet. Vowpal Wabbit has the “--invert_hash” option to print them out.
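The n-gram definition can be sketched in a few lines of Python. The helper name `ngrams` is hypothetical and splitting on whitespace is a simplifying assumption; AzureML's own tokenization may differ.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like Star trek movie".split()
print(ngrams(tokens, 2))  # ['I like', 'like Star', 'Star trek', 'trek movie']
```

A sentence of 5 words yields 4 bigrams, and in general a sequence of k tokens yields k - n + 1 n-grams when k is at least n.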
Background —