Here we use The Tell-Tale Heart by Edgar Allan Poe as the training corpus. The story is in the public domain. Download it from telltale.txt and place it in a folder called train_data.
We can tokenize the text using the following code:

    import re
    import nltk

    # standard regex for English words
    # no special chars like á
    re_words = "[a-zA-Z0-9']+"

    def tell_tale(with_tag):
        # returns a list of words/tokens
        retval = []
        with open("train_data/telltale.txt") as f:
            retval = re.findall(re_words, f.read())
        if with_tag:
            # if with_tag, then add pos-tags
            return [[x.lower(), nltk.pos_tag([x])[0][1]] for x in retval]
        else:
            return [x.lower() for x in retval]

Save it in a file called data_loader.py. nltk is used here to attach a POS tag to each token. Note that pos_tag expects a list of tokens: if we pass x without wrapping it in a list, it will tag each character of the word instead of the word itself.
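To see what the pattern actually keeps, here is a minimal sketch that runs the same regex on a made-up sentence (the sample text is an assumption for illustration, not a line from the story). It also shows, without calling nltk, why a bare string behaves like a sequence of characters.

```python
import re

# Same pattern as in data_loader.py: ASCII letters, digits and apostrophes.
re_words = "[a-zA-Z0-9']+"

# Made-up sample sentence (an assumption, not taken from the story).
sample = "I can't stop; the café's clock struck 12!"

# Lowercase each match, as tell_tale does when with_tag is False.
tokens = [t.lower() for t in re.findall(re_words, sample)]
print(tokens)
# → ['i', "can't", 'stop', 'the', 'caf', "'s", 'clock', 'struck', '12']

# A str is itself a sequence of characters, so anything that iterates
# over its argument sees single letters unless the word is wrapped in a list.
word = "clock"
print(list(word))  # → ['c', 'l', 'o', 'c', 'k']
print([word])      # → ['clock']
```

Note how the accented character splits "café's" into two tokens. For this English-only story that should rarely matter, but it is worth keeping in mind if the loader is reused on other texts.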