Here we use Ulysses by James Joyce as the training corpus. The novel is in the public domain. Download it from ulysse.txt.
We can tokenize the text using the following code:

    import re
    import nltk

    # Simple regex for English words: letters, digits and apostrophes.
    # Special characters such as á are not matched.
    re_words = "[a-zA-Z0-9']+"

    def ulysses(with_tag):
        # Returns a list of lowercased words/tokens.
        retval = []
        with open("train_data/ulysses.txt") as f:
            retval = re.findall(re_words, f.read())
        if with_tag:
            # If with_tag is set, pair each token with its POS tag.
            return [[x.lower(), nltk.pos_tag([x])] for x in retval]
        else:
            return [x.lower() for x in retval]
Save it in a file called data_loader.py. nltk is used to obtain POS tags for the tokens. Note that pos_tag expects a list of tokens: if we pass x directly instead of [x], it iterates over the string and tags each character of the word separately.
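To see what the regex keeps and drops, and why a bare string gets split into characters, here is a minimal sketch that runs without nltk or the corpus file. The names `RE_WORDS`, `tokenize` and `sample` are made up for this illustration, and the sample string is invented rather than taken from the corpus file.

```python
import re

# Same pattern as in data_loader.py: letters, digits and apostrophes only.
RE_WORDS = "[a-zA-Z0-9']+"

def tokenize(text):
    # Hypothetical helper mirroring the with_tag=False branch of ulysses().
    return [tok.lower() for tok in re.findall(RE_WORDS, text)]

# Invented sample string with punctuation, an apostrophe and an accented char.
sample = "Stately, plump Buck Mulligan's café"
tokens = tokenize(sample)
print(tokens)

# A string is itself a sequence of characters, so any function that iterates
# over its argument (as pos_tag does over its token list) sees single letters
# unless the word is wrapped in a list first.
as_chars = list("Dublin")
as_word = list(["Dublin"])
print(as_chars)
print(as_word)
```

Punctuation is discarded, the apostrophe stays inside its word, and the accented character cuts the last token short, which may matter if the corpus contains many non-ASCII words.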