08.01.2021 | Processing/Corpus


The Tell-Tale Heart

Here we use Ulysses by James Joyce as the training corpus. The text is in the public domain. Download it from ulysse.txt and save it as train_data/ulysses.txt, the path the code below reads from.

We can tokenize the text using the following code:

    import re
    import nltk

    # standard regex for English words
    # no special chars like á
    re_words = "[a-zA-Z0-9']+"

    def ulysses(with_tag):
        # returns a list of words/tokens
        retval = []
        with open("train_data/ulysses.txt") as f:
            # find every run of letters, digits and apostrophes in the file
            retval = re.findall(re_words, f.read())
        if with_tag:
            # if with_tag, then add pos-tags (tagged before lowercasing)
            return [[x.lower(), nltk.pos_tag([x])[0][1]] for x in retval]
        else:
            return [x.lower() for x in retval]
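To get a feel for what this pattern keeps and what it drops, here is a minimal sketch that applies the same regex to a made-up sentence instead of the corpus file. The helper name `tokenize` and the sample sentence are assumptions for illustration only; they are not part of the original code.

```python
import re

# Same pattern as above: ASCII letters, digits and apostrophes only.
RE_WORDS = "[a-zA-Z0-9']+"

def tokenize(text):
    """Return the lowercased tokens of `text` (hypothetical helper)."""
    return [tok.lower() for tok in re.findall(RE_WORDS, text)]

# Made-up sample; \u00e9 is the accented letter in "cafe".
sample = "Don't panic: the caf\u00e9 opens at 9, Mr. Bloom!"
tokens = tokenize(sample)
print(tokens)
# ["don't", 'panic', 'the', 'caf', 'opens', 'at', '9', 'mr', 'bloom']
```

Apostrophes stay inside a token, punctuation is dropped, and an accented letter ends the token early ("caf"), which is the price of keeping the character class ASCII-only.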

Save this code in a Python file. nltk is used to attach POS tags to the tokens. Note that pos_tag expects a list of tokens: if we pass x instead of [x], pos_tag will iterate over the string and tag each character of the word.
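The list-versus-string pitfall can be reproduced without nltk installed. The sketch below uses `toy_tag`, a hypothetical stand-in that, like a tagger taking a token sequence, simply iterates over whatever it receives; the "TAG" label is a placeholder, not a real POS tag.

```python
def toy_tag(tokens):
    """Hypothetical stand-in for a tagger: pairs every item of
    `tokens` with a placeholder tag, iterating over its argument."""
    return [(tok, "TAG") for tok in tokens]

word = "heart"

as_list = toy_tag([word])   # the word wrapped in a list: one token
as_string = toy_tag(word)   # a bare string: iterated character by character

print(as_list)         # [('heart', 'TAG')]
print(len(as_string))  # 5
```

A Python string is itself a sequence of characters, so any function that loops over its input sees five one-letter "tokens" unless the word is wrapped in a list first.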

