Ulysses

08.01.2021 | Processing/Corpus


Here we use Ulysses by James Joyce as the training corpus. The novel is in the public domain. Download it from ulysse.txt and save it as train_data/ulysses.txt, which is the path the code expects.

We can tokenize the text using the following code:

import re
import nltk

# standard regex for english
# no special chars like á
re_words = "[a-zA-Z0-9']+"

def ulysses(with_tag):
    # returns a list of words/tokens
    retval = []
    with open("train_data/ulysses.txt") as f:
        retval = re.findall(re_words, f.read())
    if with_tag:
        # if with_tag, then add pos-tags
        return [[x.lower(), nltk.pos_tag([x])[0][1]] for x in retval]
    else:
        return [x.lower() for x in retval]

Save the code in a file called data_loader.py. NLTK is used to attach a POS tag to each token. Note that pos_tag expects a list of tokens: if we pass the bare string x instead of the list [x], it iterates over the string and tags each character of the word separately.
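
To see what the regular expression actually keeps, here is a minimal sketch that applies the same pattern to a short sample string instead of the full file. The helper name tokenize and the sample sentence are assumptions made for illustration and are not part of data_loader.py. The last lines also show, without calling NLTK, why a bare string and a one-element list behave differently: a string is itself a sequence of characters.

```python
import re

# Same pattern as in data_loader.py
re_words = "[a-zA-Z0-9']+"

def tokenize(text):
    # Hypothetical helper: lowercased tokens from a string rather than a file
    return [tok.lower() for tok in re.findall(re_words, text)]

# Illustrative sample, not read from train_data/ulysses.txt
sample = "Stately, plump Buck Mulligan came from the stairhead. Don't say Sláinte!"
tokens = tokenize(sample)
print(tokens)

# Iterating over a string yields characters; a list holds the word as one item
as_chars = list("plump")
as_token = list(["plump"])
print(as_chars, as_token)
```

Punctuation is dropped, the apostrophe in "Don't" is kept, and the accented á is not matched by the pattern, so "Sláinte" is split into two tokens. If the corpus contains many accented words, a wider character class may be worth considering.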