The Penn Treebank Tagset

22.12.2020 | Processing/POS Tagging/Tag Sets

The Penn Treebank part-of-speech tagset is given in the following table.

Tag	Description	Example
CC	Coordinating conjunction	and, but, or
CD	Cardinal number	one, two
DT	Determiner	a, the
EX	Existential 'there'	there
FW	Foreign word	mea culpa
IN	Preposition/subordinating conjunction	of, in, by
JJ	Adjective	tall
JJR	Comparative adjective	smaller
JJS	Superlative adjective	nicest
LS	List marker	1)
MD	Modal	could, will
NN	Noun, singular or mass	table
NNS	Noun plural	cars
NP	Proper noun, singular	Martin
NPS	Proper noun, plural	Vikings
PDT	Predeterminer	Both the girls
POS	Possessive ending	friend's
PP	Personal pronoun	I, he, it
PPZ	Possessive pronoun	my, his
RB	Adverb	however, usually, naturally, here, good
RBR	Adverb comparative	better
RBS	Adverb superlative	best
RP	Particle	give up
SENT	Sentence-break punctuation	. ! ?
SYM	Symbol	/ [ = *
TO	Infinitival "to"	to go
UH	Interjection	uhhuhhuhh
VB	Verb be, base form	be
VBD	Verb be, past tense	was, were
VBG	Verb be, gerund/present participle	being
VBN	Verb be, past participle	been
VBP	Verb be, sing. present, non-3d	am, are
VBZ	Verb be, third person sing. present	is
VH	Verb have, base form	have
VHD	Verb have, past tense	had
VHG	Verb have, gerund/present participle	having
VHN	Verb have, past participle	had
VHP	Verb have, sing. present, non-3d	have
VHZ	Verb have, third person sing. present	has
VV	Verb, base form	take
VVD	Verb, past tense	took
VVG	Verb, gerund/present participle	taking
VVN	Verb, past participle	taken
VVP	Verb, sing. present, non-3d	take
VVZ	Verb, 3rd person sing. present	takes
WDT	Wh-determiner	which
WP	Wh-pronoun	who, what
WP$	Possessive wh-pronoun	whose
WRB	Wh-adverb	where, when
#	#	#
$	$	$
''	Closing quotation marks	' "
``	Opening quotation marks	' "
(	Opening bracket	( {
)	Closing bracket	) }
,	Comma	,
:	Punctuation	- ; : ...
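
For programmatic use, the table can be held in a dictionary keyed by tag. The following is a minimal sketch that covers only a handful of rows; the names TAG_DESCRIPTIONS and describe_tag are hypothetical helpers introduced here for illustration and are not part of any library.

```python
# A few rows of the table above, keyed by tag (illustrative subset only).
TAG_DESCRIPTIONS = {
    "CD": "Cardinal number",
    "DT": "Determiner",
    "JJ": "Adjective",
    "JJR": "Comparative adjective",
    "JJS": "Superlative adjective",
    "NN": "Noun, singular or mass",
    "NNS": "Noun plural",
}


def describe_tag(tag: str) -> str:
    """Return the description for a tag, or a placeholder if it is not listed."""
    return TAG_DESCRIPTIONS.get(tag, "unknown tag")


print(describe_tag("JJR"))  # Comparative adjective
print(describe_tag("XYZ"))  # unknown tag
```

Returning a placeholder instead of raising an error is one possible design choice; it keeps a lookup over mixed tagger output from failing on punctuation tags that the subset does not include.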

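The nltk.pos_tag() function tags tokens with the Penn Treebank tag set by default. Its output uses the original Penn Treebank names for some of the tags listed above, for example NNP rather than NP for a singular proper noun and . rather than SENT for sentence-final punctuation. For example: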

import re
import nltk

# download a tagger
nltk.download('averaged_perceptron_tagger')

# define some sentence
sent1 = "Time flies like an arrow, but fruit flies like a banana."

# define some regex for tokenization
rex1 = "[A-Za-z0-9]+|[,.]"

# tokenize with regular expression
sent1_tok = re.findall(rex1, sent1)

# print the resulting pos tags
print(nltk.pos_tag(sent1_tok))

This results in:

[('Time', 'NNP'), ('flies', 'NNS'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN'), (',', ','), ('but', 'CC'), ('fruit', 'JJ'), ('flies', 'NNS'), ('like', 'IN'), ('a', 'DT'), ('banana', 'NN'), ('.', '.')]
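
The tagger output is a plain list of (token, tag) pairs, so it can be processed with ordinary Python tools. As a small sketch, the list printed above is copied in verbatim below (so the snippet runs without NLTK), and the tags are then counted and grouped. The variable names are illustrative assumptions.

```python
from collections import Counter, defaultdict

# The (token, tag) pairs printed above, copied verbatim.
tagged = [('Time', 'NNP'), ('flies', 'NNS'), ('like', 'IN'), ('an', 'DT'),
          ('arrow', 'NN'), (',', ','), ('but', 'CC'), ('fruit', 'JJ'),
          ('flies', 'NNS'), ('like', 'IN'), ('a', 'DT'), ('banana', 'NN'),
          ('.', '.')]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Collect the tokens that received each tag, in sentence order.
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

print(tag_counts["NNS"])    # 2
print(tokens_by_tag["NN"])  # ['arrow', 'banana']
```

Grouping the pairs this way makes it easy to see that both occurrences of "flies" were tagged NNS and both occurrences of "like" were tagged IN.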