{"_id":"584aeea89588370f00608a74","sync_unique":"","hidden":false,"isReference":false,"title":"Tokenization and POS Tagging","excerpt":"","type":"basic","category":"584aeea89588370f00608a3f","project":"559ae8ec7ae7f80d0096d813","next":{"pages":[],"description":""},"link_external":false,"body":"One of the very first steps necessary in processing text is to break the text apart into tokens and to group those tokens into sentences. We use the word \"tokens\" and not \"words\" because tokens can also be things like:\n\n* punctuation (exclamation points affect sentiment, for instance)\n* links (http://...)\n* possessive markers\n* and the like.\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Language Approach\"\n}\n[/block]\nFor most European languages, tokenization is fairly straightforward - look for white space, look for punctuation, and the like. Each language has its own features, though - for instance, German makes extensive use of compound words and for some purposes such as sentiment it can be worth it to tokenize to the sub-word level. \n\nSome languages, such as Chinese, have no space breaks between words and tokenizing those languages requires the use of more sophisticated statistical models. Lexalytics has developed tokenization models for all of our supported languages.\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Parts of Speech\"\n}\n[/block]\nMost NLP and text mining tools make use not just of a bucket of tokens but also the parts of speech. Knowing what part of speech a token is makes it more useful. Proper nouns (Lexalytics) are more likely to be a mention of person, place, or company, adjectives (terrible) are more likely to be sentiment phrases, and so on. In most languages, single words can be of multiple speech types depending on context - \"Love makes the world go round\" has \"love\" as a noun, while \"I love NLP\" has love as a verb. Determining the part of speech for a token requires evaluating the context the word appears in. \n\nLexalytics has developed POS tagging models for most of its supported languages, and returns POS tags along with the text output if desired. Our set of POS tags is an extension of the Penn Treebank set of POS tags.","parentDoc":null,"user":"559ae88c7ae7f80d0096d812","version":"584aeea89588370f00608a3b","updates":[],"slug":"feature-1","__v":0,"createdAt":"2015-07-07T21:33:32.966Z","link_url":"","githubsync":"","api":{"params":[],"url":"","results":{"codes":[]},"settings":"","auth":"required"},"order":10,"childrenPages":[]}

Tokenization and POS Tagging


One of the very first steps in processing text is to break the text apart into tokens and to group those tokens into sentences. We use the word "tokens" rather than "words" because tokens can also be things like:

* punctuation (exclamation points affect sentiment, for instance)
* links (http://...)
* possessive markers
* and the like.

Language Approach

For most European languages, tokenization is fairly straightforward: look for whitespace, look for punctuation, and the like. Each language has its own features, though. German, for instance, makes extensive use of compound words, and for some purposes, such as sentiment, it can be worth tokenizing at the sub-word level. (Illustrative sketches of both ideas follow below.)
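Lexalytics handles tokenization for you, so there is nothing to implement here. Purely as an illustration of the whitespace-and-punctuation approach described above, though, here is a minimal tokenizer sketch in Python; the regular expression and the example sentence are our own, not part of any Lexalytics product:

```python
import re

# A minimal sketch of whitespace/punctuation tokenization for European
# languages. The pattern is illustrative only: it keeps links and
# possessive markers as their own tokens, as described above.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+      # links survive as single tokens
  | \w+               # runs of letters, digits, and underscores
  | 's                # possessive marker (simplified)
  | [^\w\s]           # any other single punctuation character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Lexalytics's tagger handles http://example.com and punctuation!"))
# ['Lexalytics', "'s", 'tagger', 'handles', 'http://example.com',
#  'and', 'punctuation', '!']
```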
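Sub-word tokenization of German compounds can be sketched with a toy greedy splitter over a small lexicon; real systems use statistical models and far larger vocabularies, and the word list below is our own invention for illustration:

```python
# A toy dictionary-based compound splitter for German, sketching the
# idea of sub-word tokenization. The tiny lexicon is hypothetical.
LEXICON = {"haft", "pflicht", "versicherung"}

def split_compound(word, lexicon=LEXICON):
    """Greedily split a lowercased compound into known lexicon words."""
    parts, rest = [], word.lower()
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in lexicon:
                parts.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return [word]  # no full split found; keep the word whole
    return parts

print(split_compound("Haftpflichtversicherung"))
# ['haft', 'pflicht', 'versicherung']
```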
Some languages, such as Chinese, have no space breaks between words, and tokenizing those languages requires more sophisticated statistical models. Lexalytics has developed tokenization models for all of our supported languages.

Parts of Speech

Most NLP and text mining tools make use not just of a bucket of tokens but also of their parts of speech. Knowing what part of speech a token is makes it more useful: proper nouns ("Lexalytics") are more likely to be mentions of a person, place, or company; adjectives ("terrible") are more likely to be sentiment phrases; and so on. In most languages, a single word can belong to multiple parts of speech depending on context: "Love makes the world go round" uses "love" as a noun, while "I love NLP" uses it as a verb. Determining the part of speech for a token therefore requires evaluating the context the word appears in.

Lexalytics has developed POS tagging models for most of its supported languages, and returns POS tags along with the text output if desired. Our set of POS tags is an extension of the Penn Treebank set of POS tags.
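The noun/verb ambiguity above is easy to see with any Penn Treebank-style tagger. Here is a short sketch using the open-source NLTK library, which also uses Penn Treebank tags; this illustrates the concept, not the Lexalytics models themselves, and assumes NLTK's standard tokenizer and tagger models are available:

```python
# Context-dependent POS tagging with NLTK's Penn Treebank-style tagger
# (an open-source stand-in, not the Lexalytics tagger).
import nltk

# One-time model downloads for the tokenizer and tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Love makes the world go round", "I love NLP"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))

# Expect a noun tag (NN) for "Love" in the first sentence and a verb
# tag (VBP) for "love" in the second -- the same word, tagged
# differently depending on context.
```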