Transformers in NLP: spacy-transformers
PyDictionary can look up a word's synonyms and translations.

    from PyDictionary import PyDictionary
vocabulary provides translations, synonyms, meanings, and similar lookups.
spaCy models are hosted at https://github.com/explosion/spacy-models, and each model is described in detail at https://spacy.io/usage/models
Install spaCy and download a model:

    pip install -U spacy
    python -m spacy download en_core_web_sm
Basic operations

    import spacy
    import re
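As a rough sketch of the basic workflow, the following assumes spaCy v3 with the en_core_web_sm model already downloaded. The sample sentence and the helper names `token_rows` and `print_tokens` are illustrative and not part of the original notes.

```python
def token_rows(doc):
    """Return (text, lemma, coarse POS, fine-grained tag, dependency) per token."""
    return [(t.text, t.lemma_, t.pos_, t.tag_, t.dep_) for t in doc]


def print_tokens(text="John took the doors off the house."):
    # spaCy is imported lazily so token_rows can be reused without it.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
    for row in token_rows(nlp(text)):
        print(row)
```

Calling `print_tokens()` prints one tuple per token. `tag_` holds the fine-grained tag (for example VBD), while `pos_` holds the coarse universal category.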
Meaning of the POS tag abbreviations
| | POS Tag | Description | Example |
|---|---|---|---|
| 0 | CC | coordinating conjunction | and |
| 1 | CD | cardinal number | 1, third |
| 2 | DT | determiner | the |
| 3 | EX | existential there | there is |
| 4 | FW | foreign word | d'oeuvre |
| 5 | IN | preposition or subordinating conjunction | in, of, like |
| 6 | JJ | adjective | big |
| 7 | JJR | adjective, comparative | bigger |
| 8 | JJS | adjective, superlative | biggest |
| 9 | LS | list marker | 1) |
| 10 | MD | modal | could, will |
| 11 | NN | noun, singular or mass | door |
| 12 | NNS | noun plural | doors |
| 13 | NNP | proper noun, singular | John |
| 14 | NNPS | proper noun, plural | Vikings |
| 15 | PDT | predeterminer | both the boys |
| 16 | POS | possessive ending | friend's |
| 17 | PRP | personal pronoun | I, he, it |
| 18 | PRP$ | possessive pronoun | my, his |
| 19 | RB | adverb | however, usually, naturally, here, good |
| 20 | RBR | adverb, comparative | better |
| 21 | RBS | adverb, superlative | best |
| 22 | RP | particle | give up |
| 23 | TO | to | to go, to him |
| 24 | UH | interjection | uhhuhhuhh |
| 25 | VB | verb, base form | take |
| 26 | VBD | verb, past tense | took |
| 27 | VBG | verb, gerund or present participle | taking |
| 28 | VBN | verb, past participle | taken |
| 29 | VBP | verb, non-3rd person sing. present | take |
| 30 | VBZ | verb, 3rd person sing. present | takes |
| 31 | WDT | wh-determiner | which |
| 32 | WP | wh-pronoun | who, what |
| 33 | WP$ | possessive wh-pronoun | whose |
| 34 | WRB | wh-adverb | where, when |
The corresponding table for dependency labels: https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
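To look tags up programmatically, here is a small sketch. The dictionary copies a handful of rows from the POS table above, and `describe_tag` and `explain_with_spacy` are hypothetical helper names. `spacy.explain` is spaCy's built-in glossary lookup, and it accepts dependency labels as well as POS tags.

```python
# A few rows copied from the POS table above (tag -> description).
PENN_TAGS = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun plural",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person sing. present",
}


def describe_tag(tag, table=PENN_TAGS):
    """Return the description for a tag, or a placeholder if it is not listed."""
    return table.get(tag, "unknown tag")


def explain_with_spacy(label):
    # Lazy import: only needed when spaCy's own glossary is wanted.
    import spacy

    return spacy.explain(label)  # returns None for labels spaCy does not know
```

The local dictionary is handy for quick notes; `explain_with_spacy("nsubj")` or `explain_with_spacy("VBZ")` is the better choice when spaCy is installed, since its glossary is more complete.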
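Morphology: a word's root stays the same while prefixes and suffixes modify it. For example, a verb has a past tense, a present progressive, and a past perfect form, and all of them are considered to share one root.
Each sentence (the object returned by nlp) contains words, and each word has a morphology object.

    from spacy.morphology import Morphology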
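The idea that different inflected forms share one root can be checked by grouping tokens by lemma. This is a sketch assuming spaCy v3, where each token exposes `token.lemma_` and `token.morph`; the helper names and the sample sentence are made up for illustration.

```python
from collections import defaultdict


def group_by_lemma(doc):
    """Map each lemma to the (surface form, morphological features) pairs seen."""
    groups = defaultdict(list)
    for token in doc:
        # str(token.morph) looks like "Tense=Past|VerbForm=Fin" in spaCy v3.
        groups[token.lemma_].append((token.text, str(token.morph)))
    return dict(groups)


def demo(text="She takes notes. He took notes. They have taken notes."):
    # Lazy import so group_by_lemma can be used without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
    return group_by_lemma(nlp(text))
```

With a trained pipeline, the forms "takes", "took", and "taken" would be expected to land under the same lemma, each with different morphological features.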
kindred
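This is an NLP library dedicated to processing medical literature.
kindred provides a function for downloading articles by PMID, kindred.pubtator.load(pmids), but it is unreliable and fails with RuntimeError: Unable to download PubTator data after 3 retries, that is, pubmed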
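Since the download can fail, one defensive pattern is to catch the error so that a script can carry on. This is only a sketch: `kindred.pubtator.load` and the RuntimeError come from the note above, while `load_pubtator` and its `loader` parameter are hypothetical names added here so the logic can be exercised without network access. It does not fix the underlying download problem.

```python
def load_pubtator(pmids, loader=None):
    """Return the downloaded corpus, or None if the download raises RuntimeError."""
    if loader is None:
        # Default to the real kindred function; imported lazily.
        import kindred

        loader = kindred.pubtator.load
    try:
        return loader(pmids)
    except RuntimeError as err:
        print(f"PubTator download failed for {pmids}: {err}")
        return None
```

Returning None keeps a batch job alive; the caller still has to decide whether to skip those PMIDs or fetch the text some other way.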
textacy
Data preprocessing module

    from textacy import preprocessing
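A minimal sketch of chaining preprocessing steps. It assumes textacy 0.11 or later, where the functions live under `preprocessing.replace`, `preprocessing.remove`, and `preprocessing.normalize` (older releases use different names). `apply_steps` and `textacy_steps` are hypothetical helpers, not part of textacy.

```python
def apply_steps(text, steps):
    """Run the text through each cleaning function in order."""
    for step in steps:
        text = step(text)
    return text


def textacy_steps():
    # Lazy import so apply_steps stays usable without textacy installed.
    from textacy import preprocessing

    return [
        preprocessing.replace.urls,          # swap URLs for a placeholder token
        preprocessing.remove.punctuation,    # replace punctuation with spaces
        preprocessing.normalize.whitespace,  # collapse repeated whitespace
    ]
```

A call such as `apply_steps(raw_text, textacy_steps())` would then clean a string before it is handed to spaCy. Order matters: URLs are replaced first so that punctuation removal does not break them apart.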
The main parts are: information extraction, text statistics, document similarity, and data augmentation.
I am most interested in the extract part, so here is its API:

    textacy.extract.triples.subject_verb_object_triples(doc)  # finds the subject, verb, and object in a sentence
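A sketch of how the triples might be consumed, assuming textacy 0.11 or later (where each triple has `subject`, `verb`, and `object` fields holding lists of tokens) and the en_core_web_sm model. `triples_as_text` and `extract_svo` are illustrative names.

```python
def triples_as_text(triples):
    """Turn SVO triples (each part a list of tokens) into tuples of plain strings."""
    return [
        tuple(
            " ".join(tok.text for tok in part)
            for part in (t.subject, t.verb, t.object)
        )
        for t in triples
    ]


def extract_svo(text):
    # Lazy imports so triples_as_text can be reused without spaCy or textacy.
    import spacy
    from textacy.extract.triples import subject_verb_object_triples

    nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
    return triples_as_text(subject_verb_object_triples(nlp(text)))
```

The extraction relies on the dependency parse, so the quality of the triples depends on the model; complex or passive sentences may yield nothing.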
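A topic model is a statistical model that extracts abstract topics from text and computes similarity over them. textacy uses scikit-learn's LSA, LDA, and NMF models and applies them to vectorized text.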
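To make the vectorize-then-factorize idea concrete, here is a sketch that calls scikit-learn directly rather than going through textacy's wrapper. It assumes scikit-learn 1.0 or later (for `get_feature_names_out`); the function names and parameter defaults are chosen for illustration only.

```python
def top_terms(components, vocabulary, n=3):
    """For each topic (a row of term weights), return its n heaviest terms."""
    result = []
    for row in components:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in order[:n]])
    return result


def fit_nmf_topics(texts, n_topics=2, n_terms=3):
    # Lazy imports so top_terms can be reused without scikit-learn.
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)  # documents x terms
    model = NMF(n_components=n_topics, random_state=0)
    model.fit(matrix)  # model.components_ is topics x terms
    return top_terms(model.components_, vectorizer.get_feature_names_out(), n_terms)
```

Swapping `NMF` for `TruncatedSVD` (LSA) or `LatentDirichletAllocation` (LDA) keeps the same shape of code, since all three expose `components_` after fitting.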