Transformers in NLP: spacy-transformers
PyDictionary can look up a word's synonyms and translations.
from PyDictionary import PyDictionary
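vocabulary provides translations, synonyms, meanings, and similar features.
spaCy models are hosted at https://github.com/explosion/spacy-models, and each model is described in detail at https://spacy.io/usage/models
To install spaCy and download a model, run pip install -U spacy, then python -m spacy download en_core_web_sm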
Basic operations
import spacy
import re
What the POS tag abbreviations mean
# | POS Tag | Description | Example |
---|---|---|---|
0 | CC | coordinating conjunction | and |
1 | CD | cardinal number | 1, third |
2 | DT | determiner | the |
3 | EX | existential there | there is |
4 | FW | foreign word | d'oeuvre |
5 | IN | preposition or subordinating conjunction | in, of, like |
6 | JJ | adjective | big |
7 | JJR | adjective, comparative | bigger |
8 | JJS | adjective, superlative | biggest |
9 | LS | list marker | 1) |
10 | MD | modal | could, will |
11 | NN | noun, singular or mass | door |
12 | NNS | noun plural | doors |
13 | NNP | proper noun, singular | John |
14 | NNPS | proper noun, plural | Vikings |
15 | PDT | predeterminer | both the boys |
16 | POS | possessive ending | friend's |
17 | PRP | personal pronoun | I, he, it |
18 | PRP$ | possessive pronoun | my, his |
19 | RB | adverb | however, usually, naturally, here, good |
20 | RBR | adverb, comparative | better |
21 | RBS | adverb, superlative | best |
22 | RP | particle | give up |
23 | TO | to | to go, to him |
24 | UH | interjection | uhhuhhuhh |
25 | VB | verb, base form | take |
26 | VBD | verb, past tense | took |
27 | VBG | verb, gerund or present participle | taking |
28 | VBN | verb, past participle | taken |
29 | VBP | verb, singular present, non-3rd person | take |
30 | VBZ | verb, 3rd person sing. present | takes |
31 | WDT | wh-determiner | which |
32 | WP | wh-pronoun | who, what |
33 | WP$ | possessive wh-pronoun | whose |
34 | WRB | wh-adverb | where, when |
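The corresponding table of dependency labels: https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
Morphology: a word's root stays the same while prefixes and suffixes are added to modify it. For example, a verb has a past tense, a present progressive, and a past perfect form, and all of them are considered to share one root.
Each sentence (an nlp object) contains words, and each word has a morphology object.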
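The POS table above can also be kept as a small lookup in code. The following is a minimal sketch in plain Python; `PENN_TAGS`, `describe_tag`, and `coarse_class` are names made up for this illustration and are not part of spaCy.

```python
# Hypothetical helper: the tag table above as a plain dictionary.
PENN_TAGS = {
    "CC": "coordinating conjunction", "CD": "cardinal number",
    "DT": "determiner", "EX": "existential there", "FW": "foreign word",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective", "JJR": "adjective, comparative",
    "JJS": "adjective, superlative", "LS": "list marker", "MD": "modal",
    "NN": "noun, singular or mass", "NNS": "noun plural",
    "NNP": "proper noun, singular", "NNPS": "proper noun, plural",
    "PDT": "predeterminer", "POS": "possessive ending",
    "PRP": "personal pronoun", "PRP$": "possessive pronoun",
    "RB": "adverb", "RBR": "adverb, comparative",
    "RBS": "adverb, superlative", "RP": "particle", "TO": "to",
    "UH": "interjection", "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, singular present, non-3rd person",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner", "WP": "wh-pronoun",
    "WP$": "possessive wh-pronoun", "WRB": "wh-adverb",
}


def describe_tag(tag):
    """Return the description for a tag, or a placeholder if unknown."""
    return PENN_TAGS.get(tag, "unknown tag")


def coarse_class(tag):
    """Group fine-grained tags by prefix (a simplification for illustration)."""
    for prefix, name in (("NN", "noun"), ("VB", "verb"),
                         ("JJ", "adjective"), ("RB", "adverb")):
        if tag.startswith(prefix):
            return name
    return "other"
```

When spaCy is installed, `spacy.explain("VBZ")` returns a similar description directly, so a hand-written table like this is mainly useful for offline reference.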
from spacy.morphology import Morphology
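To make the idea of a shared root concrete, here is a toy sketch in plain Python. It is not how spaCy analyzes morphology; `naive_root` and `SUFFIXES` are hypothetical names, and the function only strips a few regular English suffixes.

```python
# Toy illustration only: strip a few regular suffixes to expose a shared root.
SUFFIXES = ("ing", "ed", "s")


def naive_root(word, min_stem=3):
    """Return the word without a trailing suffix, if the remainder is long enough."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        stem = lowered[: -len(suffix)]
        if lowered.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return lowered


# All regular forms of "walk" collapse to the same root.
roots = {naive_root(w) for w in ["walk", "walks", "walked", "walking"]}
```

Irregular forms such as "took" are left unchanged by this sketch, and it mangles words like "taking". In spaCy, the lemma of a token is available as `token.lemma_` and its morphological features as `token.morph`.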
kindred
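This is an NLP library built specifically for processing medical literature.
kindred provides a function for downloading articles by PMID (that is, from PubMed), kindred.pubtator.load(pmids), but it is unreliable in practice and fails with RuntimeError: Unable to download PubTator data after 3 retries.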
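Since the download can fail with a RuntimeError, one workaround is to wrap the call in an outer retry loop with a growing pause. The sketch below is generic and does not import kindred; `load_with_retries`, `fetch`, `attempts`, and `delay_seconds` are hypothetical names chosen for this example.

```python
import time


def load_with_retries(fetch, pmids, attempts=3, delay_seconds=1.0):
    """Call fetch(pmids), retrying on RuntimeError with a linearly growing pause."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(pmids)
        except RuntimeError as error:
            last_error = error
            # Wait a little longer before each new attempt, but not after the last.
            if attempt < attempts:
                time.sleep(delay_seconds * attempt)
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error
```

It would be called as `load_with_retries(kindred.pubtator.load, pmids)`. This only helps if the failure is temporary; if the PubTator service itself is unavailable, the wrapper will still give up.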
textacy
Data preprocessing module
from textacy import preprocessing
The main parts are: information extraction, text statistics, document similarity, and data augmentation.
The extract part interests me most, so here are some notes on its API.
textacy.extract.triples.subject_verb_object_triples(doc)  # finds the subject, verb, and object in a sentence
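A topic model is a statistical model that extracts abstract topics from text and measures how similar documents are in terms of those topics. textacy uses three sklearn models, LSA, LDA, and NMF, and applies them to vectorized text.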
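The topic-modeling step described above can be sketched with scikit-learn directly, without textacy's own wrapper. This is a minimal example under that assumption: the four-sentence corpus and the `fit_topics` helper are invented for illustration, and only NMF is shown out of the three models.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus: two sentences about pets, two about finance.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]


def fit_topics(texts, n_topics=2, top_n=3):
    """Vectorize texts with TF-IDF, fit NMF, and list the top terms per topic."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)
    model = NMF(n_components=n_topics, random_state=0, max_iter=500)
    doc_topic = model.fit_transform(matrix)  # shape: (n_docs, n_topics)
    # Invert the vocabulary so column indices map back to terms.
    index_to_term = {int(i): term for term, i in vectorizer.vocabulary_.items()}
    top_terms = []
    for weights in model.components_:
        best = weights.argsort()[::-1][:top_n]
        top_terms.append([index_to_term[int(i)] for i in best])
    return doc_topic, top_terms


doc_topic, top_terms = fit_topics(docs)
```

Each row of `doc_topic` gives one document's weight on each topic, so documents can be compared by these rows. Swapping `NMF` for `TruncatedSVD` (LSA) or `LatentDirichletAllocation` (LDA) from `sklearn.decomposition` follows the same pattern, although LDA is usually fitted on raw counts rather than TF-IDF weights.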