Analytics

Document Classification with Transformers

I built a model that classifies text as negative or positive using Hugging Face Transformers. The training data looks like this:

    query     title                     label
    negaposi  この映画は本当に面白い    0

I fed this in for training, using Tohoku University's Japanese BERT model as the pretrained model and fine-tuning it for sequence classification. The modeling itself was run on Google Colaboratory.

Training

    !pip install transformers[ja]==4.3.3 torch==1.9 sentencepiece==0.1.91

    from google.colab import drive
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from transformers import BertJapaneseTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
    import torch

    # Mount Google Drive, which holds the training CSV
    drive.mount('/content/drive')

    training_data = pd.read_csv('/content/drive/MyDrive/Texts/negaposi-sentence.csv')
    training_data.head()

    # Number of distinct queries and number of rows per label
    print(len(training_data["query"].unique()))
    training_data[["title", "label"]].groupby("label").count()

    # Hold out half of the rows for validation
    train_queries, val_queries, train_docs, val_docs, train_labels, val_labels = train_test_split(
        training_data["query"].tolist(),
        training_data["title"].tolist(),
        training_data["label"].tolist(),
        test_size=.5
    )

    # Encode (query, title) pairs, padded or truncated to 128 tokens
    model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'
    tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
    train_encodings = tokenizer(train_queries, train_docs, truncation=True, padding='max_length', max_length=128)
    val_encodings = tokenizer(val_queries, val_docs, truncation=True, padding='max_length', max_length=128)

    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Freeze the BERT encoder so that only the classification head is trained
    for param in model.base_model.parameters():
        param.requires_grad = False

    training_args = TrainingArguments(
        logging_steps=10,
        output_dir='models',
        evaluation_strategy="epoch",
        num_train_epochs=2000,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        warmup_steps=500,
        weight_decay=0.01,
        save_total_limit=1,
    )

    # NOTE: train_dataset and val_dataset are not defined in this excerpt.
    # They must wrap train_encodings/train_labels and val_encodings/val_labels
    # as datasets before this point.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )
    trainer.train()
    trainer.save_model(output_dir='/content/drive/MyDrive/Models/sentiment-mining4')

Inference

    !pip install transformers[ja]==4.3.3 torch==1.9 sentencepiece==0.1.91

    from google.colab import drive
    from transformers import BertJapaneseTokenizer, BertForSequenceClassification, pipeline

    drive.mount('/content/drive')

    # The tokenizer comes from the base model, the weights from the fine-tuned checkpoint
    model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'
    tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained('/content/drive/MyDrive/Models/sentiment-mining4')

    nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    nlp("この本は興味深い")

...
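The training code passes `train_dataset` and `val_dataset` to `Trainer`, but the excerpt does not show how they are built. The following is a minimal sketch of one way to wrap tokenizer output and labels as a map-style dataset. The class name `EncodedPairDataset` and the toy token IDs are hypothetical, not from the original post. It returns plain Python lists and relies on the assumption that the default data collator in Transformers converts them to tensors; subclassing `torch.utils.data.Dataset` and returning tensors is the more common pattern and may be what the original notebook did.

```python
class EncodedPairDataset:
    """Map-style dataset over tokenizer output plus integer labels (hypothetical helper)."""

    def __init__(self, encodings, labels):
        # encodings: mapping such as {"input_ids": [[...], ...], "attention_mask": [[...], ...]}
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Pick the idx-th row of every encoded field and attach its label
        item = {name: values[idx] for name, values in self.encodings.items()}
        item["label"] = self.labels[idx]
        return item


# Hand-made stand-in for tokenizer output so the sketch runs without transformers.
# These token IDs are invented for illustration only.
toy_encodings = {
    "input_ids": [[2, 10, 11, 3], [2, 12, 3, 0]],
    "token_type_ids": [[0, 0, 1, 1], [0, 0, 1, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_labels = [0, 1]

toy_dataset = EncodedPairDataset(toy_encodings, toy_labels)
print(len(toy_dataset), toy_dataset[1])
```

In the notebook this would presumably be used as `train_dataset = EncodedPairDataset(train_encodings, train_labels)`, and likewise for the validation split.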

Feeding 本好き to Azure

I fed 本好き to Azure as well. However, 本 ("book") dominates the list of recognized entities.

    import configparser
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.textanalytics import TextAnalyticsClient

    # Read the endpoint and API key from a local config file
    config = configparser.ConfigParser()
    config.read('azure.config')
    endpoint = config['AZURE']['azure_endpoint']
    key = config['AZURE']['azure_ai_key']

    client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Send the whole text file as a single document
    with open('N4830BU-1.txt', 'r', encoding='utf-8') as ifs:
        documents = [ifs.read()]

    response = client.recognize_entities(documents, language="ja")
    result = [doc for doc in response if not doc.is_error]

    for doc in result:
        for entity in doc.entities:
            print(entity.text, entity.category)

Output:

    プロローグ Organization
    本須　麗乃 Person
    もとすうら Person
    22歳 Quantity
    本 Product
    誰か PersonType
    筆者 PersonType
    本 Product
    本屋 Location
    図書館 Location
    写真集 Product
    外国 Location
    本 Product
    百科事典 Product
    文学全集 Product
    紙 Product
    専門誌 Product
    雑誌 Product
    小説 Product
    ライトノベル Product
    絵本 Product
    日本 Location
    素人が PersonType
    同人誌 Product
    パラ Quantity
    美酒 Product
    図書館 Location
    本 Product
    書庫 Location
    本 Product
    本 Product
    紙 Product
    インク Product
    そこに Location
    本 Product
    本 Product
    書庫 Location
    本 Product
    本 Product
    本 Product
    畳 Product
    ベッド Product
    本 Product
    わたし PersonType
    大地震 Event
    本 Product
    ぇ Person
    司書 PersonType
    大学図書館 Location
    神様 PersonType
    転生 Event
    次 Quantity
    本 Product
    図書館 Location
    司書 PersonType
    本 Product
    司書 PersonType
    本 Product
    本 Product
    本 Product
    本 Product
    紙 Product
    インク Product
    本 Product
    神様 PersonType
    わたし PersonType
    本 Product
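Since the remark above is about how often 本 shows up, tallying the (text, category) pairs makes the skew easier to see than the raw printout does. This is a small sketch using `collections.Counter`. The sample list holds only a handful of pairs copied from the output above, not the full result, and `count_entities` is a hypothetical helper name.

```python
from collections import Counter

# A few (text, category) pairs copied from the printed output; not the full list.
sample_entities = [
    ("本須　麗乃", "Person"),
    ("本", "Product"),
    ("図書館", "Location"),
    ("本", "Product"),
    ("司書", "PersonType"),
    ("本", "Product"),
    ("図書館", "Location"),
]


def count_entities(pairs):
    """Count how many times each (text, category) pair occurs."""
    return Counter(pairs)


counts = count_entities(sample_entities)

# Show the most frequent entities first
for (text, category), n in counts.most_common(3):
    print(text, category, n)
```

With the real response, the pairs could be collected as `(entity.text, entity.category)` inside the existing loop instead of being printed.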

Feeding 本好き to GiNZA

I fed the prologue of 本好きの下克上 to GiNZA to try out entity recognition. The name 本須　麗乃 was chopped apart between 本 and 須. It reminds me of 国家安康・君臣豊楽 (the bell inscription said to split the characters of a name), which is a little disappointing.

    from ginza import *
    import codecs
    import spacy

    nlp = spacy.load("ja_ginza")  # Load the GiNZA model

    ents = []
    with codecs.open('N4830BU-1.txt', 'r', 'utf-8') as text:
        for line in text:
            try:
                doc = nlp(line.strip())
            except Exception:
                # Skip lines that GiNZA fails to parse
                continue
            # Collect the entities found on this line
            ents.extend(doc.ents)

    for ent in ents:
        print(ent.text, ent.label_)

Output:

    須　麗乃 Person
    22歳 Age
    三度 Frequency
    顔 Animal_Part
    ニヨニヨ Doctrine_Method_Other
    一冊 N_Product
    目 Animal_Part
    教育学 Academic
    民俗学 Academic
    数学 Academic
    物理 Academic
    化学 Academic
    生物学 Academic
    芸術 Academic
    体育 Academic
    人類 Mammal
    一冊 N_Product
    日本 Country
    日光 Domestic_Region
    肌 Animal_Part
    司書資格 Position_Vocation
    大学図書館 Facility_Other
    司書 Position_Vocation
    一日 Period_Day
    司書 Position_Vocation
    人間 Mammal

OpenAI Example in Python

Now that I have access to the OpenAI API, I tried the much-talked-about GPT-3.

    import configparser
    import openai

    # Read the API key from a local config file
    config = configparser.ConfigParser()
    config.read('openai.config')
    openai.api_key = config["OPENAI"]["API_KEY"]

    # Ask the davinci engine to continue a Japanese prompt
    response = openai.Completion.create(
        engine="davinci",
        prompt="人類は多くの問題を抱えていた。大量の破壊兵器・増えつづけた人口・国際的なテロ・国家間の極端な貧富の差･･･これら問題を解決する為にある計画が実現に向かう。地球上すべての国家をある1つのコンピュータによって統括しようという大胆な計画。そしてその中央処理装置はMESIAと呼ばれていた。",
        temperature=0.7,
        max_tokens=60,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0
    )
    print(response)

Output:

    {
      "choices": [
        {
          "finish_reason": "length",
          "index": 0,
          "logprobs": null,
          "text": " \u5927\u7fa9\u540d\u5206\u3068\u306f\u4f55\u304b\uff1f\u9053\u5fb3\u7684\u306b\u6b63\u5f53\u306a\u7406\u7531\u3068\u306f\uff1f\u7b54\u3048\u306f\u305f\u30601\u3064\u3002\u300c\u65b0\u3057\u3044\u751f\u547d\u4f53\u3092"
        }
      ],
      "created": 1616638002,
      "id": "cmpl-2h9gQIewmSimGqJ6AZgDAhqBLCSEh",
      "model": "davinci:2020-05-03",
      "object": "text_completion"
    }

Extracting the generated text:

    response_text = response["choices"][0]["text"]
    response_text

    ' 大義名分とは何か？道徳的に正当な理由とは？答えはただ1つ。「新しい生命体を'
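The printed response shows the completion as `\uXXXX` escapes, which is simply how JSON is serialized when non-ASCII output is disabled. The sketch below uses only the standard `json` module to show that parsing recovers the Japanese text and that `ensure_ascii=False` gives a readable dump. The `raw` string is a shortened copy of the response above (the `text` field is cut to its first sentence), and the variable names are my own.

```python
import json

# Shortened copy of the response printed above; the "text" field is truncated
# to its first sentence to keep the example small.
raw = r'''
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": " \u5927\u7fa9\u540d\u5206\u3068\u306f\u4f55\u304b\uff1f"
    }
  ],
  "model": "davinci:2020-05-03",
  "object": "text_completion"
}
'''

parsed = json.loads(raw)

# Parsing turns the \uXXXX escapes back into ordinary characters
completion_text = parsed["choices"][0]["text"]
print(completion_text)

# ensure_ascii=False keeps Japanese readable when dumping the structure again
readable = json.dumps(parsed, ensure_ascii=False, indent=2)
print(readable)
```

A `finish_reason` of `"length"` indicates the completion stopped because it hit `max_tokens`, which is why the generated sentence ends mid-phrase.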

Calling Azure Text Analytics from Python

    import configparser
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.textanalytics import TextAnalyticsClient
    import seaborn as sns
    import pandas as pd

    sns.set_style('white')

    # Read the endpoint and API key from a local config file
    config = configparser.ConfigParser()
    config.read('azure.config')
    endpoint = config['AZURE']['azure_endpoint']
    key = config['AZURE']['azure_ai_key']

    client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    documents = [
        "iPhoneのマップやばいよ"
    ]
    result = client.analyze_sentiment(documents)
    docs = [doc for doc in result if not doc.is_error]
    doc = docs[0]

    # Turn the per-label confidence scores into a pandas Series
    confidence_scores = {label: score for label, score in doc.confidence_scores.items()}
    sentiment = pd.Series(confidence_scores)
    sentiment

    positive    0.07
    neutral     0.90
    negative    0.03
    dtype: float64

    sentiment.plot.bar()

    <AxesSubplot:>
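The Azure snippets read `azure.config` with `configparser` but never show the file itself. Judging from the keys the code looks up, it would need an `[AZURE]` section with `azure_endpoint` and `azure_ai_key`. The sketch below parses such a layout from a string. The endpoint URL and key are placeholders, and `load_azure_settings` is a hypothetical helper, not part of the original code.

```python
import configparser

# Assumed layout of azure.config, inferred from the keys the code reads.
# Both values are placeholders, not real credentials.
SAMPLE_CONFIG = """
[AZURE]
azure_endpoint = https://example.invalid/
azure_ai_key = 00000000000000000000000000000000
"""


def load_azure_settings(text):
    """Parse config text and return (endpoint, key) from the AZURE section."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["AZURE"]
    return section["azure_endpoint"], section["azure_ai_key"]


endpoint, key = load_azure_settings(SAMPLE_CONFIG)
print(endpoint)
```

Keeping the key in a separate file like this, and out of version control, avoids pasting credentials into the notebook.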