GyuYoung’s Blog

Text mining

2021.3.26 2021.3.27 Data_Science/Text_mining 530 3 mins

Natural language processing
KoNLPy: Korean language processing
scikit-learn feature_extraction.text: document preprocessing
apply
Loading the review data and the dictionary data
k = []
with open('data/영화 기생충_review.txt','r') as f:
    for _ in f.

Regular Expressions

2021.3.26 2021.3.27 Data_Science/Text_mining 1154 6 mins

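1. Understanding and mastering regular expressions (re)
Regular expression: supports searching for, replacing, and removing strings that match a specific pattern.
Finding patterns without regular expressions (rule-based matching) tends to be incomplete or costly.
e.g. validating email formats, validating phone-number formats, detecting strings made up only of digits.
Raw string: prefixing a string literal with r keeps the string exactly as written, so backslashes are not interpreted as escape sequences.
Basic patterns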
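The search, replace, and remove operations and the raw-string behaviour described above can be sketched with Python's standard `re` module. The three patterns and the helper names (`is_email`, `is_phone`, `is_digits`) are hypothetical and deliberately simplified for illustration; they are not taken from the original post, and real-world validation would need stricter rules.

```python
import re

# Hypothetical, simplified patterns (assumptions for illustration only).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\d{2,3}-\d{3,4}-\d{4}")
DIGITS = re.compile(r"\d+")


def is_email(text):
    """True if the whole string looks like an email address."""
    return EMAIL.fullmatch(text) is not None


def is_phone(text):
    """True if the whole string looks like a hyphenated phone number."""
    return PHONE.fullmatch(text) is not None


def is_digits(text):
    """True if the whole string consists only of digits."""
    return DIGITS.fullmatch(text) is not None


sentence = "Call 010-1234-5678 or mail kim@example.com"

found = PHONE.search(sentence).group()    # search: first matching substring
masked = PHONE.sub("<PHONE>", sentence)   # replace: swap the match for a tag
removed = EMAIL.sub("", sentence).strip() # remove: replace the match with ""

# Raw string: r"\n" is a backslash followed by "n" (two characters),
# while "\n" is a single newline character.
raw_length = len(r"\n")
plain_length = len("\n")

print(found)
print(masked)
print(removed)
print(raw_length, plain_length)
```

Compiling each pattern once with `re.compile` keeps the helpers short, and `fullmatch` is used for validation so that a partial match inside a longer string does not count as valid.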

Ensemble

2021.3.26 2021.3.26 Data_Science/Machine_Learning 1534 8 mins

Using the RandomForest data
feature_columns = list(data.columns.difference(['target']))
X = data[feature_columns]
y = after_mapping_target
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 42)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)
(49502, 93) (12376, 93) (49502,) (12376,)
1.
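The printed shapes follow from the 80/20 split. A minimal sketch of the arithmetic, assuming the test share is rounded up and the training set takes the remainder (this is consistent with the output above); `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Total row count recovered from the shapes printed in the excerpt.
n_rows = 49502 + 12376

n_train, n_test = split_sizes(n_rows, 0.2)
print(n_train, n_test)
```

With 61878 rows, 20% is 12375.6, which rounds up to 12376 test rows and leaves 49502 training rows; the 93 feature columns are unaffected by the split.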

RandomForest

2021.3.26 2021.3.26 Data_Science/Machine_Learning 504 3 mins

import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load the data
data = pd.read_csv("./data/otto_train.csv") # Product Category
data.head() # Inspect the data

Neural Network Models

2021.3.26 2021.3.26 Data_Science/Machine_Learning 541 3 mins

Comparing performance by model complexity
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.

Decision Tree

2021.3.26 2021.3.26 Data_Science/Machine_Learning 90 1 min

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict([[1, 1]])
from sklearn.datasets import load_iris
from sklearn import tree
iris=load_iris()
Building and visualizing a decision tree
Building the tree
clf=tree.

SVM

2021.3.26 2021.3.26 Data_Science/Machine_Learning 361 2 mins

Support Vector Machine
Classifies and predicts by finding the decision boundary that maximizes the margin between classes
import numpy as np
import matplotlib.pyplot as plt
Importing functions
from sklearn import svm, datasets
iris=datasets.