오늘도 데이터: Breast cancer data set (Wisconsin breast cancer dataset)

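Manufacturing, like finance and medicine, has plenty of data that lends itself to classification. The data set introduced here is the Wisconsin breast cancer data set that ships with scikit-learn. For each of 10 features of breast cancer cells, it records the mean, the standard error, and the "worst" value (the mean of the three largest values).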

The goal of the analysis is to classify each sample in the data set as a malignant tumor or a benign tumor.

1. The breast cancer data set

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:

- radius (mean of distances from center to points on the perimeter)

- texture (standard deviation of gray-scale values)

- perimeter

- area

- smoothness (local variation in radius lengths)

- compactness (perimeter^2 / area - 1.0)

- concavity (severity of concave portions of the contour)

- concave points (number of concave portions of the contour)

- symmetry

- fractal dimension ("coastline approximation" - 1)

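There are 30 independent variables (predictive attributes), all derived from the ten measurements listed above. In other words, the task is to use these features to classify a tumor as malignant or benign.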

Number of observations: 569
Predictive attributes (independent variables): 30
target (response variable): classified as 1 or 0
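To see how the 10 measurements turn into 30 attributes, the feature names can be grouped by statistic. This is a minimal sketch; the variable name `groups` is only for illustration, and it relies on scikit-learn listing the 10 mean features first, then the 10 standard-error features, then the 10 "worst" features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# 30 feature names: 10 "mean ...", then 10 "... error", then 10 "worst ...".
groups = np.array(cancer.feature_names).reshape(3, 10)

for row in groups:
    print(list(row))
```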

2. Inspecting the data in Python

1) Loading the data in Python

import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

2) Checking the shape of the input data

print(cancer.data.shape, cancer.target.shape)

(569, 30) (569,)

3) Checking the target data

    np.unique(cancer.target, return_counts=True)

   (array([0, 1]), array([212, 357]))

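   The target data consists of 1 (the positive class) and 0 (the negative class): 212 samples are labeled 0 and 357 are labeled 1. Note that in scikit-learn's encoding of this data set (cancer.target_names), 0 is malignant and 1 is benign.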
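The counts above can be tied to the class names that scikit-learn stores with the data set. A minimal sketch, where `label_counts` is a name introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# np.bincount counts how often each integer label (0, 1) occurs.
counts = np.bincount(cancer.target)

# target_names[i] is the class name for label i.
label_counts = {str(name): int(n) for name, n in zip(cancer.target_names, counts)}
print(label_counts)  # {'malignant': 212, 'benign': 357}
```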

 4) Storing the input data and the target in x and y

    x = cancer.data
    y = cancer.target

 5) Splitting into a training set (train set) and a test set


   The train set to test set ratio is typically 8:2 or 7:3, and the positive and negative classes should appear in the same proportion in both sets.

  Splitting the data set with scikit-learn
  from sklearn.model_selection import train_test_split
  x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.2, random_state=42)

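Parameter descriptions
stratify = y
     Keeps the class ratio the same in both parts when the data is split
test_size = 0.2
      Splits the data into training and test sets at 8:2; the default is 7.5:2.5 (test_size = 0.25)
random_state = 42
      Fixes the random shuffle so that the split is the same on every run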
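The effect of these parameters can be checked directly. The sketch below repeats the split and compares the share of label 1 in the full data, the training set, and the test set; with stratify=y the three values should be nearly equal.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x, y = cancer.data, cancer.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, stratify=y, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)  # (455, 30) (114, 30)

# Because the labels are 0/1, the mean is the fraction of samples labeled 1.
for name, part in [("all", y), ("train", y_train), ("test", y_test)]:
    print(name, round(float(np.mean(part)), 3))
```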

Implementing logistic regression
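The implementation itself is not reproduced in this post. As a rough illustration only, and not the code from the post or from the reference book, logistic regression on this split could be sketched with NumPy as follows. The function names `sigmoid` and `fit_logistic`, the learning rate, the epoch count, and the standardization step are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target,
    test_size=0.2, random_state=42)

# Assumption: standardize using training-set statistics so exp() stays stable.
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_train_s = (x_train - mean) / std
x_test_s = (x_test - mean) / std

def sigmoid(z):
    # Clip z to avoid overflow in exp for extreme inputs.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def fit_logistic(x, y, lr=0.1, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        err = sigmoid(x @ w + b) - y       # predicted probability minus label
        w -= lr * (x.T @ err) / len(y)     # gradient with respect to the weights
        b -= lr * err.mean()               # gradient with respect to the bias
    return w, b

w, b = fit_logistic(x_train_s, y_train)

# Label 1 is "benign" in scikit-learn's encoding, so this predicts benign vs. malignant.
pred = (sigmoid(x_test_s @ w + b) > 0.5).astype(int)
accuracy = float((pred == y_test).mean())
print(accuracy)
```

The split is seeded and the weights start at zero, so repeated runs give the same accuracy.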

  ※  For the derivation of the equations, see the post "분류하는 뉴런 part 1" (The classifying neuron, part 1).

     
  


※ Reference book: 정직하게 코딩하며 배우는 딥러닝 입문 (이지스퍼블리싱)
