Real-Price Registration Data

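Background

You are a junior employee at a construction company in Taichung. Your boss suddenly wants to know housing market prices across Taiwan and is especially interested in investing in the big cities.
So you have been ordered to hand in a housing market report before the end of the day.

On the verge of tears, the junior employee just wants to get off work on time. Arrr (R)~~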

Data analysis workflow

Data collection
Data cleaning and processing
Statistics and analysis
Visualization
Report generation

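1. Data collection

Once you have the data, always start by understanding what it records:
column names, time range, number of records, and so on.
Data source
- This dataset was downloaded from the government open data platform

Reading the file

No data, no analysis, so the first step is getting the data in!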

data <- read.csv("/Users/pineapple/Documents/DSP/DSP集訓班/transaction.csv")
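The read step can be tried without the real file. The sketch below is a minimal illustration, assuming a made-up two-row CSV written to a temporary file; the values, the name `sample_data`, and the `fileEncoding` setting are assumptions, not properties of transaction.csv.

```r
# Hypothetical two-row sample standing in for transaction.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("city,trac_year,trac_month,price_total",
             "Taipei,102,1,12000000",
             "Taichung,102,7,6500000"), tmp)

# fileEncoding is an assumption: set it to match how the real file was saved.
# stringsAsFactors = TRUE turns text columns into factors, which is what
# lets summary() and table() show per-level counts (R >= 4.0 defaults to FALSE).
sample_data <- read.csv(tmp, fileEncoding = "UTF-8", stringsAsFactors = TRUE)

names(sample_data)  # column names
head(sample_data)   # first few rows

unlink(tmp)         # remove the temporary file
```

If Chinese text shows up garbled after reading the real file, the encoding argument is usually the first thing to check.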

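Column descriptions

Time range

You need to know the time range the data covers, because it determines:
- the period the analysis results apply to
- whether and how the data can be joined with other data
Looking at the columns, the time-related ones are trac_year and trac_month.

Year

All records are transactions from ROC (Minguo) year 102, i.e. 2013.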

summary(data$trac_year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     102     102     102     102     102     102

Month

The records span January through December of ROC year 102.

summary(data$trac_month)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   7.000   6.698  10.000  12.000
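When the data has to be lined up with sources that use Gregorian dates, the ROC year needs converting first. A minimal sketch, assuming three made-up records; the vectors below only mirror the trac_year and trac_month columns and are not taken from the real data.

```r
# Hypothetical values mirroring the trac_year / trac_month columns
trac_year  <- c(102, 102, 102)
trac_month <- c(1, 7, 12)

# ROC (Minguo) year + 1911 = Gregorian year
greg_year <- trac_year + 1911

range(greg_year)   # 2013 2013
range(trac_month)  # 1 12
table(trac_month)  # number of records per month
```

range() gives the same minimum and maximum that summary() reports, while table() also reveals months with no records at all.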

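Number of records

Knowing the number of records helps you decide which tool to use,
since each piece of software can handle a different number of records.

Method: look at the number of observations (obs.) shown for data in the RStudio Environment pane.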
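The same count is available from code, which is handy in scripts and reports. A small sketch on a made-up data frame (`toy` is a hypothetical stand-in for data):

```r
# Hypothetical three-row stand-in for the real data frame
toy <- data.frame(city = c("A", "B", "C"), price_total = c(1, 2, 3))

nrow(toy)  # number of records: 3
ncol(toy)  # number of columns: 2
dim(toy)   # both at once: 3 2
```

For the real dataset, the summary output later in this document shows the index column X running up to 153598, which is the figure nrow(data) would be expected to return.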

Regions

table(data$city)

## 
## 高雄市 臺北市 臺中市 新北市 
##  34460  24238  37482  57418

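Inspecting the data

str(data) shows the type and structure of each column
summary(data) gives basic summary statistics for each variable
There are many more ways to get to know your data~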
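A few of those other ways, sketched on a made-up data frame. The name `toy` and its values are hypothetical; only the column names city and price_unit come from the dataset.

```r
# Hypothetical four-row stand-in with one zero and one missing value
toy <- data.frame(
  city       = factor(c("A", "A", "B", "B")),
  price_unit = c(0, 52000, NA, 88000)
)

str(toy)       # column types and a preview of the values
head(toy, 2)   # first two rows

# Missing values per column
na_counts <- colSums(is.na(toy))

# Number of zeros in price_unit, ignoring missing values
zero_count <- sum(toy$price_unit == 0, na.rm = TRUE)
```

Counting zeros and missing values up front makes it easier to spot the kind of price problem discussed in the cleaning step.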

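2. Data cleaning and processing

Running summary(data) reveals something odd about price, the thing the boss cares about most.
Price is recorded in two columns (numbered here without the leading row-index column X):
- Column 13: price_total, total price (NT$)
- Column 14: price_unit, unit price (NT$ per square meter)
Many values are 0
These zeros distort the average house price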

summary(data)

##        X              city          district        trac_year  
##  Min.   :     1   高雄市:34460   淡水區 :  7172   Min.   :102  
##  1st Qu.: 38400   臺北市:24238   西屯區 :  5974   1st Qu.:102  
##  Median : 76800   臺中市:37482   新莊區 :  5955   Median :102  
##  Mean   : 76800   新北市:57418   北屯區 :  5881   Mean   :102  
##  3rd Qu.:115199                  新店區 :  5873   3rd Qu.:102  
##  Max.   :153598                  中和區 :  5719   Max.   :102  
##                                  (Other):117024                
##    trac_month                    trac_type              trac_content  
##  Min.   : 1.000   房地(土地+建物)     :91613   土地1建物1車位0:66792  
##  1st Qu.: 4.000   房地(土地+建物)+車位:61985   土地1建物1車位1:41031  
##  Median : 7.000                                土地2建物1車位0:14537  
##  Mean   : 6.698                                土地1建物1車位2: 7195  
##  3rd Qu.:10.000                                土地2建物1車位1: 4787  
##  Max.   :12.000                                土地3建物1車位0: 4691  
##                                                (Other)        :14565  
##  use_type                           build_type      build_ymd      
##  工  :  3233   住宅大樓(11層含以上有電梯):70725   Min.   : 100602  
##  農  :   577   公寓(5樓含以下無電梯)     :23211   1st Qu.: 780326  
##  其他:  8206   透天厝                    :21954   Median : 870506  
##  商  : 26205   華廈(10層含以下有電梯)    :20365   Mean   : 868754  
##  住  :115377   套房(1房1廳1衛)           : 9709   3rd Qu.: 991201  
##                店面(店鋪)                : 2888   Max.   :1030313  
##                (Other)                   : 4746                    
##    area_land           area_build         area_park       
##  Min.   :     0.00   Min.   :    0.04   Min.   :0.00e+00  
##  1st Qu.:    12.87   1st Qu.:   85.39   1st Qu.:0.00e+00  
##  Median :    21.81   Median :  124.19   Median :0.00e+00  
##  Mean   :    41.62   Mean   :  153.03   Mean   :2.48e+01  
##  3rd Qu.:    35.64   3rd Qu.:  178.96   3rd Qu.:8.80e+00  
##  Max.   :127088.00   Max.   :79668.64   Max.   :2.40e+06  
##                                                           
##   price_total          price_unit           age      
##  Min.   :0.000e+00   Min.   :      0   Min.   :-1.0  
##  1st Qu.:4.900e+06   1st Qu.:  42685   1st Qu.: 3.0  
##  Median :8.400e+06   Median :  67880   Median :15.0  
##  Mean   :1.288e+07   Mean   :  86176   Mean   :15.2  
##  3rd Qu.:1.458e+07   3rd Qu.: 111173   3rd Qu.:24.0  
##  Max.   :8.800e+09   Max.   :4284119   Max.   :92.0  
##                      NA's   :461

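Handling price_unit values of 0

Use which to pull out the affected rows
In these rows the unit price is 0 but the total price is not
So the unit price can be computed ourselves

# Inspect the rows where the unit price is 0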
data[which(data$price_unit==0),]

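Starting from scratch

The dataset also records area_build, the total building area transferred (square meters), for each sale
Total price / total area = average unit price
You can add the column with mutate from dplyr
or assign a new column directly
After adding the column, some unit prices are still zero. Ways to handle them:
- Delete them?
- Fill in the median
- Decide based on your own needs

# Using mutate (requires the dplyr package)
library(dplyr)
data_1 <- mutate(data, price_unit_new = price_total / area_build)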

## Warning in mutate_impl(.data, dots): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored

summary(data_1$price_unit_new)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   42494   67441   85452  110173 6754717

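# Very few rows are still 0 after recomputing, so we simply drop them
# != means "not equal to"
# Filter on the recomputed column price_unit_new, not the original price_unit
data_1 <- data_1[which(data_1$price_unit_new != 0), ]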
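The other options mentioned above, direct column assignment and filling with the median, can be sketched in base R as follows. The data frame `toy` and its values are made up; taking the median over the non-zero values only is one possible choice, not necessarily what the course intends.

```r
# Hypothetical four-row stand-in; the third sale has a total price of 0
toy <- data.frame(
  price_total = c(5e6, 8e6, 0, 1.2e7),
  area_build  = c(100, 160, 90, 200)
)

# Base R equivalent of the mutate() call: assign the new column directly
toy$price_unit_new <- toy$price_total / toy$area_build

# Option 1: drop rows whose recomputed unit price is 0
dropped <- toy[toy$price_unit_new != 0, ]

# Option 2: replace zeros with the median of the non-zero unit prices
filled <- toy
nonzero_median <- median(filled$price_unit_new[filled$price_unit_new != 0])
filled$price_unit_new[filled$price_unit_new == 0] <- nonzero_median
```

Dropping rows is simplest when zeros are rare; imputing keeps the row count intact but blends made-up prices into the averages, so the choice depends on what the report needs.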

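Other issues in the data

Building type
- It affects the unit price
Use EDA (exploratory data analysis) to uncover problems in the data and dig out its value!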
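One quick EDA check for the building-type effect is to compare the unit price across types. A minimal sketch on made-up data: the English type labels and the prices are hypothetical, and only the column names build_type and price_unit come from the dataset.

```r
# Hypothetical stand-in: five sales across three building types
toy <- data.frame(
  build_type = c("apartment", "apartment", "townhouse", "townhouse", "suite"),
  price_unit = c(60000, 70000, 40000, 50000, 90000)
)

# Median unit price per building type
med_by_type <- tapply(toy$price_unit, toy$build_type, median)
sort(med_by_type, decreasing = TRUE)

# Same idea as a data frame, with the number of sales alongside
aggregate(price_unit ~ build_type, data = toy,
          FUN = function(x) c(median = median(x), n = length(x)))
```

If the medians differ a lot between types, an overall average across all types says little, and the report is probably better split by building type.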

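3. Statistics and analysis

This is the step for deeper analysis, for example:

Do unit prices differ between northern, central, and southern Taiwan?
Does a house's age affect its sale price?
Questions like these can be answered by:
- Inspecting charts
- Applying statistical tests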
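As one possible way to test the two questions above, the sketch below uses a one-way ANOVA (with a Kruskal-Wallis test as a nonparametric alternative) for the regional comparison and a correlation test for age. The choice of tests is an assumption, since the material only says "statistical tests", and the data is synthetic: the region labels, the price levels, and the age effect are all invented for illustration.

```r
set.seed(1)

# Synthetic stand-in: 30 sales per region, ages between 0 and 40 years
toy <- data.frame(
  region = factor(rep(c("north", "central", "south"), each = 30)),
  age    = runif(90, min = 0, max = 40)
)
# Invented price model: region-specific level, minus an age effect, plus noise
base_price <- rep(c(120000, 70000, 60000), each = 30)
toy$price_unit <- base_price - 500 * toy$age + rnorm(90, sd = 15000)

# Question 1: does unit price differ by region?
fit <- aov(price_unit ~ region, data = toy)
summary(fit)
kruskal.test(price_unit ~ region, data = toy)  # nonparametric alternative

# Question 2: is age related to unit price?
ct <- cor.test(toy$age, toy$price_unit)
ct$estimate  # correlation coefficient
ct$p.value
```

Real transaction prices are usually right-skewed, so it may be worth checking a histogram first and considering the nonparametric test or a log transform before trusting the ANOVA.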

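4. Visualization

See last week's teaching material by 小0

Different visualization packages

This R training course teaches the entry-level choice: ggplot2
- Easy to pick up
- Plenty of resources
Every visualization package has its pros and cons
- Base plot
- ggplot2
- plotly
- highcharts
- …
Get comfortable with one package and use the others as supplements

For example

A demonstration with iris (poor iris, always the one being used)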

ggplot2

# Remember to install and load the packages
# install.packages(c("ggplot2", "dplyr"))
library(ggplot2)
library(dplyr)  # provides the %>% pipe used below
iris %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + 
    geom_point(aes(color=Species, shape=Species)) +
    xlab("Sepal Length") +  ylab("Sepal Width") +
    ggtitle("Sepal Length-Width")

plotly

# Remember to install and load the package
# install.packages("plotly")
library(plotly)
# Formulas (~column) refer to columns of the piped-in data frame
iris %>%
  plot_ly(x = ~Sepal.Length, y = ~Sepal.Width, type = 'scatter',
          mode = 'markers', symbol = ~Species, symbols = c('circle', 'x', 'triangle-up-open'),
          color = ~Species, marker = list(size = 10)) %>%
  layout(title = "Sepal Length-Width", legend = list(orientation = 'h'))

highcharts

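# Remember to install and load the package
# install.packages("highcharter")
library(highcharter)
# Note: hc_add_series_scatter() is from older highcharter releases; newer ones
# may need hchart(iris, "scatter", hcaes(...)) instead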
hc <- highchart()
for (Species in unique(iris$Species)) {
  hc <- hc %>%
    hc_add_series_scatter(iris$Sepal.Length[iris$Species == Species],
                          iris$Sepal.Width[iris$Species == Species],
                          name = sprintf("Species: %s", Species),
                          showInLegend = TRUE)
}

hc %>% hc_title(text = "Sepal Length-Width")
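Back to the boss's question: the same ggplot2 grammar used for iris carries over to the housing data. A minimal sketch on made-up numbers, assuming ggplot2 is installed; the city labels and prices are hypothetical, and only the column names city and price_unit come from the dataset.

```r
library(ggplot2)

# Hypothetical stand-in: four sales in each of two cities
toy <- data.frame(
  city       = rep(c("City A", "City B"), each = 4),
  price_unit = c(150000, 180000, 210000, 170000,
                 60000, 75000, 68000, 90000)
)

# Boxplot of unit price by city
p <- ggplot(toy, aes(x = city, y = price_unit)) +
  geom_boxplot() +
  xlab("City") + ylab("Unit price (NT$ per square meter)") +
  ggtitle("Unit price by city (toy data)")

p  # displays the plot
```

A boxplot shows the median and spread per city at a glance, which is less sensitive to a few extreme sales than a bar chart of means.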

5. Report generation

See today's teaching material by 立筠

R Markdown
Slides

Wishing everyone more and more fluency in R~~