How to convert an RDD to a DataFrame (Spark 1.6)


1) Using SQLContext with toDF()


import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// The implicits bring in toDF() for RDDs of case classes or tuples.
import sqlContext.implicits._

rdd.toDF()
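
For context, here is a minimal end-to-end sketch of the toDF() route, assuming a hypothetical Person case class and a small RDD built with sc.parallelize:

import org.apache.spark.sql.SQLContext

// Hypothetical element type; toDF() infers the column names and types
// from the case class fields.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val people = sc.parallelize(Seq(Person("kim", 30), Person("lee", 25)))

val df = people.toDF()   // or toDF("name", "age") to rename the columns
df.printSchema()
df.show()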





2) Using HiveContext and createDataFrame with an explicit schema


import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
   
// filename points to a text file with comma-separated lines: name,age
val peopleRDD = sc.textFile(filename)

// Build the schema from a string of column names; every column is a
// nullable StringType here.
val schemaString = "name age"

val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert each text line into a Row matching the schema.
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

val sqlContext = new HiveContext(sc)

val peopleDF = sqlContext.createDataFrame(rowRDD, schema)

// registerTempTable makes the DataFrame queryable by name (Spark 1.6 API).
peopleDF.registerTempTable("people")

val results = sqlContext.sql("SELECT name FROM people")
results.collect().foreach(println)
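
For reference, the same query can be written with the DataFrame API instead of SQL, using the peopleDF built above:

// Equivalent to "SELECT name FROM people", without going through the temp table.
peopleDF.select("name").show()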



