How to convert an RDD to a DataFrame (Spark 1.6)
1) Using SQLContext
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // enables the rdd.toDF() conversion
// rdd must hold a Product type (case class or tuple) for toDF() to resolve
rdd.toDF()
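Because toDF() is resolved through implicits for Product types, it is easiest to see end to end with a case class. A minimal sketch, assuming a hypothetical Person case class and made-up sample data:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)  // hypothetical type for illustration

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Sample data invented for this example
val peopleDF = sc.parallelize(Seq(Person("kim", 30), Person("lee", 25))).toDF()
peopleDF.printSchema()  // column names and types come from the case class fields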
2) Using createDataFrame on a HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext

// Load each line of the source file as a raw string
// (sc is the SparkContext provided by the spark-shell)
val peopleRDD = sc.textFile(filename)

// Build the schema programmatically from a string of column names
val schemaString = "name age"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert each comma-separated line into a Row matching the schema
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

val sqlContext = new HiveContext(sc)
val peopleDF = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a temporary table and query it with SQL
peopleDF.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")
results.collect().foreach(println)
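Once peopleDF exists, the same lookup can also be expressed through the DataFrame API instead of SQL, with no temp table registration. A brief sketch:

// Equivalent query without registering a temp table
peopleDF.select("name").show()

// The schema above declares every column as StringType, so cast before numeric use
peopleDF.select(peopleDF("age").cast("int").as("age")).printSchema()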