How to convert an RDD to a DataFrame (Spark 1.6)


1) Using SQLContext


import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Brings in the implicit conversions that add toDF() to RDDs
import sqlContext.implicits._

// Works when the RDD's elements are Products (case classes or tuples)
rdd.toDF()
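Since toDF() only applies to RDDs of Products, here is a minimal sketch using a hypothetical Person case class (the class and sample data are assumptions, not from the original post):

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical case class; toDF() infers column names from its fields
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// An RDD of case-class instances converts directly to a DataFrame
val peopleDF = sc.parallelize(Seq(Person("kim", 30), Person("lee", 25))).toDF()
peopleDF.printSchema()  // columns: name (string), age (int)
```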





2) Using HiveContext and createDataFrame


import scala.io.Source
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext 
import org.apache.spark.sql.hive.HiveContext
   
// filename points to a text file with comma-separated lines, e.g. "kim, 30"
val peopleRDD = sc.textFile(filename)

// Build a schema in which every column is a nullable StringType
val schemaString = "name age"

val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert each comma-separated line into a Row matching the schema
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

val sqlContext = new HiveContext(sc)

val peopleDF = sqlContext.createDataFrame(rowRDD, schema)
peopleDF.registerTempTable("people")

val results = sqlContext.sql("SELECT name FROM people")
results.collect().foreach(println)
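The schema above declares every field as StringType, so age comes out as a string. A typed-schema variant (an assumption, not part of the original post) would declare age as IntegerType and cast it while building the rows:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Hypothetical typed schema: age stored as an integer instead of a string
val typedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Cast the second column to Int during Row construction
val typedRowRDD = peopleRDD
  .map(_.split(","))
  .map(attrs => Row(attrs(0), attrs(1).trim.toInt))

val typedDF = sqlContext.createDataFrame(typedRowRDD, typedSchema)
typedDF.printSchema()  // name: string, age: int
```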




Posted by 김용환