Here are some Spark SQL examples.
scala> val dataset = Seq("samuel", "jackson", "kin").toDF("name_string")
dataset: org.apache.spark.sql.DataFrame = [name_string: string]
scala> dataset.registerTempTable("names")
warning: there was one deprecation warning; re-run with -deprecation for details
(The warning appears because registerTempTable was deprecated in Spark 2.0; createOrReplaceTempView is the replacement.)
scala> sql("""select name_string from names""").show
+-----------+
|name_string|
+-----------+
| samuel|
| jackson|
| kin|
+-----------+
scala> sql("""select name_string from names where name_string ='kin' """).show
+-----------+
|name_string|
+-----------+
| kin|
+-----------+
LIKE searches need a bit of care.
A glob-style pattern such as '*' returns no results, because SQL LIKE treats '*' as a literal character; its wildcard is '%'.
scala> sql("""select name_string from names where name_string like '*' """).show
+-----------+
|name_string|
+-----------+
+-----------+
Using concat with LIKE makes the search work.
scala> sql("""select name_string from names where name_string like concat('%','sam','%') """).show
+-----------+
|name_string|
+-----------+
| samuel|
+-----------+
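The concat trick works, but a literal '%sam%' pattern should behave the same way. The LIKE semantics themselves can be sketched in plain Scala, with no Spark required, by translating the pattern to a regex (the `sqlLike` helper below is a hypothetical illustration, not part of Spark):

```scala
// Plain-Scala sketch of SQL LIKE semantics.
// '%' matches any sequence of characters; '_' matches exactly one.
// Every other character, including '*', is matched literally.
def sqlLike(value: String, pattern: String): Boolean = {
  val regex = pattern.flatMap {
    case '%' => ".*"
    case '_' => "."
    case c   => java.util.regex.Pattern.quote(c.toString)
  }
  value.matches(regex)
}

val names = Seq("samuel", "jackson", "kin")

// like '*' matches nothing: '*' is a literal asterisk, not a wildcard.
println(names.filter(sqlLike(_, "*")))     // List()

// like '%sam%' matches names containing "sam".
println(names.filter(sqlLike(_, "%sam%"))) // List(samuel)
```

This also explains the empty result above: '*' never matches unless the value is literally an asterisk.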
The DataFrame (ETL pipeline) style has long been available as an alternative to SQL statements.
(The ETL pipeline style is a bit more awkward to write.)
scala> dataset.groupBy("name_string").count().filter($"count" >= 1).show()
+-----------+-----+
|name_string|count|
+-----------+-----+
| jackson| 1|
| kin| 1|
| samuel| 1|
+-----------+-----+
scala> dataset.groupBy("name_string").count().filter($"count" >= 2).show()
+-----------+-----+
|name_string|count|
+-----------+-----+
+-----------+-----+
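The groupBy/count/filter pipeline above mirrors what plain Scala collections do. As a rough sketch of the same logic without Spark, using the names from the example:

```scala
val names = Seq("samuel", "jackson", "kin")

// Equivalent of dataset.groupBy("name_string").count(): name -> occurrence count.
val counts: Map[String, Int] =
  names.groupBy(identity).map { case (name, occurrences) => name -> occurrences.size }

// filter($"count" >= 1) keeps every name; each appears exactly once here.
println(counts.filter { case (_, c) => c >= 1 })

// filter($"count" >= 2) keeps nothing, matching the empty result above.
println(counts.filter { case (_, c) => c >= 2 }) // Map()
```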
scala> dataset.select("name_string").show()
+-----------+
|name_string|
+-----------+
| samuel|
| jackson|
| kin|
+-----------+
scala> dataset.select("name_string").where($"name_string".equalTo("samuel")).show()
+-----------+
|name_string|
+-----------+
| samuel|
+-----------+
scala> val dataset2 = Seq(("samuel", "01/05/2017"), ("noah", "01/05/2018"))
dataset2: Seq[(String, String)] = List((samuel,01/05/2017), (noah,01/05/2018))
scala> val dataset2 = Seq(("samuel", "01/05/2017"), ("noah", "01/05/2018")).toDF("name", "create_date")
dataset2: org.apache.spark.sql.DataFrame = [name: string, create_date: string]
scala> dataset2.registerTempTable("reservation")
warning: there was one deprecation warning; re-run with -deprecation for details
scala> sql("""SELECT * from reservation""").show
+------+-----------+
| name|create_date|
+------+-----------+
|samuel| 01/05/2017|
| noah| 01/05/2018|
+------+-----------+
To find the day of the week, combine unix_timestamp with from_unixtime and an 'EEEE' format:
scala> sql("""SELECT name,create_date,from_unixtime(unix_timestamp(create_date, 'MM/dd/yyyy'), 'EEEE') as day from reservation where name='samuel' """).show
+------+-----------+--------+
| name|create_date| day|
+------+-----------+--------+
|samuel| 01/05/2017|Thursday|
+------+-----------+--------+
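The from_unixtime(unix_timestamp(...), 'EEEE') combination is just an ordinary date parse followed by a weekday format. The same result can be cross-checked with plain java.time, independent of Spark (a sketch using the same patterns as the SQL above):

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter
import java.util.Locale

// Same patterns the SQL uses: 'MM/dd/yyyy' to parse, 'EEEE' to print the weekday.
val parser  = DateTimeFormatter.ofPattern("MM/dd/yyyy", Locale.ENGLISH)
val weekday = DateTimeFormatter.ofPattern("EEEE", Locale.ENGLISH)

val date = LocalDate.parse("01/05/2017", parser)
println(weekday.format(date)) // Thursday, matching the SQL result above
println(date.getDayOfWeek == DayOfWeek.THURSDAY)
```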
Next is a SQL example that uses a case class.
scala> case class Num(x:Int)
defined class Num
scala> val rdd=sc.parallelize(List(Num(1), Num(2), Num(3)))
rdd: org.apache.spark.rdd.RDD[Num] = ParallelCollectionRDD[12] at parallelize at <console>:34
scala> spark.createDataFrame(rdd).show
+---+
| x|
+---+
| 1|
| 2|
| 3|
+---+
scala> val df = spark.createDataFrame(rdd)
df: org.apache.spark.sql.DataFrame = [x: int]
scala> df.registerTempTable("num")
warning: there was one deprecation warning; re-run with -deprecation for details
scala> sql("""select * from num where x=2""").show
+---+
| x|
+---+
| 2|
+---+