This is an example of counting occurrences of specific words in Spark. The same map/reduce pattern is useful when computing PV (page views) and UV (unique visitors).
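As a hedged aside on the PV/UV claim, here is a minimal sketch of how the same shape applies; the file name access.log and the "userId&lt;TAB&gt;url" record layout are assumptions for illustration, not part of the original example.

val logs = sc.textFile("access.log")
  .map(_.split("\t"))
  .filter(_.length == 2)                          // keep well-formed "userId<TAB>url" records

val pv = logs.count()                             // PV: total number of requests
val uv = logs.map { case Array(user, _) => user }
  .distinct()
  .count()                                        // UV: number of distinct users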
Example
The input file is based on the date-formatting examples at https://kodejava.org/how-do-i-format-a-date-into-ddmmyyyy/.
$ cat xxx.txt
Date date = Calendar.getInstance().getTime();
// Display a date in day, month, year format
DateFormat formatter = new SimpleDateFormat("dd/MM/yyyy");
String today = formatter.format(date);
System.out.println("Today : " + today);
// Display date with day name in a short format
formatter = new SimpleDateFormat("EEE, dd/MM/yyyy");
today = formatter.format(date);
System.out.println("Today : " + today);
// Display date with a short day and month name
formatter = new SimpleDateFormat("EEE, dd MMM yyyy");
today = formatter.format(date);
System.out.println("Today : " + today);
// Formatting date with full day and month name and show time up to
// milliseconds with AM/PM
formatter = new SimpleDateFormat("EEEE, dd MMMM yyyy, hh:mm:ss.SSS a");
today = formatter.format(date);
System.out.println("Today : " + today);
scala> val codes = sc.textFile("xxx.txt")
codes: org.apache.spark.rdd.RDD[String] = xxx.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val lower = codes.map(line => line.toLowerCase)
lower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:26
scala> val words = lower.flatMap(line => line.split("\\s+"))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:28
scala> val counts = words.map(word => (word, 1))
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:30
scala> val frequency = counts.reduceByKey(_ + _)
frequency: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:32
scala> val invFrequency = frequency.map(_.swap)
invFrequency: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[6] at map at <console>:34
scala> invFrequency.top(10).foreach(println)
(23,)
(9,=)
(6,date)
(5,//)
(4,with)
(4,today);)
(4,today)
(4,system.out.println("today)
(4,new)
(4,formatter.format(date);)
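Note that top(10) on the swapped (Int, String) pairs relies on the default tuple ordering, which compares the count first. The swap can be skipped by sorting the (word, count) pairs directly; the following should give the same counts, printed as (word, count) rather than the swapped form, though tie-breaking order may differ:

frequency.sortBy(_._2, ascending = false).take(10).foreach(println)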
The pipeline above can be written more compactly as a single chain:
scala> val result = sc.textFile("xxx.txt").map(line => line.toLowerCase).flatMap(line => line.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _).map(_.swap)
result: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[21] at map at <console>:24
scala> result.top(10).foreach(println)
(23,)
(9,=)
(6,date)
(5,//)
(4,with)
(4,today);)
(4,today)
(4,system.out.println("today)
(4,new)
(4,formatter.format(date);)
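Since RDD transformations are lazy, each action on result recomputes the whole chain from xxx.txt. If the result will be inspected more than once, caching it first avoids the recomputation (a small sketch using the standard cache API):

result.cache()                    // materialized on the first action, reused afterwards
result.top(10).foreach(println)   // triggers the computation and fills the cache
result.count()                    // served from the cached partitions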
Uninteresting tokens can be filtered out by defining a stopWords set:
scala> val stopWords = Set("", "=", "//", ")", "(", ";", ":", "+", "-", "\"")
stopWords: scala.collection.immutable.Set[String] = Set("", =, ), ", -, ;, //, +, (, :)
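The filter below references stopWords directly, so the set is serialized into every task closure. That is fine for a dozen tokens; for a large stop list, a broadcast variable would ship it to each executor once instead (a sketch, not part of the original session):

val stopWordsBC = sc.broadcast(stopWords)         // sent to each executor once
val filtered = sc.textFile("xxx.txt")
  .map(_.toLowerCase)
  .flatMap(_.split("\\s+"))
  .filter(word => !stopWordsBC.value.contains(word))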
scala> val result = sc.textFile("xxx.txt").map(line => line.toLowerCase).flatMap(line => line.split("\\s+")).filter(!stopWords.contains(_)).map(word => (word, 1)).reduceByKey(_ + _).map(_.swap)
result: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[34] at map at <console>:26
scala> result.top(10).foreach(println)
(6,date)
(4,with)
(4,today);)
(4,today)
(4,system.out.println("today)
(4,new)
(4,formatter.format(date);)
(4,formatter)
(3,name)
(3,display)
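Even with the stop list, punctuation still clings to tokens such as today); and system.out.println("today. Splitting on non-word characters (\W+) instead of whitespace strips most punctuation without any stop list; here is a sketch against the same xxx.txt (the resulting counts will differ from the output above, since punctuation is removed rather than kept attached):

val cleaned = sc.textFile("xxx.txt")
  .map(_.toLowerCase)
  .flatMap(_.split("\\W+"))        // split on any non-word character
  .filter(_.nonEmpty)              // drop empty tokens from leading separators
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .map(_.swap)
cleaned.top(10).foreach(println)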