Posts from 2017/08 (page 5)

[spark2] Avoid groupByKey

scala 2017. 8. 10. 19:24

Using groupByKey in Spark can hurt performance badly.

The following link explains why in detail.

https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

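groupByKey sends every key-value pair through the shuffle and only then groups the values by key, so any aggregation runs after all of the data has already been moved. This can cause a large amount of data copying.

reduceByKey, by contrast, first combines the values for each key within each partition before the shuffle. Only the partial results are transferred, so network traffic and copying costs are lower than with groupByKey.

This works much like a custom combiner in map/reduce.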

http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
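The difference can be sketched with a small word count. This is only an illustration: the object name ReduceByKeyVsGroupByKey, the two helper methods, and the sample data are made up for this example, and it assumes Spark 2.x on the classpath with a local master. In spark-shell, the bodies of the helpers can be pasted directly and run against the provided sc.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceByKeyVsGroupByKey {
  // groupByKey: every (word, 1) pair is shuffled, and the sum runs after the shuffle.
  def countWithGroupByKey(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.groupByKey().mapValues(_.sum).collect().toMap

  // reduceByKey: values are summed inside each partition first, so at most one
  // partial sum per key and partition goes through the shuffle.
  def countWithReduceByKey(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.reduceByKey(_ + _).collect().toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("reduceByKey-vs-groupByKey"))

    // Hypothetical sample data spread over 2 partitions.
    val pairs = sc.parallelize(
      Seq("a", "b", "a", "c", "b", "a").map(word => (word, 1)), 2)

    // Both approaches return the same counts; only the shuffled volume differs.
    println(countWithGroupByKey(pairs))
    println(countWithReduceByKey(pairs))

    sc.stop()
  }
}
```

Both helpers return the same result, so switching from groupByKey to reduceByKey is usually safe whenever the per-key computation is an associative and commutative reduction such as a sum.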


Posted by '김용환'


[spark2] mapPartitionsWithIndex example

scala 2017. 8. 10. 18:58

If you want to skip a particular line (for example, the first line) before applying map to an RDD, use mapPartitionsWithIndex() as follows. It drops the first element of partition 0.

rdd.mapPartitionsWithIndex(
      (i, iterator) => if (i == 0) iterator.drop(1) else iterator)

Here is an example.

scala> val rdd = sc.parallelize(List("samuel", "kyle", "jun", "ethan", "crizin"), 5)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.mapPartitionsWithIndex((i, iterator) => if (i == 0) iterator.drop(1) else iterator).foreach(println)
kyle
crizin
ethan
jun

scala> rdd.mapPartitionsWithIndex((i, iterator) => if (i % 2 == 0) iterator.drop(1) else iterator).foreach(println)
kyle
ethan
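To see why those particular names disappear, it helps to look at which partition holds each element. The following is a minimal sketch, assuming Spark 2.x with a local master; the object name PartitionLayout and the helper withPartitionIndex are hypothetical names introduced for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionLayout {
  // Pairs every element with the index of the partition that holds it.
  def withPartitionIndex(rdd: RDD[String]): Array[(Int, String)] =
    rdd.mapPartitionsWithIndex((i, iterator) => iterator.map(name => (i, name))).collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partition-layout"))

    // Same data as above: 5 elements in 5 partitions, one element each.
    val rdd = sc.parallelize(List("samuel", "kyle", "jun", "ethan", "crizin"), 5)

    // Prints (0,samuel), (1,kyle), (2,jun), (3,ethan), (4,crizin)
    println(withPartitionIndex(rdd).mkString(", "))

    sc.stop()
  }
}
```

With one element per partition, dropping the first element of partition 0 removes samuel, and dropping it from the even partitions 0, 2, and 4 removes samuel, jun, and crizin. The output order of foreach(println) is not guaranteed, because the partitions are processed in parallel.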


[cassandra3] Backing up and restoring a schema

cassandra 2017. 8. 8. 19:33

To dump every keyspace, run the following.

$ ./bin/cqlsh -e "desc schema"

CREATE KEYSPACE users WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

CREATE TABLE users.follow_relation (
...
)

To save the output to a file, run the following.

$ ./bin/cqlsh -e "desc schema" > schema.cql

To save only a specific keyspace to a file, run the following.

$ ./bin/cqlsh -e "desc keyspace my_status" > my_status.cql

$ cat schema.cql
CREATE KEYSPACE my_status WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

CREATE TABLE my_status.follow_relation (
    followed_username text,
    follower_username text,
    ....
)

To import the generated keyspace file, start cqlsh and use the source command.

$ ./bin/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.10 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> source 'schema.cql'
cqlsh> use my_status;
cqlsh:my_status> describe my_status;


[spark] [repost] wide dependency, narrow dependency

scala 2017. 8. 8. 18:37

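I regretted having written Spark code carelessly, without thinking it through. I remembered many things I had done just hoping they would work.

A Coursera course on Spark explains wide and narrow dependencies. It gave me a lot of insight, so I am reposting the summary here.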

https://github.com/rohitvg/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies

Transformations with (usually) Narrow dependencies:

map
mapValues
flatMap
filter
mapPartitions
mapPartitionsWithIndex

Transformations with (usually) Wide dependencies: (might cause a shuffle)

cogroup
groupWith
join
leftOuterJoin
rightOuterJoin
groupByKey
reduceByKey
combineByKey
distinct
intersection
repartition
coalesce
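One way to check which kind of dependency a given transformation actually produced is to inspect RDD.dependencies. The sketch below is an illustration and not part of the linked wiki page; it assumes Spark 2.x with a local master, and the object name DependencyKind, the helper isWide, and the sample data are hypothetical.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DependencyKind {
  // True when the RDD's direct link to its parent is a shuffle (wide) dependency.
  def isWide(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("dependency-kind"))

    // Hypothetical sample data with no partitioner set.
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")), 3)

    println(isWide(pairs.mapValues(_.toUpperCase))) // false: narrow dependency
    println(isWide(pairs.groupByKey()))             // true: shuffle dependency

    // toDebugString shows the same information as an indented lineage.
    println(pairs.groupByKey().toDebugString)

    sc.stop()
  }
}
```

The "(usually)" in the lists matters: the result depends on how the parent RDD is partitioned, so checking the actual lineage is more reliable than memorizing the lists.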


[cassandra3] Resolving "Cannot page queries with both ORDER BY and a IN restriction on the partition key; you must either remove the ORDER BY or the IN and sort client side, or disable paging for this query"

cassandra 2017. 8. 8. 15:29

In Cassandra, using IN together with ORDER BY can produce the following error.

(For reference, putting a clustering key after ORDER BY lets you get results sorted by creation time in descending order, regardless of the partition key.)

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot page queries with both ORDER BY and a IN restriction on the partition key; you must either remove the ORDER BY or the IN and sort client side, or disable paging for this query"

In this case, running the PAGING OFF command in cqlsh makes the query run normally without the error.


[spark2] partitionBy, HashPartitioner, RangePartitioner examples

scala 2017. 8. 7. 17:59

You can choose a Partitioner by calling the partitionBy method on an RDD.

The built-in Partitioner (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/Partitioner.html) implementations are HashPartitioner and RangePartitioner.

First, HashPartitioner. It is useful because it spreads keys across partitions by hash.

The following example creates an RDD with 5 partitions and then repartitions it into 3 partitions with a HashPartitioner.

scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)), 5)
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> pairs.partitioner
res1: Option[org.apache.spark.Partitioner] = None

scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

scala> val partitioned = pairs.partitionBy(new HashPartitioner(3)).persist()
partitioned: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[3] at partitionBy at <console>:27

scala> partitioned.collect
res2: Array[(Int, Int)] = Array((2,2), (1,1), (3,3))

scala> pairs.partitions.length
res7: Int = 5

scala> partitioned.partitions.length
res8: Int = 3

scala> pairs.partitions
res5: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@6ba, org.apache.spark.rdd.ParallelCollectionPartition@6bb, org.apache.spark.rdd.ParallelCollectionPartition@6bc, org.apache.spark.rdd.ParallelCollectionPartition@6bd, org.apache.spark.rdd.ParallelCollectionPartition@6be)

scala> partitioned.partitions
res6: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ShuffledRDDPartition@0, org.apache.spark.rdd.ShuffledRDDPartition@1, org.apache.spark.rdd.ShuffledRDDPartition@2)

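Calling persist() keeps the already-shuffled result, so the shuffle does not have to run again each time the RDD is reused, which is a performance benefit. This is a useful tip in practice.

Also, the RDD.toDebugString method helps you tell whether an RDD is a shuffled RDD.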

scala> partitioned.toDebugString
res11: String =
(3) ShuffledRDD[8] at partitionBy at <console>:27 [Memory Deserialized 1x Replicated]
 | CachedPartitions: 3; MemorySize: 192.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(5) ParallelCollectionRDD[7] at parallelize at <console>:24 [Memory Deserialized 1x Replicated]

scala> pairs.toDebugString
res13: String = (5) ParallelCollectionRDD[7] at parallelize at <console>:24 []
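A related benefit of a partitioned RDD is that later key-based operations can reuse its partitioner instead of shuffling again. The following is a minimal sketch of that behavior, assuming Spark 2.x with a local master; the object name PartitionerReuse and the helper shufflesParent are hypothetical names for this illustration.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerReuse {
  // True when the RDD's direct link to its parent is a shuffle dependency.
  def shufflesParent(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitioner-reuse"))

    val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)), 5)
    val partitioned = pairs.partitionBy(new HashPartitioner(3)).persist()

    println(pairs.partitioner.isDefined)       // false: no partitioner yet
    println(partitioned.partitioner.isDefined) // true: HashPartitioner(3)

    // reduceByKey on the unpartitioned RDD needs its own shuffle ...
    println(shufflesParent(pairs.reduceByKey(_ + _)))       // true
    // ... but on the hash-partitioned RDD it reuses the existing layout.
    println(shufflesParent(partitioned.reduceByKey(_ + _))) // false

    sc.stop()
  }
}
```

Combined with persist(), this means the one shuffle done by partitionBy can serve several later key-based operations on the same RDD.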

Next is a RangePartitioner example. The usage looks similar.

scala> import org.apache.spark.RangePartitioner
import org.apache.spark.RangePartitioner

scala> new RangePartitioner(3, pairs)
res9: org.apache.spark.RangePartitioner[Int,Int] = org.apache.spark.RangePartitioner@7d2d

scala> val rangePartitioned = pairs.partitionBy(new RangePartitioner(3, pairs)).persist()
rangePartitioned: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[8] at partitionBy at <console>:28

scala> rangePartitioned.collect
res10: Array[(Int, Int)] = Array((1,1), (2,2), (3,3))

scala> rangePartitioned.partitions.length
res11: Int = 3

The RangePartitioner API (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/RangePartitioner.html) has a constructor that takes an Ordering and a sort direction (ascending or descending). This seems to be the main difference from HashPartitioner.

Source: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala

public RangePartitioner(int partitions,
                RDD<? extends scala.Product2<K,V>> rdd,
                boolean ascending,
                scala.math.Ordering<K> evidence$1,
                scala.reflect.ClassTag<K> evidence$2)
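From Scala, the Ordering and ClassTag arguments are supplied implicitly, so only the ascending flag needs to be passed. The sketch below compares the two directions under the assumption of Spark 2.x with a local master; the object name RangePartitionerOrder, the helper keysPerPartition, and the sample keys 1 to 9 are made up for this illustration. The exact range boundaries come from sampling, so no expected output is shown.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RangePartitionerOrder {
  // Keys held by each partition, listed in partition order.
  def keysPerPartition(rdd: RDD[(Int, Int)]): Array[List[Int]] =
    rdd.glom().collect().map(_.map(_._1).sorted.toList)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("range-partitioner-order"))

    // Hypothetical sample data: keys 1 to 9 in 5 partitions.
    val pairs = sc.parallelize((1 to 9).map(i => (i, i)), 5)

    // Ascending (the default): lower key ranges go to lower partition indexes.
    val asc = pairs.partitionBy(new RangePartitioner(3, pairs))
    // Descending: higher key ranges go to lower partition indexes.
    val desc = pairs.partitionBy(new RangePartitioner(3, pairs, ascending = false))

    println(keysPerPartition(asc).mkString(" | "))
    println(keysPerPartition(desc).mkString(" | "))

    sc.stop()
  }
}
```

Unlike HashPartitioner, which places a key by its hash code alone, RangePartitioner keeps neighboring keys in the same partition, which is why it needs an Ordering on the key type.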


[elasticsearch] The indices.fielddata.cache.expire setting

Elasticsearch 2017. 8. 2. 20:21

Elasticsearch 1.x had an option for setting the expiry of the field data cache (indices.fielddata.cache.expire), but it was removed in 2.0.

https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html

indices.fielddata.cache.expire

[experimental] This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.

A time based setting that expires field data after a certain time of inactivity. Defaults to -1. For example, can be set to 5m for a 5 minute expiry.

It appears to have been removed because it triggered heavy GC and could cause crashes.

https://discuss.elastic.co/t/indices-fielddata-cache-expire/1183

I used it on 1.4 without problems, but given that it was eventually removed, it presumably caused serious GC issues.

It is gone as of 2.0 anyway, so I am leaving this note here for the record.
