[cassandra3] Collections and user-defined types (UDT)

cassandra 2017. 8. 12. 23:47

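CQL collections (map, set, list) offer a lot of functionality. They resemble the Java collection types of the same names (Cassandra itself is written in Java), so they allow flexible data modeling in CQL.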

ALTER TABLE "user_status_updates"

ADD "starred_by_users" SET<text>;

ALTER TABLE "user_status_updates"

ADD "shared_by" LIST<text>;

ALTER TABLE "users"

ADD social_identities MAP<text,bigint>;

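UPDATE examples follow. In Cassandra an UPDATE is an upsert, just like an INSERT, so the same statement either creates the row or updates it.

The most powerful feature of CQL collections is that individual values can be added to or removed from a collection without rewriting the whole collection.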

UPDATE images SET tags = tags + { 'cute', 'cuddly' } WHERE name = 'cat.jpg';

UPDATE images SET tags = tags - { 'lame' } WHERE name = 'cat.jpg';

UPDATE plays SET players = 5, scores = scores + [ 14, 21 ] WHERE id = '123-afde';

UPDATE plays SET players = 5, scores = [ 12 ] + scores WHERE id = '123-afde';

UPDATE users SET favs['author'] = 'Ed Poe' WHERE id = 'jsmith';

UPDATE users SET favs = favs + { 'movie' : 'Casablanca' } WHERE id = 'jsmith';
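
As a rough analogy for the statements above, the same operations can be written with Python's built-in set, list, and dict. This is only a sketch of the semantics, not driver code: the starting values are hypothetical, and Cassandra applies the add, append, prepend, and put operations as plain writes without reading the stored collection first.

```python
# Analogy only: Python built-ins standing in for CQL set, list, and map columns.
# All starting values are hypothetical.

tags = {"lame", "fluffy"}
tags = tags | {"cute", "cuddly"}        # tags = tags + { 'cute', 'cuddly' }
tags = tags - {"lame"}                  # tags = tags - { 'lame' }

scores = [7]
scores = scores + [14, 21]              # scores = scores + [ 14, 21 ]  (append)
scores = [12] + scores                  # scores = [ 12 ] + scores      (prepend)

favs = {"fruit": "apple"}
favs["author"] = "Ed Poe"               # favs['author'] = 'Ed Poe'
favs = {**favs, "movie": "Casablanca"}  # favs = favs + { 'movie' : ... }

print(sorted(tags))  # ['cuddly', 'cute', 'fluffy']
print(scores)        # [12, 7, 14, 21]
print(favs)          # {'fruit': 'apple', 'author': 'Ed Poe', 'movie': 'Casablanca'}
```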

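One caveat: deleting and modifying list elements has a performance cost (setting or removing an element by position, or removing by value, makes Cassandra read the list before writing), whereas deletes and updates on maps and sets are much less of a problem.

You can also create an index on a collection column. Because this can cause performance problems, though, it is best avoided under heavy traffic.

By default, a secondary index on a map column indexes the values of the map. To index the keys of the map instead, use the KEYS operator.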

CREATE INDEX ON "users" (KEYS("social_identities"));

A lookup then looks like this.

SELECT "username", "social_identities"

FROM users

WHERE "social_identities" CONTAINS KEY 'facebook';

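In CQL you cannot read part of a collection column. The only way to retrieve data from a collection is to read the whole collection, so take care when using collections on performance-sensitive paths.

Collections also have a size limit: a collection is meant to hold no more than 64KB of data. You can in fact add more than 64KB to a single collection, but a read returns only the first 64KB, so the result is truncated and data is effectively lost. (The DataStax CQL 3.1 collections documentation listed in the references words the limits as at most 64KB per item and at most 64K items per collection.)

Data that may grow without bound is therefore a poor fit for a collection column. If the data keeps growing, split it across several collections that each stay under 64KB.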
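
The splitting advice can be sketched as a greedy packer. This is a minimal illustration in Python, under assumptions that are not from the post: size is estimated as the raw UTF-8 length of each item (Cassandra's own serialization overhead is ignored), and `split_into_chunks` and `LIMIT_BYTES` are hypothetical names. Each resulting chunk would then be stored as its own collection, for example one per row.

```python
from typing import Iterable, List

LIMIT_BYTES = 64 * 1024  # the 64KB figure quoted in the text


def split_into_chunks(items: Iterable[str], limit: int = LIMIT_BYTES) -> List[List[str]]:
    """Greedily pack items into chunks whose total UTF-8 size stays under `limit`."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for item in items:
        size = len(item.encode("utf-8"))  # rough estimate, ignores serialization overhead
        if size >= limit:
            raise ValueError("a single item does not fit in one chunk")
        if current and current_size + size >= limit:
            # The next item would push this chunk to the limit: start a new one.
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Five items of 30,000 bytes each: at most two fit under 64KB together.
chunks = split_into_chunks(["x" * 30_000] * 5)
print([len(c) for c in chunks])  # [2, 2, 1]
```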

Another limitation of Cassandra collections is that a collection cannot be read when multiple rows are selected with a WHERE...IN clause. The following query fails with an error.

SELECT * FROM "user_status_updates"

WHERE "username" = 'alice'

AND "id" IN (

1234

);

If the table has a collection column, a query that uses WHERE...IN must explicitly select only the non-collection columns.

Tuples are supported as well.

CREATE TABLE cycling.route (race_id int, race_name text, point_id int, lat_long tuple<text, tuple<float,float>>, PRIMARY KEY (race_id, point_id));

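There is one main reason to use tuples.

A secondary index can be applied to a single column only. For example, if education_history were split into separate name and year columns, you could not create an index that efficiently finds records by a given combination of those values. By putting both values into one column as a tuple and indexing that column, you get the same effect as indexing multiple columns.

(In other words, this shortcoming of Cassandra's secondary indexes can be worked around with a tuple column.)

User-defined types (UDTs), an extension of the tuple concept, are supported as well. A UDT additionally gives each field a name.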

cqlsh> CREATE TYPE cycling.basic_info (

birthday timestamp,

nationality text,

weight text,

height text

);

CREATE TABLE cycling.cyclist_stats ( id uuid PRIMARY KEY, lastname text, basics FROZEN<basic_info>);

In most situations a user-defined type is a better choice than a tuple, because it adds named fields and partial selection.

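Property	Set	List	Map	Tuple	User-defined type
Size	Flexible	Flexible	Flexible	Fixed	Fixed
Individual updates	Yes	Yes	Yes	No	No
Partial selection	No	No	No	No	Yes
Name-value pairs	No	No	Yes	No	Yes
Multiple types	No	No	Key and value	Yes	Yes
Index	Individual elements	Individual elements	Individual elements	Whole value	Whole value
Usable in primary key	No	No	No	Yes	Yes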

References

http://www.datastax.com/documentation/cql/3.3/cql/cql_reference/delete_r.html

http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html

http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_collections_c.html

http://cassandra.apache.org/doc/old/CQL-3.0.html#collections

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tupleType.html

https://docs.datastax.com/en/cql/latest/cql/cql_using/useCreateUDT.html

Posted by '김용환'

[cassandra3] commit log - fixing the "Unexpected error deserializing mutation" error

cassandra 2017. 8. 12. 10:49

If Cassandra running on macOS terminates abnormally, the following error sometimes appears on the next start.

Restarting again gives the same result.

INFO [main] 2017-08-11 23:18:05,341 CommitLog.java:157 - Replaying ./bin/../data/commitlog/CommitLog-6-1502171952033.log, ./bin/../data/commitlog/CommitLog-6-1502171952034.log, ./bin/../data/commitlog/CommitLog-6-1502422474087.log, ./bin/../data/commitlog/CommitLog-6-1502422474088.log, ./bin/../data/commitlog/CommitLog-6-1502422504239.log, ./bin/../data/commitlog/CommitLog-6-1502422504240.log, ./bin/../data/commitlog/CommitLog-6-1502452966387.log, ./bin/../data/commitlog/CommitLog-6-1502452966388.log, ./bin/../data/commitlog/CommitLog-6-1502457013860.log, ./bin/../data/commitlog/CommitLog-6-1502457013861.log, ./bin/../data/commitlog/CommitLog-6-1502457041056.log, ./bin/../data/commitlog/CommitLog-6-1502457041057.log

ERROR [main] 2017-08-11 23:18:05,622 JVMStabilityInspector.java:82 - Exiting due to error while processing commit log during initialization.

org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Unexpected error deserializing mutation; saved to /var/folders/ch/zbmq4sk149gcz172ylw54m140000gp/T/mutation410555022742916493dat. This may be caused by replaying a mutation against a table with the same name but incompatible schema. Exception follows: java.io.IOError: java.io.IOException: Corrupt empty row found in unfiltered partition

at org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:409) [apache-cassandra-3.10.jar:3.10]

at org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:342) [apache-cassandra-3.10.jar:3.10]

at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:201) [apache-cassandra-3.10.jar:3.10]

at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:84) [apache-cassandra-3.10.jar:3.10]

at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:140) [apache-cassandra-3.10.jar:3.10]

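The error is raised while the commit log is being replayed. The commit log is a binary file, so it is awkward to inspect, and tracking down the exact cause can be hard. Moving the commit log to a backup directory and restarting lets Cassandra start normally. Note that any writes that existed only in those commit log segments (not yet flushed to SSTables) will not be replayed.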

mkdir -p ~/dev/backup/

mv data/commitlog ~/dev/backup/

./bin/cassandra
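
The move step can also be scripted so that repeated incidents do not pile into the same backup directory. The following is a minimal sketch in Python, not part of Cassandra: `backup_commitlog` and the timestamped directory name are hypothetical, and it assumes Cassandra is stopped and that the commit log lives in `data/commitlog` as in the log output above.

```python
import shutil
import time
from pathlib import Path


def backup_commitlog(data_dir: Path, backup_root: Path) -> Path:
    """Move data_dir/commitlog into backup_root under a timestamped name.

    Run this only while Cassandra is stopped.
    """
    source = data_dir / "commitlog"
    if not source.is_dir():
        raise FileNotFoundError(f"no commit log directory at {source}")
    backup_root.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one backup directory per incident.
    target = backup_root / f"commitlog-{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.move(str(source), str(target))
    return target


# Example call, mirroring the shell commands above:
# backup_commitlog(Path("data"), Path.home() / "dev" / "backup")
```

After the move, start Cassandra again with ./bin/cassandra as before.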


[spark2] Avoid groupByKey

scala 2017. 8. 10. 19:24

Using groupByKey in Spark can hurt performance badly.

The following link explains it well.

https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

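groupByKey sends every key-value pair across the network to group the pairs by key, and only then runs the computation on each group. A large amount of data can therefore be copied.

reduceByKey, by contrast, performs part of the computation inside each node before the data is shuffled. Only the partial results are transferred, so network traffic and copy cost are lower than with groupByKey.

It behaves much like a custom combiner in MapReduce.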

http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop
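
The difference can be made concrete with a small simulation. This is a sketch in plain Python rather than Spark code: partitions are modeled as lists of (key, value) pairs, the two function names are hypothetical, and the "shuffled" count stands in for the number of records that would cross the network.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def group_by_key_sum(partitions: List[List[Pair]]) -> Tuple[Dict[str, int], int]:
    """Shuffle every pair unchanged, then sum per key. Returns (result, shuffled count)."""
    shuffled = [pair for part in partitions for pair in part]
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return {k: sum(vs) for k, vs in groups.items()}, len(shuffled)


def reduce_by_key_sum(partitions: List[List[Pair]]) -> Tuple[Dict[str, int], int]:
    """Pre-sum inside each partition, shuffle only the partial sums, then merge them."""
    shuffled: List[Pair] = []
    for part in partitions:
        partial: Dict[str, int] = defaultdict(int)
        for key, value in part:
            partial[key] += value          # combine locally, before the shuffle
        shuffled.extend(partial.items())
    result: Dict[str, int] = defaultdict(int)
    for key, value in shuffled:
        result[key] += value
    return dict(result), len(shuffled)


partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
print(group_by_key_sum(partitions))   # ({'a': 3, 'b': 3}, 6)
print(reduce_by_key_sum(partitions))  # ({'a': 3, 'b': 3}, 4)
```

Both variants produce the same sums, but the pre-combining version moves 4 records instead of 6. The gap widens as keys repeat more often within a partition.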


[spark2] mapPartitionsWithIndex example

scala 2017. 8. 10. 18:58

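To skip a particular line (for example, the first line) before applying map to an RDD, use mapPartitionsWithIndex() as follows. The first line of the data is the first element of partition 0, so that is the element to drop.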

rdd.mapPartitionsWithIndex(
      (i, iterator) => if (i == 0) iterator.drop(1) else iterator)

An example session follows.

scala> val rdd = sc.parallelize(List("samuel", "kyle", "jun", "ethan", "crizin"), 5)

rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.mapPartitionsWithIndex((i, iterator) => if (i == 0) iterator.drop(1) else iterator).foreach(println)

kyle

crizin

ethan

jun

scala> rdd.mapPartitionsWithIndex((i, iterator) => if (i % 2 == 0) iterator.drop(1) else iterator).foreach(println)

kyle

ethan
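
The session above can be reproduced without Spark to see why those names survive. This is a sketch in plain Python under one assumption: parallelize with 5 slices places each of the 5 names in its own partition, which matches the output above. `map_partitions_with_index` and `drop_first` are hypothetical helpers, not Spark APIs.

```python
from typing import Callable, Iterator, List


def map_partitions_with_index(
    partitions: List[List[str]],
    func: Callable[[int, Iterator[str]], Iterator[str]],
) -> List[List[str]]:
    """Apply func(partition_index, iterator) to every partition."""
    return [list(func(i, iter(part))) for i, part in enumerate(partitions)]


def drop_first(it: Iterator[str]) -> Iterator[str]:
    """Skip the first element of an iterator, like Scala's iterator.drop(1)."""
    next(it, None)
    return it


names = ["samuel", "kyle", "jun", "ethan", "crizin"]
partitions = [[name] for name in names]  # 5 elements spread over 5 partitions

skip_first = map_partitions_with_index(
    partitions, lambda i, it: drop_first(it) if i == 0 else it)
skip_even = map_partitions_with_index(
    partitions, lambda i, it: drop_first(it) if i % 2 == 0 else it)

flat_first = [n for part in skip_first for n in part]
flat_even = [n for part in skip_even for n in part]
print(flat_first)  # ['kyle', 'jun', 'ethan', 'crizin']
print(flat_even)   # ['kyle', 'ethan']
```

In the first call only partition 0 loses its single element (samuel). In the second, partitions 0, 2, and 4 each lose theirs, leaving kyle and ethan. The Spark shell prints the survivors in a different order because foreach processes partitions in parallel.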


[cassandra] What "node local" means

cassandra 2017. 8. 10. 16:26

The batch log, one of Cassandra's core mechanisms, is described at the URL below.

https://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2

The batchlog table is node-local:

The batchlog table is node-local, along with the rest of the system keyspace.

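Node-local: the batch log is stored locally on the node that writes it, like the rest of the system keyspace, rather than being distributed across the cluster by the normal replication of a user keyspace.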
