'2018/10' post list

21 posts in '2018/10'

2018.10.30 What to check when Squirrel SQL does not start after installation
2018.10.30 [spark] Spark Structured Streaming + Cassandra integration
2018.10.29 [spark] Example of storing StructType fields together with the raw Row value
2018.10.29 Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
2018.10.25 Checking modules installed with pip
2018.10.25 [spark] Accumulator example for Spark Streaming
2018.10.23 Parsing Firefox cookies - lz4json
2018.10.22 [kafka] enable.auto.commit , auto.commit.interval.ms
2018.10.22 What to do when Google Drive runs out of space
2018.10.22 git - handling the upstream when making a pull request

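What to check when Squirrel SQL does not start after installation

etc tools 2018. 10. 30. 18:08

For background on the Squirrel SQL client tool, see:

https://acadgild.com/blog/squirrel-gui-phoenix

I installed Squirrel SQL on macOS, but it would not start. It is probably a path issue.

The simple workaround I used is as follows.

First, install from the installer jar and let it use the default install location, /Applications/SQuirreLSQL.app/. Then define an alias for the launcher script and copy the libraries and the main jar next to it.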

alias squirrel='/Applications/SQuirreLSQL.app/Contents/MacOS/squirrel-sql.sh'

mkdir -p /Applications/SQuirreLSQL.app/Contents/MacOS/lib/

cp /Applications/SQuirreLSQL.app/Contents/Resources/Java/lib/* /Applications/SQuirreLSQL.app/Contents/MacOS/lib/

cp /Applications/SQuirreLSQL.app/Contents/Resources/Java/squirrel-sql.jar /Applications/SQuirreLSQL.app/Contents/MacOS/

Running squirrel now starts the client tool.

Attribution, NonCommercial, ShareAlike


Posted by '김용환'

[spark] Spark Structured Streaming + Cassandra integration

scala 2018. 10. 30. 16:48

This is example code that saves a Dataset read with Spark's readStream into Cassandra.

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.ForeachWriter

val spark = ...

// readStream is a parameterless method, so it is called without parentheses.
val ds = spark.readStream
...

// The connector is serializable, so the writer below can use it on the executors.
val connector = CassandraConnector(spark.sparkContext.getConf)

def processRow(value: (String, String, String, String)): Unit = {
  // withSessionDo borrows a session and releases it when the block ends.
  // Values are interpolated as-is; escape them if they may contain quotes.
  connector.withSessionDo { session =>
    session.execute(s"insert into test.log(ktag, ts, uuid, log) values('${value._1}', '${value._2}', '${value._3}', '${value._4}')")
  }
}

val writer = new ForeachWriter[(String, String, String, String)] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: (String, String, String, String)): Unit = {
    processRow(value)
  }

  override def close(errorOrNull: Throwable): Unit = {
    // errorOrNull is null when the partition finished without an error.
    if (errorOrNull != null) println(errorOrNull)
  }
}

val query = ds.writeStream.queryName("test").foreach(writer).start()

query.awaitTermination()

Add spark-cassandra-connector to build.sbt.

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.2"


[spark] Example of storing StructType fields together with the raw Row value

scala 2018. 10. 29. 19:39

If you want to see the original JSON string and the parsed (structured) data side by side in Spark, refer to the code below.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // provides the $"..." syntax and the tuple encoder used by .as[...]

val schema = StructType(
  List(
    StructField("year", StringType, nullable = true),
    StructField("month", StringType, nullable = true),
    StructField("day", StringType, nullable = true)
  )
)

// The Kafka source always delivers key and value as bytes, so no deserializer
// options are set here; the value is cast to a string below instead.
val ds = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", config.getString(s"kafka.$phase.brokers"))
  .option("startingOffsets", "latest")
  .option("subscribe", config.getString(s"kafka.$phase.topic.name"))
  .load()
  .selectExpr("CAST(value AS STRING)")
  // Keep the parsed fields (data.*) and the raw JSON string (value) in the same row.
  .select(from_json($"value", schema).as("data"), col("value"))
  .select("data.*", "value")
  .as[(String, String, String, String)]


Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

scala 2018. 10. 29. 19:35

When working with Spark streaming, you will run into the error below often if you do not understand Encoders well.

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

Calling this a plain Serializable issue does not quite capture it.

It is a good prompt to study Spark more closely and to build a better understanding of DataFrame and Dataset. The links below are useful starting points.

https://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Encoder.html

https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html


Checking modules installed with pip

python 2018. 10. 25. 20:21

You can list the packages that were installed with pip by using the freeze command.

# pip freeze

celery==3.1.7

certifi==2018.8.24

...

selenium==3.14.1

six==1.11.0

urllib3==1.22

w3lib==1.19.0

websocket-client==0.51.0

Werkzeug==0.14.1

zope.interface==4.4.3

# pip freeze | grep sel

selenium==3.14.1
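
The same kind of listing can be produced from inside Python. The sketch below is a minimal approximation, assuming Python 3.8 or later (for importlib.metadata); the function name installed_packages and its filter_text parameter are made up for this illustration, and the output is not guaranteed to match pip freeze line for line (editable installs, for example, are shown differently by pip).

```python
from importlib import metadata


def installed_packages(filter_text=""):
    """Return 'name==version' strings, roughly like `pip freeze | grep <text>`."""
    lines = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and filter_text.lower() in name.lower():
            lines.add(f"{name}=={version}")
    # Sort case-insensitively so the list is easy to scan.
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Hypothetical usage: show only packages whose name contains "sel".
    for line in installed_packages("sel"):
        print(line)
```

Calling installed_packages() with no argument returns every distribution visible to the current interpreter, which is useful when several Python environments are installed and it is unclear which one pip points at.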


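[spark] Accumulator example for Spark Streaming

scala 2018. 10. 25. 19:59

When processing a Spark stream, you sometimes need to accumulate a value across partitions, which is what an accumulator is for.

The example below adds up the number of offsets to be processed in each batch. A plain var updated inside foreachPartition is only changed in the executors' copies of the closure when the job runs on a cluster, so the sum is kept in a LongAccumulator, which Spark merges back on the driver.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

def printLag(rdd: RDD[ConsumerRecord[String, String]]): Unit = {
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // A fresh accumulator per batch takes the place of resetting the total to 0.
  val totalLag = rdd.sparkContext.longAccumulator

  rdd.foreachPartition { iter =>
    val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    totalLag.add(o.count())
  }

  println(s"******************total lag : ${totalLag.value}")
}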


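Parsing Firefox cookies - lz4json

Web service 2018. 10. 23. 01:55

Recent Firefox on macOS stores its session data (authentication and cookie information) compressed with lz4. The file is not in the standard lz4 format, so generic lz4 tooling (for example, from Python) cannot open it as-is, but a dedicated tool can read it.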

git clone https://github.com/andikleen/lz4json.git

cd lz4json

make

cp ~/Library/Application\ Support/Firefox/Profiles/*.default/sessionstore.jsonlz4 .

./lz4jsoncat sessionstore.jsonlz4
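
For reference, the container itself is small enough to decode by hand. The sketch below is only an illustration under an assumption that is not stated in this post: that the file consists of the 8-byte magic "mozLz40\0", a 4-byte little-endian uncompressed size, and a single raw LZ4 block. The function names lz4_block_decompress and read_mozlz4 are made up for this example, and the decoder is written for clarity rather than speed.

```python
import json

MAGIC = b"mozLz40\0"  # assumed file signature


def _read_length(src, i, base):
    """Extend a 4-bit length field with the following 255-valued bytes."""
    if base == 15:
        while True:
            b = src[i]
            i += 1
            base += b
            if b != 255:
                break
    return base, i


def lz4_block_decompress(src):
    """Decode one raw LZ4 block (no frame header) with the standard library only."""
    out = bytearray()
    i, n = 0, len(src)
    while i < n:
        token = src[i]
        i += 1
        # Literal run: length is in the high nibble of the token.
        lit_len, i = _read_length(src, i, token >> 4)
        out += src[i:i + lit_len]
        i += lit_len
        if i >= n:
            break  # the last sequence carries literals only
        # Match: 2-byte little-endian offset back into the output so far.
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        start = len(out) - offset
        if offset == 0 or start < 0:
            raise ValueError("invalid LZ4 match offset")
        match_len, i = _read_length(src, i, token & 0x0F)
        match_len += 4  # LZ4 stores match length minus 4
        # Byte-wise copy so that overlapping matches repeat correctly.
        for k in range(match_len):
            out.append(out[start + k])
    return bytes(out)


def read_mozlz4(data):
    """Parse the assumed header, decompress the block, and load the JSON."""
    if data[:8] != MAGIC:
        raise ValueError("missing mozLz40 magic")
    expected = int.from_bytes(data[8:12], "little")
    raw = lz4_block_decompress(data[12:])
    if len(raw) != expected:
        raise ValueError("uncompressed size does not match the header")
    return json.loads(raw.decode("utf-8"))
```

With the file copied as above, the contents could then be loaded with read_mozlz4(open("sessionstore.jsonlz4", "rb").read()). If the header check fails, the layout assumption does not hold for that file and lz4jsoncat remains the safer route.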


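[kafka] enable.auto.commit , auto.commit.interval.ms

kafka 2018. 10. 22. 23:18

A Kafka consumer reads messages from a topic. So that it can pick up where it left off after stopping unexpectedly, the position it has read up to (the offset) is stored. An offset is a piece of metadata: an integer that keeps increasing for each message received in a partition.

Every Kafka message has an offset that is unique within its partition and identifies the message's position in that partition.

Once a consumer has read messages from a partition, Kafka knows the offset of the last message it consumed. Offsets are stored in a topic named __consumer_offsets, so a consumer does not lose track of what it has consumed and can restart from where it stopped.

The following settings control how offsets are stored by default.

enable.auto.commit (default: true)

auto.commit.interval.ms (default: 5000)

In other words, by default the consumer automatically commits offsets to Kafka every 5 seconds. The check runs each time the consumer fetches data from the topic, and once the interval has passed it commits the latest offsets it has received.

To avoid duplicate processing as much as possible, commit message offsets manually.

To do that, set enable.auto.commit to false.

(auto.commit.interval.ms is then ignored.)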
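
The effect of the commit timing on duplicates can be shown with a small model. The sketch below does not use a Kafka client at all: ToyPartition, run_consumer, and duplicated_offsets are hypothetical names, and the "auto commit" case is simplified to a consumer that crashes before its first interval commit has happened.

```python
class ToyPartition:
    """In-memory stand-in for one partition plus its committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset to resume from after a restart


def run_consumer(partition, crash_after, commit_each_message):
    """Process messages starting at the committed offset.

    crash_after: stop abruptly after this many messages (None = never crash).
    Returns the offsets processed during this run.
    """
    processed = []
    position = partition.committed
    while position < len(partition.messages):
        if crash_after is not None and len(processed) == crash_after:
            return processed  # simulated crash: uncommitted progress is lost
        processed.append(position)  # "process" the message at this offset
        position += 1
        if commit_each_message:
            partition.committed = position  # manual commit after processing
    partition.committed = position  # commit on clean shutdown
    return processed


def duplicated_offsets(commit_each_message):
    """Crash after three messages, restart, and report offsets handled twice."""
    partition = ToyPartition(["a", "b", "c", "d", "e"])
    first = run_consumer(partition, 3, commit_each_message)
    second = run_consumer(partition, None, commit_each_message)
    return sorted(set(first) & set(second))


if __name__ == "__main__":
    print("interval commit not reached:", duplicated_offsets(False))
    print("commit after each message: ", duplicated_offsets(True))
```

In this model the consumer that had not yet committed reprocesses offsets 0 to 2 after the restart, while the one that commits after each message resumes at offset 3. A real consumer can still see a duplicate if it crashes between processing a message and committing it, so manual commits reduce duplicates rather than rule them out.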


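What to do when Google Drive runs out of space

scribbling 2018. 10. 22. 14:29

When Google Drive storage runs out (and once it is full, email stops going through as well):

delete mail with attachments you no longer need,

check for large duplicate files and delete them,

then open the URL below and delete the leftover data that is not in any folder.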

https://drive.google.com/drive/search?q=is:unorganized%20owner:me


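git - handling the upstream when making a pull request

etc tools 2018. 10. 22. 10:50

A situation that comes up often when making pull requests:

a new branch has been added on the upstream remote and you want to base a pull request on it. If you simply fetch the branch and open a PR, the history gets tangled.

Instead, connect the remote that the PR targets (named real here), check the branch out from it with git checkout, and push it to your own repository.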

$ git checkout -b dev real/dev

$ git push origin dev

Depending on the situation, you may need to set the upstream manually.

$ git branch --set-upstream-to origin/dev
