Posts from 2017/07

[elasticsearch5] Hot threads API

Elasticsearch 2017. 7. 31. 19:08

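The hot threads API returns its information as formatted text. In other words, the response is not a JSON structure.

Before describing the response structure itself, here is a short description of the logic that generates the hot threads API response.

Elasticsearch first gets all running threads and collects various information about each one: the CPU time it consumed, the number of times it was blocked or in a waiting state, how long it was blocked or waiting, and so on.

It then waits for a certain time (specified by the interval parameter) and collects the same information again.

Once this is done, the threads are sorted by how long each one was running, in descending order, so that the thread that ran the longest is at the top of the list.

(The time mentioned above is measured according to the operation type given in the type parameter.)

Elasticsearch then analyzes the first N threads (N is the thread count given by the threads parameter).

Every few milliseconds, Elasticsearch takes a snapshot of the stack trace of each thread selected in the previous step (the number of snapshots is given by the snapshots parameter).

The last step is to group the stack traces, so that changes in thread state can be visualized, and return the response to the caller.

By default, threads is 3, interval is 500ms, type is cpu, and snapshots is 10.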
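The sampling and ranking steps above can be sketched in a few lines of Python. This is only an illustration of the idea (sample twice, take the difference, sort, keep the top N), not the Elasticsearch implementation. The ThreadSample class, the hottest_threads function, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadSample:
    """Cumulative counters for one thread at one point in time (hypothetical shape)."""
    cpu_ms: float
    wait_ms: float
    block_ms: float


def hottest_threads(first, second, type_="cpu", threads=3):
    """Rank threads by how much the chosen counter grew between two samples.

    first, second: dicts mapping thread name -> ThreadSample, taken `interval` apart.
    Returns up to `threads` (name, delta_ms) pairs, busiest first.
    """
    field = {"cpu": "cpu_ms", "wait": "wait_ms", "block": "block_ms"}[type_]
    deltas = []
    for name, after in second.items():
        before = first.get(name)
        if before is None:
            continue  # thread appeared during the interval; ignored in this sketch
        deltas.append((name, getattr(after, field) - getattr(before, field)))
    # Descending order: the thread that used the most time comes first.
    deltas.sort(key=lambda pair: pair[1], reverse=True)
    return deltas[:threads]


# Two made-up samples taken one interval apart.
first = {
    "search[T#2]": ThreadSample(cpu_ms=100, wait_ms=10, block_ms=0),
    "merge[T#1]": ThreadSample(cpu_ms=50, wait_ms=0, block_ms=0),
    "index[T#1]": ThreadSample(cpu_ms=10, wait_ms=5, block_ms=0),
}
second = {
    "search[T#2]": ThreadSample(cpu_ms=400, wait_ms=10, block_ms=0),
    "merge[T#1]": ThreadSample(cpu_ms=100, wait_ms=0, block_ms=0),
    "index[T#1]": ThreadSample(cpu_ms=20, wait_ms=505, block_ms=0),
}
print(hottest_threads(first, second, type_="cpu", threads=2))
```

Changing type_ to "wait" ranks the same threads by time spent waiting instead, which is what the type parameter controls in the description above.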

A simple example looks like this.

$ curl 'localhost:9200/_nodes/hot_threads?type=wait&interval=1s'

::: {5OEGj_a}{5OEGj_avT8un0nOak28qQg}{DAzM0ktKQNS047ggd9nYZQ}{127.0.0.1}{127.0.0.1:9300}
   Hot threads at 2017-07-31T11:04:59.943Z, interval=1s, busiestThreads=3, ignoreIdleThreads=true:

    8.4% (35.1ms out of 1000ms) cpu usage by thread 'elasticsearch[kemi][search][T#2]'
     10/10 snapshots sharing following 8 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
       java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
       org.elasticsearch.bootstrap.Bootstrap$1.run(Bootstrap.java:84)
       java.lang.Thread.run(Thread.java:745)

....

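Looking at the first part of the result:

It shows which node returned the hot threads information, which is handy when the hot threads API call is sent to many nodes.

The second part is

8.4% (35.1ms out of 1000ms) cpu usage by thread 'elasticsearch[kemi][search][T#2]'

This shows that the thread accounted for 8.4% of all CPU time at the point the measurement completed.

The cpu usage part indicates that the type in use is cpu (the other values you can expect here are block usage for threads in the blocked state and wait usage for threads in the waiting state). The thread name is very important here.

From the thread name we can tell that this Elasticsearch thread is the hottest one, and that the hot thread in this example is a search thread (the search value).

Other values you may see include recovery_stream (recovery module events), cache (cache events), merge (segment merges), and index (data indexing threads).

See the following code for details.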

https://github.com/elastic/elasticsearch/blob/v5.2.1/core/src/main/java/org/elasticsearch/action/admin/cluster/node/hotthreads/NodesHotThreadsRequest.java

public class NodesHotThreadsRequest extends BaseNodesRequest<NodesHotThreadsRequest> {

    int threads = 3;
    String type = "cpu";
    TimeValue interval = new TimeValue(500, TimeUnit.MILLISECONDS);
    int snapshots = 10;
    boolean ignoreIdleThreads = true;

    // for serialization
    public NodesHotThreadsRequest() {
    }

    /**
     * Get hot threads from nodes based on the nodes ids specified. If none are passed, hot
     * threads for all nodes is used.
     */
    public NodesHotThreadsRequest(String... nodesIds) {
        super(nodesIds);
    }

    public int threads() {
        return this.threads;
    }

    public NodesHotThreadsRequest threads(int threads) {
        this.threads = threads;
        return this;
    }

    public boolean ignoreIdleThreads() {
        return this.ignoreIdleThreads;
    }

    public NodesHotThreadsRequest ignoreIdleThreads(boolean ignoreIdleThreads) {
        this.ignoreIdleThreads = ignoreIdleThreads;
        return this;
    }

    public NodesHotThreadsRequest type(String type) {
        this.type = type;
        return this;
    }

    public String type() {
        return this.type;
    }

    public NodesHotThreadsRequest interval(TimeValue interval) {
        this.interval = interval;
        return this;
    }

    public TimeValue interval() {
        return this.interval;
    }

    public int snapshots() {
        return this.snapshots;
    }

    public NodesHotThreadsRequest snapshots(int snapshots) {
        this.snapshots = snapshots;
        return this;
    }

    @Override
    public void readFrom(StreamInput in) throws IOException {
        super.readFrom(in);
        threads = in.readInt();
        ignoreIdleThreads = in.readBoolean();
        type = in.readString();
        interval = new TimeValue(in);
        snapshots = in.readInt();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        super.writeTo(out);
        out.writeInt(threads);
        out.writeBoolean(ignoreIdleThreads);
        out.writeString(type);
        interval.writeTo(out);
        out.writeInt(snapshots);
    }
}


Posted by '김용환'


[elasticsearch5] Similarity models in Lucene 6.0 / configuring a similarity model in Elasticsearch

Elasticsearch 2017. 7. 30. 09:03

Before Lucene 6.0 the default similarity model was TF-IDF; as of Lucene 6.0 the default is BM25.

Elasticsearch 5.0 is built on Lucene 6, so its default similarity model changed to BM25 as well.

Besides BM25, the following similarity models are available.

* TF-IDF (the old way) : Based on the TF-IDF model; it was the default similarity model before Elasticsearch 5.0. To use it in Elasticsearch, use the name classic.

* DFR (divergence from randomness) : Based on the probabilistic model of the same name. To use it in Elasticsearch, use the name DFR. The divergence from randomness model is known to work well on data similar to natural-language text.

* DFI (divergence from independence) : Based on the probabilistic model of the same name. To use it in Elasticsearch, use the name DFI.

Reference

http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf

* IB (information-based) : Very similar to the model used by DFR. To use it in Elasticsearch, use the name IB. Like DFR, the information-based model is known to perform well on data similar to natural-language text.

* LM Dirichlet : Uses Bayesian smoothing with Dirichlet priors. To use it, use the name LMDirichlet.

Reference

https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html

* LM Jelinek Mercer : Based on the Jelinek Mercer smoothing method. To use it, use the name LMJelinekMercer.

Reference

https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html

In Elasticsearch you can configure a similarity model together with its parameters and then use it in a mapping.

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html

Examples:

* IB model

"similarity" : {
  "esserverbook_ib_similarity" : {
    "type" : "IB",
    "distribution" : "ll",
    "lambda" : "df",
    "normalization" : "z",
    "normalization.z.z" : "0.25"
  }
}

* LM Dirichlet model

"similarity" : {
  "esserverbook_lm_dirichlet_similarity" : {
    "type" : "LMDirichlet",
    "mu" : "1000"
  }
}

* LM Jelinek Mercer model

"similarity" : {
  "esserverbook_lm_jelinek_mercer_similarity" : {
    "type" : "LMJelinekMercer",
    "lambda" : "0.7"
  }
}
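The fragments above only show the similarity section. As a rough sketch of where such a fragment sits in a complete index-creation body for Elasticsearch 5.x, the following Python builds the body as a dict. The helper name index_body_with_similarity, the mapping type doc, and the field title are made up for illustration, and the layout assumes the 5.x mapping format, which still has a type level.

```python
import json


def index_body_with_similarity(similarity_name, similarity_settings, field_name, doc_type="doc"):
    """Build an index-creation body that defines a similarity and applies it to one text field."""
    return {
        "settings": {
            "index": {
                # Custom similarities are declared under index settings ...
                "similarity": {similarity_name: similarity_settings}
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    # ... and referenced by name from the field mapping.
                    field_name: {"type": "text", "similarity": similarity_name}
                }
            }
        },
    }


body = index_body_with_similarity(
    "esserverbook_lm_dirichlet_similarity",
    {"type": "LMDirichlet", "mu": "1000"},
    field_name="title",  # hypothetical field
)
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of an index-creation request. Fields that do not name a similarity keep the default (BM25 in 5.0).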


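[elasticsearch5] Three smoothing models available for the phrase suggester

Elasticsearch 2017. 7. 29. 09:22

This post describes the three smoothing models available for the Elasticsearch phrase suggester. I found the topic hard, so I wrote it down.

The Elasticsearch documentation below covers it briefly.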

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html#_smoothing_models

Smoothing Models

The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) are not existing in the index) and frequent grams (appear at least once in the index).

stupid_backoff

a simple backoff model that backs off to lower order n-gram models if the higher order count is 0 and discounts the lower order n-gram model by a constant factor. The default discount is 0.4. Stupid Backoff is the default model.

laplace

a smoothing model that uses an additive smoothing where a constant (typically 1.0 or smaller) is added to all counts to balance weights, The default alpha is 0.5.

linear_interpolation

a smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear Interpolation doesn’t have any default values. All parameters (trigram_lambda, bigram_lambda, unigram_lambda) must be supplied.

Stupid backoff is the default smoothing model used by the Elasticsearch phrase suggester. To change its settings or to select it explicitly, use the name stupid_backoff. The model backs off to a lower-order n-gram when the count of the higher-order n-gram is 0 (and it discounts the lower-order model by the value of the discount property). As an example, assume that the bigram ab and the unigram c are common and exist in the index used by the suggester, but the trigram abc does not. Because abc does not exist, stupid backoff falls back to the ab bigram, and the ab bigram model is discounted by the value of the discount property.

The stupid backoff model has a single configurable property, discount. It defaults to 0.4 and is used as the discount factor for the lower-order n-gram model.
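To make the back-off rule concrete, here is a minimal Python sketch of stupid backoff scoring in its commonly published form: use the relative frequency of the n-gram if it was seen, otherwise multiply the score of the shorter n-gram by discount. It is not the Lucene code behind the phrase suggester, and the function names and the toy token list are assumptions for illustration. Note that this formulation shortens the n-gram by dropping its first token.

```python
from collections import Counter


def count_ngrams(tokens, max_order=3):
    """Count every n-gram (as a tuple) of order 1..max_order in the token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff_score(ngram, counts, total_unigrams, discount=0.4):
    """Score the last token of `ngram` given the tokens before it."""
    ngram = tuple(ngram)
    if len(ngram) == 1:
        return counts[ngram] / total_unigrams
    if counts[ngram] > 0:
        # Seen n-gram: relative frequency with respect to its context.
        return counts[ngram] / counts[ngram[:-1]]
    # Unseen n-gram: back off to the shorter n-gram and apply the discount.
    return discount * stupid_backoff_score(ngram[1:], counts, total_unigrams, discount)


tokens = ["a", "b", "c", "a", "b", "d"]  # toy corpus
counts = count_ngrams(tokens)
total = len(tokens)

print(stupid_backoff_score(("a", "b", "c"), counts, total))  # seen trigram
print(stupid_backoff_score(("c", "a", "d"), counts, total))  # backs off twice
```

In the second call neither the trigram nor the bigram exists, so the unigram frequency of d is discounted twice. Each back-off step costs one factor of discount, which is why unseen phrases still get a small non-zero score.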

For more about n-gram smoothing models, see http://en.wikipedia.org/wiki/N-gram#Smoothing_techniques and http://en.wikipedia.org/wiki/Katz's_back-off_model (which is similar to the stupid backoff model described here).

Laplace is also known as additive smoothing. When it is used (select it with the value laplace), a constant equal to the alpha parameter (default 0.5) is added to the counts to balance the weights of frequent and infrequent n-grams. Typical values of alpha are 1.0 or smaller.
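Additive smoothing is easy to work through by hand. The sketch below uses the textbook formula (count + alpha) / (total + alpha * vocabulary size), which is an assumption about the general technique and not a copy of the suggester's scorer. The function name and the counts are invented.

```python
def laplace_probability(count, total, vocabulary_size, alpha=0.5):
    """Additive smoothing: every item gets `alpha` added to its count."""
    return (count + alpha) / (total + alpha * vocabulary_size)


# Hypothetical counts for a vocabulary of four grams, two of them never seen.
gram_counts = {"ab": 3, "bc": 7, "cd": 0, "de": 0}
total = sum(gram_counts.values())

probabilities = {
    gram: laplace_probability(count, total, len(gram_counts))
    for gram, count in gram_counts.items()
}
for gram, probability in probabilities.items():
    print(gram, round(probability, 4))
```

The unseen grams end up with a small non-zero probability, and the probabilities still sum to 1. A larger alpha moves more weight from frequent grams to infrequent ones.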

For more about additive smoothing, see http://en.wikipedia.org/wiki/Additive_smoothing.


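Article 99-2 of the Restriction of Special Taxation Act (real estate purchased in 2013)

Real estate 2017. 7. 29. 09:01

In 2013 the real estate market was weak, so Article 99-2 of the Restriction of Special Taxation Act was passed.

It contains an unusual provision: for anyone who meets the conditions in paragraph 1, capital gains tax is exempted for five years, and in addition the house is not counted as a house owned by the resident.

This reportedly applies to sales contracts that were confirmed and stamped by the local government. That is, the special tax treatment can be applied only if the contract is submitted to the head of the tax office with jurisdiction over the place of tax payment.

Also, under paragraph 2, a person who bought real estate in 2013 under the above conditions can be treated as not owning a house for at least five years (or at most for life).

This part probably needs to be confirmed with the National Tax Service later.

○ Restriction of Special Taxation Act, Article 99-2 [Special taxation of capital gains tax for persons acquiring newly built houses, etc.]

① Where a resident or non-resident acquires a newly built house, an unsold house, or a house of a one-household one-house owner, as prescribed by Presidential Decree, whose acquisition value is 600 million won or less or whose total floor area (exclusive-use area in the case of multi-unit housing) is 85 square meters or less, by concluding the first sales contract between April 1, 2013 and December 31, 2013 with a person prescribed by Presidential Decree, such as a project owner supplying housing under Article 38 of the Housing Act, and acquiring the house under that contract (including cases where the sales contract is concluded and the down payment is paid by December 31, 2013), the capital gains tax on income arising from transferring the house within five years of the acquisition date is reduced by an amount equal to 100/100 of the tax. Where the house is transferred more than five years after the acquisition date, the capital gains accrued during the five years from the acquisition date are deducted from the taxable capital gains of the house. If the deducted amount exceeds the taxable income amount, the excess is deemed not to exist.

② In applying Article 89 (1) 3 of the Income Tax Act, a house to which paragraph 1 applies is not regarded as a house owned by the resident concerned.

Article 89 (1) 3 of the Income Tax Act reads as follows.

3. Income arising from the transfer of any of the following houses (excluding high-priced houses whose value exceeds the standard prescribed by Presidential Decree) and the land attached to it, within the area calculated by multiplying the area on which the building stands by the ratio prescribed by Presidential Decree for each region (hereinafter in this Article, "land attached to a house")

(a) A house that meets the requirements prescribed by Presidential Decree, where one household owns one house

(b) A house prescribed by Presidential Decree, where one household comes to own two or more houses because it acquired a replacement house before transferring its one house, or because of inheritance, living together to support parents, marriage, and the like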


[scala] Handling exceptions in scalatest

scala 2017. 7. 27. 20:20

This is an example of testing for an exception in scalatest.

Write the test in the following form.

intercept[Exception] {

method call

}

import org.junit.runner.RunWith
import org.scalatest.FunSuite
import org.scalatest.junit.JUnitRunner

@RunWith(classOf[JUnitRunner])
class AppSuite extends FunSuite {

  // A non-null argument is accepted and returns 0.
  test("test") {
    assert(checkParam("a") == 0)
  }

  // intercept passes only if the block throws IllegalArgumentException.
  test("null test") {
    intercept[IllegalArgumentException] {
      checkParam(null)
    }
  }

  def checkParam(param: String): Int = param match {
    case null => throw new IllegalArgumentException("None is illegal.")
    case _ => 0
  }

}


[scala] scalablitz

scala 2017. 7. 27. 19:49

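While taking the Scala course on Coursera I came across traces of scalablitz (in the monoid explanation), so I looked it up.

The parallel package was added in Scala 2.9 (2011).

There was also a third-party library, scalablitz (http://scala-blitz.github.io/), but it stopped being maintained around 2014. It is now a library that has faded into history...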

Parallel Collections were originally introduced into Scala in release 2.9. Why another data-parallel collections framework? While they provided programmers with seamless data-parallelism and an easy way to parallelize their computations, they had several downsides. First, the generic library-based approach in Scala Parallel Collections had some abstraction overheads that made them unsuitable for certain types of computations involving number crunching or linear algebra. To make efficient use of parallelism, overheads like boxing or use of iterators have to be eliminated. Second, pure task-based preemptive scheduling used in Scala Parallel Collections does not handle certain kinds of irregular data-parallel operations well. The data-parallel operations in this framework are provided for a wide range of collections, and they greatly reduce both of these overheads.

libraryDependencies += "com.github.scala-blitz" %% "scala-blitz" % "1.1"

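The scalablitz parallel collections live in the scala.collection.par package. As with the standard Scala parallel collections, you call a method on a regular collection, here toPar, and it returns a parallel object.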

import scala.collection.par._
import scala.collection.par.Scheduler.Implicits.global

def mean(a: Array[Int]): Int = {
  val sum = a.toPar.reduce(_ + _)
  sum / a.length
}

val m = mean(Array(1, 3, 5))
print(m)

The result is 3.

When I went on to write more examples, they clashed with existing Scala code, and it became awkward to keep testing.

Error:(25, 5) reference to text is ambiguous;

it is both defined in method totalLength and imported subsequently by

import scala._

The SlideShare deck seems helpful for getting a taste of scalablitz. The generics do feel cleaner.

(Scala already feels more complicated the more I study it...)

ScalaBlitz from Aleksandar Prokopec

For more, see the link below.

http://apprize.info/programming/scala/7.html


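[hive] Allocating memory to reducers

hadoop 2017. 7. 27. 18:24

When running a Hive query, there are times when the reducers need more memory to work with.

In that case, set hive.exec.reducers.bytes.per.reducer.

It sets the size per reducer, and the number of reducers is derived from that size. Strictly speaking, the value is the amount of input data handled per reducer, not a heap size, so a smaller value spreads the same input over more reducers. Since Hive 0.14.0 the default is 256MB, so an input of 1GB results in 4 reducers.

The https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties document says the following.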

hive.exec.reducers.bytes.per.reducer

Default Value: 1,000,000,000 prior to Hive 0.14.0; 256 MB (256,000,000) in Hive 0.14.0 and later
Added In: Hive 0.2.0; default changed in 0.14.0 with HIVE-7158 (and HIVE-7917)

Size per reducer. The default in Hive 0.14.0 and earlier is 1 GB, that is, if the input size is 10 GB then 10 reducers will be used. In Hive 0.14.0 and later the default is 256 MB, that is, if the input size is 1 GB then 4 reducers will be used.

Before Hive 0.14.0 the default was 1GB; from 0.14.0 on it is 256MB. 256MB is said to be the HDFS block size that performs best (see https://stackoverflow.com/questions/34419869/how-to-set-data-block-size-in-hadoop-is-it-advantage-to-change-it).

In addition, you can change hive.exec.reducers.max so that too many reducers are not used.

hive.exec.reducers.max

Default Value: 999 prior to Hive 0.14.0; 1009 in Hive 0.14.0 and later
Added In: Hive 0.2.0; default changed in 0.14.0 with HIVE-7158 (and HIVE-7917)

Maximum number of reducers that will be used. If the one specified in the configuration property mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.

You can also define the default number of reducers per job.

mapred.reduce.tasks

Default Value: -1
Added In: Hive 0.1.0

The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.
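How the three properties interact can be sketched as simple arithmetic. The function below is a simplified model built only from the descriptions quoted above. It is not Hive's actual estimation code, and the function name and the rule of using at least one reducer are assumptions.

```python
def estimate_reducers(input_bytes,
                      bytes_per_reducer=256_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009,              # hive.exec.reducers.max
                      mapred_reduce_tasks=-1):        # mapred.reduce.tasks
    """Simplified model of how Hive picks the number of reducers."""
    if mapred_reduce_tasks >= 0:
        # An explicit reducer count wins over the automatic estimate.
        return mapred_reduce_tasks
    # Ceiling division: one reducer per `bytes_per_reducer` of input.
    estimated = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    # Assumption: never fewer than 1, never more than the configured maximum.
    return max(1, min(max_reducers, estimated))


print(estimate_reducers(1_000_000_000))                                    # 1 GB, 0.14.0+ defaults
print(estimate_reducers(10_000_000_000, bytes_per_reducer=1_000_000_000))  # 10 GB, old default
print(estimate_reducers(10**13))                                           # capped by the maximum
```

With the 0.14.0 defaults a 1 GB input gives 4 reducers, and with the old 1 GB setting a 10 GB input gives 10, matching the numbers in the quoted documentation. Halving bytes_per_reducer roughly doubles the reducer count until hive.exec.reducers.max is reached.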


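[elasticsearch5] Aggregation performance improvements

Elasticsearch 2017. 7. 26. 17:31

This post shares what the Elasticsearch blog says about the aggregation improvements in Elasticsearch 5.0.

Source : https://www.elastic.co/blog/the-great-query-refactoring-thou-shalt-only-parse-once

https://github.com/elastic/elasticsearch/issues/10217

Aggregations were very expensive from the earliest versions of Elasticsearch and consumed the most memory. Elasticsearch 1.4 added a new feature called the shard query cache, which was later renamed the shard request cache. It works as follows. When a search request runs against one or more indices, each shard involved executes the search locally and returns its local result to the coordinating node. The coordinating node merges the local results of the shards into a global result set. The shard request cache module caches the local result of each shard so that the result of a search request can be returned immediately.

Until Elasticsearch 5.0 the feature was disabled by default because of two obvious problems. First, JSON ordering is not deterministic, so two requests can be logically identical and still differ when rendered as JSON strings. Because the shard cache key was based on the whole JSON string, identical queries could fail to benefit from the cache.

Second, most user queries are time-based, in particular relative to the current time, so subsequent requests tend to have slightly different time ranges. Enabling the cache would therefore mostly waste memory, because cache hits would rarely occur.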
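The first problem is easy to reproduce outside Elasticsearch. The Python sketch below hashes two request bodies that mean the same thing but list their keys in a different order. A key built from the raw string differs, while a key built from a canonical re-serialization matches. This only illustrates the problem: the function names and the sample requests are invented, and Elasticsearch 5.0 addresses it through the query refactoring described next, not by sorting JSON keys.

```python
import hashlib
import json


def raw_cache_key(request_json):
    """Key derived from the request exactly as the client wrote it."""
    return hashlib.sha256(request_json.encode("utf-8")).hexdigest()


def canonical_cache_key(request_json):
    """Key derived from a normalized form: parsed, then re-serialized with sorted keys."""
    parsed = json.loads(request_json)
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two logically identical requests with different key order.
request_a = '{"size": 0, "aggs": {"by_year": {"terms": {"field": "year"}}}}'
request_b = '{"aggs": {"by_year": {"terms": {"field": "year"}}}, "size": 0}'

print(raw_cache_key(request_a) == raw_cache_key(request_b))              # False: cache miss
print(canonical_cache_key(request_a) == canonical_cache_key(request_b))  # True: cache hit
```

Normalizing the request does nothing for the second problem: a range relative to the current time changes the request itself on every call, whatever form the key takes.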

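Over the past few years, however, the Elasticsearch developers put a lot of effort into working around these issues, so that shard-level caching could make aggregations near-instant and be enabled by default. This became possible through a major refactoring of queries in the search execution code.

Before 5.0, each node received the original search request in JSON, parsed the query, and used the information available on the shard (such as mappings) to build the Lucene query that was actually executed as part of the query phase.

In 5.0 this overhead was removed completely. Query parsing now happens only on the coordinating node that receives the request, which converts the search request into a serializable intermediate format (an intermediate query object that every node understands) regardless of the available mappings. The intermediate query object is then parsed on every node and converted into an actual Lucene query based on the mappings and information present on the shard. Caching is covered in more detail in Chapter 7, 'Low-Level Index Control'.

In Elasticsearch 5.0 the shard request cache is enabled by default for all requests with "size": 0. It is most useful for analytics use cases, where the user does not want documents returned in the response and instead uses aggregation results to get an overview of the data.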


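[elasticsearch5] History of Elasticsearch scripting

Elasticsearch 2017. 7. 25. 20:54

# Birth of Elasticsearch

MVEL was supported at first

https://github.com/mvel/mvel

(support did not last long)

# Elasticsearch 1.4

MVEL was replaced by Groovy

# Elasticsearch 1.5

MVEL was finally removed as of version 1.5.

# Elasticsearch 5.0

As of Elasticsearch 5.0 the Groovy scripting language is deprecated and will be removed in a later release. Groovy was replaced by a new language, Painless. You can keep using Groovy, but you have to enable dynamic scripting in elasticsearch.yml. No extra configuration is needed to use Painless.

Elasticsearch used to support language plugins for Groovy, Python, and JavaScript. As of Elasticsearch 5.0.0 those language plugins are deprecated and replaced by Painless.

Besides Painless, Elasticsearch supports the following scripting languages.

* Lucene expressions, mainly used for fast custom ranking and sorting

* mustache, used for search templates

* Java (by writing a custom plugin)

- Painless examples

* It looks similar to Groovy.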

// Variable declarations
def sum = 0;
def listOfValues = [0, 1, 2, 3];

// for loop
def sum = 0;
for (def i = 0; i < 10; i++) {
  sum += i;
}

// for-in loop
def sum = 0;
for (i in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) {
  sum += i;
}

// while loop
def i = 2;
def sum = 0;
while (i > 0) {
  sum = sum + i;
  i--;
}

// if / else on a document field
def year = doc["year"].value;
if (year < 1800) {
  return 1.0;
} else if (year < 1900) {
  return 2.0;
} else {
  return year - 1000;
}


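[redis] Lua use cases

Redis 2017. 7. 25. 16:28

You can use Lua in Redis. When is it worth using?

In a plain application-to-Redis setup there may seem to be no particular need for it,

but it helps when you want to minimize transmission time (latency) and reduce network bandwidth.

It seems best suited to cases where a single call can replace several, or even dozens of, round trips and so save communication cost.

The Redis materials include a relevant Lua example.

The code is at https://github.com/RedisLabs/geo.lua.

Suppose you have to read data by issuing the same command several times: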

geodist key elem1 elem2

geodist key elem3 elem4

geodist key elem5 elem6

With Lua this becomes a single call: geopathlen key elem1 elem2 elem3 elem4 elem5 elem6.

Lua can also help when you want a random element from a list.

Beyond that, transaction code such as WATCH/MULTI/DISCARD/EXEC can be handled easily with Lua.

Below is the Lua use-case presentation from Redis Labs.

Redis: Lua scripts - a primer and use cases from Redis Labs


43 posts in '2017/07'
