Elasticsearch

[elasticsearch5] elasticsearch for hadoop (Elasticsearch integration with Hadoop)

'김용환' 2017. 8. 23. 15:55


Elasticsearch and Hadoop can be used together through elasticsearch for hadoop.


https://www.elastic.co/guide/en/elasticsearch/hadoop/current/reference.html



The requirements are listed below. One notable point is that backward compatibility is maintained for Elasticsearch 1.x through 5.5.


https://www.elastic.co/guide/en/elasticsearch/hadoop/5.5/requirements.html





The main points for Hive are covered here.

https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html



A Hive table can be created along the following lines.


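-- Hive table backed by the Elasticsearch resource radio/artists.
-- With 'es.index.auto.create' = 'false' the index must already exist.
CREATE EXTERNAL TABLE IF NOT EXISTS artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false');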
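
To make the elided column list concrete, here is a minimal sketch of a full round trip. The table name artists_example, the columns id and name, and the source table are assumptions for illustration only and do not come from the original post.

```sql
-- Hypothetical columns; replace them with the fields of your own documents.
CREATE EXTERNAL TABLE IF NOT EXISTS artists_example (
    id   BIGINT,
    name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false');

-- Writing into the Hive table indexes the rows into Elasticsearch.
-- 'source' is an assumed, already existing Hive table with matching columns.
INSERT OVERWRITE TABLE artists_example
SELECT s.id, s.name
FROM source s;
```

Because the table is EXTERNAL, dropping it in Hive should only remove the Hive metadata and leave the Elasticsearch index alone.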




The main settings for the Hive table are as follows.


https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html




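1. 'es.resource' = '<index>/<type>'

The Elasticsearch resource (index and type) that data is read from and written to, for example 'radio/artists'.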


2. 'es.input.json' = 'yes'

Allows the input to be passed as JSON. The default is false. Note that the documentation example sets the value to 'yes' rather than 'true'.


es.input.json (default false)
Whether the input is already in JSON format or not (the default). Please see the appropriate section of each integration for more details about using JSON directly.
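
As a rough sketch of how JSON input might look in Hive: the table has a single string column that holds a complete JSON document per row. The names artists_json, data, raw_json_source and json_line are assumptions made up for this example.

```sql
-- One STRING column; each row is expected to be a ready-made JSON document.
CREATE EXTERNAL TABLE IF NOT EXISTS artists_json (data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.input.json' = 'yes');

-- 'raw_json_source' is an assumed Hive table with one JSON string per row.
INSERT OVERWRITE TABLE artists_json
SELECT json_line
FROM raw_json_source;
```

With this setting the connector passes the string through as the document body instead of serializing Hive columns itself, so each row needs to be valid JSON.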



3. 'es.nodes'='${target_es_dns}'

Elasticsearch usually runs as a cluster, so es.nodes names the node(s) to connect to, and data is written to the cluster through them.



es.nodes (default localhost)
List of Elasticsearch nodes to connect to. When using Elasticsearch remotely, do set this option. Note that the list does not have to contain every node inside the Elasticsearch cluster; these are discovered automatically by elasticsearch-hadoop by default (see below). Each node can also have its HTTP/REST port specified individually (e.g. mynode:9600).
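
A small sketch of the setting with more than one node and an explicit port. The host names below are placeholders, not real servers, and the table name and columns are also assumptions.

```sql
-- Comma-separated list; each entry may carry its own HTTP/REST port.
-- The remaining cluster nodes are discovered automatically by default.
CREATE EXTERNAL TABLE IF NOT EXISTS artists_remote (
    id   BIGINT,
    name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.nodes' = 'es-node1.example.com:9200,es-node2.example.com:9200');
```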


4. 'es.mapping.id' = 'did'


The document field/property to use as the document ID (here a field named did).

es.mapping.id (default none)
The document field/property name containing the document id.
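
For example, if the table has a did column, a definition along these lines would use its value as the Elasticsearch document ID. The table name artists_by_id and the name column are assumptions for illustration.

```sql
-- The value of the 'did' column becomes the _id of each indexed document.
CREATE EXTERNAL TABLE IF NOT EXISTS artists_by_id (
    did  STRING,
    name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.mapping.id' = 'did');
```

Writing the same did twice then overwrites the existing document rather than creating a duplicate with an auto-generated ID.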


5. 'es.query' = '?q=me*'

Only the documents that match the query are read from es.resource, so the result of a query can be stored as well.
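
A sketch of how that might be used: one table reads through the query, and its rows are copied into an ordinary Hive table. The names artists_me and artists_me_copy and the columns are assumptions.

```sql
-- Reading from this table returns only the documents matching ?q=me*.
CREATE EXTERNAL TABLE IF NOT EXISTS artists_me (
    id   BIGINT,
    name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.query' = '?q=me*');

-- Store the query result in a plain Hive table.
CREATE TABLE IF NOT EXISTS artists_me_copy AS
SELECT id, name
FROM artists_me;
```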



6. Batch-related settings

'es.batch.write.refresh'='false'

'es.batch.size.bytes'='10mb'

'es.batch.size.entries'='0'




es.batch.size.bytes (default 1mb)
Size (in bytes) for batch writes using Elasticsearch bulk API. Note the bulk size is allocated per task instance. Always multiply by the number of tasks within a Hadoop job to get the total bulk size at runtime hitting Elasticsearch.
es.batch.size.entries (default 1000)
Size (in entries) for batch writes using Elasticsearch bulk API - (0 disables it). Companion to es.batch.size.bytes, once one matches, the batch update is executed. Similar to the size, this setting is per task instance; it gets multiplied at runtime by the total number of Hadoop tasks running.
es.batch.write.refresh (default true)
Whether to invoke an index refresh or not after a bulk update has been completed. Note this is called only after the entire write (meaning multiple bulk updates) have been executed.
es.batch.write.retry.count (default 3)
Number of retries for a given batch in case Elasticsearch is overloaded and data is rejected. Note that only the rejected data is retried. If there is still data rejected after the retries have been performed, the Hadoop job is cancelled (and fails). A negative value indicates infinite retries; be careful in setting this value as it can have unwanted side effects.
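
Putting the batch settings above into one table definition could look like the sketch below. The table name artists_bulk, the columns, and the task count in the comment are assumptions chosen only to show the arithmetic.

```sql
-- es.batch.size.bytes = 10mb   : flush a bulk request once 10mb is buffered per task
-- es.batch.size.entries = 0    : disable the entry-count trigger, so only the byte size applies
-- es.batch.write.refresh = false : skip the index refresh after the write completes
-- es.batch.write.retry.count = 3 : the documented default, shown for completeness
-- Sizing example: with 20 concurrent tasks (an assumed number),
-- up to 20 x 10mb = 200mb of bulk data may hit Elasticsearch at once.
CREATE EXTERNAL TABLE IF NOT EXISTS artists_bulk (
    id   BIGINT,
    name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.batch.size.bytes' = '10mb',
              'es.batch.size.entries' = '0',
              'es.batch.write.refresh' = 'false',
              'es.batch.write.retry.count' = '3');
```

Since the limits are per task instance, it is probably worth checking the total against what the cluster can absorb before raising the byte size further.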