Starting with Elasticsearch 6.x, APM is reportedly available as part of X-Pack.


https://www.elastic.co/blog/elastic-apm-ga-released?elektra=home&storm=banner



 



Posted by '김용환'



If you want pagination in elasticsearch, use search_after.


https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html


A simple example from the URL above looks like this.


GET twitter/tweet/_search

{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"_id": "desc"}
    ]
}






search_after can also be used across indices that are split by date.


The following example sorts on two fields and paginates from two specific values so that search_after can be used.




curl -XGET 'http://search.google.io:9200/plus-*/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "version": true,
   "query": {
      "bool": {
         "must": [
            { "match_all": {} },
            {
               "range": {
                  "@timestamp": {
                     "gte": "2018-01-31T16:52:07+09:00",
                     "lte": "2018-01-31T17:52:07+09:00"
                  }
               }
            }
         ]
      }
   },
   "size": 11,
   "sort": [
      { "@timestamp": { "order": "desc" } },
      { "_uuid.keyword": { "order": "asc" } }
   ]
}
'



curl -XGET 'http://search.google.io:9200/plus-*/_search?pretty' -H 'Content-Type: application/json' -d'
{
   "version": true,
   "query": {
      "bool": {
         "must": [
            { "match_all": {} },
            {
               "range": {
                  "@timestamp": {
                     "gte": "2018-01-31T16:52:07+09:00",
                     "lte": "2018-01-31T17:52:07+09:00"
                  }
               }
            }
         ]
      }
   },
   "size": 11,
   "search_after": [
      "1517388213000",
      "f938251211a3"
   ],
   "sort": [
      { "@timestamp": { "order": "desc" } },
      { "_uuid.keyword": { "order": "asc" } }
   ]
}
'
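The flow above — a first request without search_after, then feeding the last hit's sort values into the next request — can be automated as a loop. Below is a minimal sketch in Python with a stubbed search function standing in for a real Elasticsearch client (the document values are made up); only the search_after hand-off matters.

```python
def paginate(search, page_size):
    """Collect all hits by repeatedly passing the last hit's
    "sort" values back as search_after."""
    body = {
        "size": page_size,
        "sort": [{"@timestamp": {"order": "desc"}},
                 {"_uuid.keyword": {"order": "asc"}}],
    }
    results = []
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            break
        results.extend(hits)
        # key step: the next page starts after the last hit's sort values
        body["search_after"] = hits[-1]["sort"]
    return results

# Stub that pages through 5 fake documents, honoring size/search_after.
DOCS = [{"_id": str(i), "sort": [1517388213000 - i, "uuid-%d" % i]} for i in range(5)]

def fake_search(body):
    start = 0
    if "search_after" in body:
        start = next(i + 1 for i, d in enumerate(DOCS) if d["sort"] == body["search_after"])
    return {"hits": {"hits": DOCS[start:start + body["size"]]}}

print(len(paginate(fake_search, 2)))  # -> 5
```

With a real client the stub would be replaced by something like es.search(index='plus-*', body=body); the loop itself is unchanged.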




Posted by '김용환'

I wanted to sort by uuid in Elasticsearch, but an error occurred.


(Query)

curl -XGET 'localhost:9200/google_plus_data/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "size": 10,
  "sort": [
    { "@timestamp": "desc" },
    { "_uuid": "asc" }
  ]
}'




(Error)


"reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [_uuid] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."




Looking at the documentation for this error, it seems like some extra setup would be needed, but..

https://www.elastic.co/guide/en/elasticsearch/reference/5.0/fielddata.html


the uuid field has a keyword sub-field; sort on that and it works fine.



curl -XGET 'localhost:9200/google_plus_data/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "size": 10,
  "sort": [
    { "@timestamp": "desc" },
    { "_uuid.keyword": "asc" }
  ]
}'





Posted by '김용환'


elasticsearch 5.x has search_after, which replaces the previously heavyweight from+size and the scroll api.

When I used scroll before, it was heavy and memory was an issue (it is state-based).

search_after is stateless and provides a live cursor, so it seems able to optimize performance.


https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-request-search-after.html#search-request-search-after


Pagination of results can be done by using the from and size but the cost becomes prohibitive when the deep pagination is reached. The index.max_result_window which defaults to 10,000 is a safeguard, search requests take heap memory and time proportional to from + size. The Scroll api is recommended for efficient deep scrolling but scroll contexts are costly and it is not recommended to use it for real time user requests. The search_after parameter circumvents this problem by providing a live cursor. The idea is to use the results from the previous page to help the retrieval of the next page.





Here is an example of using search_after to find the value "HTTP" within a specific time range and paginate over the results.


On the first request, do not pass search_after; just make sure the ordering is right. Ties within the same timestamp are broken by _uid.



$ curl log.google.io:9200/log-2017.11.14/logstash/_search?pretty=true -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "*HTTP*"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2017-11-14T11:00:00+09:00",
              "lte": "2017-11-14T12:00:00+09:00"
            }
          }
        }
      ]
    }
  },
  "size":3,
  "sort":[
      {
          "@timestamp":{"order":"asc"},
          "_uid": { "order": "desc" }
      }
   ]

}'





The result should look something like this.



{
  "took" : 94,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 488876,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "log-2017.11.14",
        "_type" : "logstash",
        "_id" : "log-1d8a4f45-a15f-41bf-95f2-399c88844b60",
        "_score" : null,
        "_source" : {
          "pid" : "13434",
          "severity" : "INFO",
          "ident" : "nova.metadata.wsgi.server",
          "message" : "[-] 1.1.1.1 \"GET /latest/meta-data/ami-id HTTP/1.1\" status: 200 len: 129 time: 0.0031130",
          "hostname" : "eastzone-pg1-api001",
          "node" : "openstack_control",
          "type" : "nova_api",
          "phase" : "pg1",
          "@timestamp" : "2017-11-14T11:00:00+09:00",
          "_uuid" : "log-1d8a4f45-a15f-41bf-95f2-399c88844b60"
        },
        "sort" : [
          1510624800000,
          "logstash#log-1d8a4f45-a15f-41bf-95f2-399c88844b60"
        ]
      },
      {
        "_index" : "log-2017.11.14",
        "_type" : "logstash",
        "_id" : "log-ae067d5d-0d90-4414-ad6c-1c7cf286f95c",
        "_score" : null,
        "_source" : {
          "pid" : "4386",
          "severity" : "INFO",
          "ident" : "nova.osapi_compute.wsgi.server",
          "message" : "[-] 10.197.12.118,10.60.19.248 \"GET /v2/1ba72e3bdbe8491ba851f2f9fb4eb6f1/servers/a6c14368-1816-4908-8cab-0b97fed31f40/ips HTTP/1.1\" status: 401 len: 297 time: 0.0048220",
          "hostname" : "eastzone-pg1-api004",
          "node" : "openstack_control",
          "type" : "nova_api",
          "phase" : "pg1",
          "@timestamp" : "2017-11-14T11:00:00+09:00",
          "_uuid" : "log-ae067d5d-0d90-4414-ad6c-1c7cf286f95c"
        },
        "sort" : [
          1510624800000,
          "logstash#log-ae067d5d-0d90-4414-ad6c-1c7cf286f95c"
        ]
      },
      {
        "_index" : "log-2017.11.14",
        "_type" : "logstash",
        "_id" : "log-9ddf30a2-cfc1-4b76-b50b-5f4bcae02a94",
        "_score" : null,
        "_source" : {
          "pid" : "4431",
          "severity" : "INFO",
          "ident" : "nova.metadata.wsgi.server",
          "message" : "[-] 10.60.34.9,10.60.19.248 \"GET /latest/meta-data/public-ipv4 HTTP/1.1\" status: 200 len: 116 time: 0.0032690",
          "hostname" : "eastzone-pg1-api004",
          "node" : "openstack_control",
          "type" : "nova_api",
          "phase" : "pg1",
          "@timestamp" : "2017-11-14T11:00:00+09:00",
          "_uuid" : "log-9ddf30a2-cfc1-4b76-b50b-5f4bcae02a94"
        },
        "sort" : [
          1510624800000,
          "logstash#log-9ddf30a2-cfc1-4b76-b50b-5f4bcae02a94"
        ]
      }
    ]
  }
}




Build the next search_after from the last hit's uuid. The _uid is the type and the uuid combined (type#id). (By default the uuid alone cannot be used — it is not searchable and not indexed.)



$ curl log.google.io:9200/log-2017.11.14/logstash/_search?pretty=true -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "*HTTP*"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2017-11-14T11:00:00+09:00",
              "lte": "2017-11-14T12:00:00+09:00"
            }
          }
        }
      ]
    }
  },
  "search_after": ["1510624800000", "logstash#log-9ddf30a2-cfc1-4b76-b50b-5f4bcae02a94"],
  "size":3,
  "sort":[
      {
          "@timestamp":{"order":"asc"},
          "_uid": { "order": "desc" }
      }
   ]

}'





The results come back correctly.







Posted by '김용환'


Searching by _id in elasticsearch fails.

The _id field is neither indexed nor stored by default, so use _uid instead (e.g. sort on { "_uid": "asc" }).



  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Fielddata is not supported on field [_id] of type [_id]"
      }
    ],

Posted by '김용환'


While running an elasticsearch client library, the following error occurred.



ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.



The solution is described at the following URL.


https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/_log4j_2_logger.html


Add the following to the pom file, and


<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.9.1</version>
</dependency>



add log4j2.properties to the src/main/resources directory.


appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout

rootLogger.level = info
rootLogger.appenderRef.console.ref = console


Posted by '김용환'


The default limit on the number of fields in an Elasticsearch index is 1000.


If you run into the error below, refer to the following.



Limit of total fields [1000] in index [index name] has been exceeded.






https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Settings to prevent mappings explosion

Defining too many fields in an index is a condition that can lead to a mapping explosion, which can cause out of memory errors and difficult situations to recover from. This problem may be more common than expected. As an example, consider a situation in which every new document inserted introduces new fields. This is quite common with dynamic mappings. Every time a document contains new fields, those will end up in the index’s mappings. This isn’t worrying for a small amount of data, but it can become a problem as the mapping grows. The following settings allow you to limit the number of field mappings that can be created manually or dynamically, in order to prevent bad documents from causing a mapping explosion:

index.mapping.total_fields.limit
The maximum number of fields in an index. The default value is 1000.
index.mapping.depth.limit
The maximum depth for a field, which is measured as the number of inner objects. For instance, if all fields are defined at the root object level, then the depth is 1. If there is one object mapping, then the depth is 2, etc. The default is 20.
index.mapping.nested_fields.limit
The maximum number of nested fields in an index, defaults to 50. Indexing 1 document with 100 nested fields actually indexes 101 documents as each nested document is indexed as a separate hidden document.


The documentation's tone is that raising the field count is bad, so beyond some point there is likely a real performance cost.


If you must change it, do it as follows.



curl -XPUT 'localhost:9200/myindex/_settings' -H 'Content-Type: application/json' -d'
{
  "index.mapping.total_fields.limit": 10000
}'
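To see how close an existing index is to the limit, you can count the mapped fields from the mapping JSON returned by GET myindex/_mapping. A rough Python sketch — the sample mapping is made up, and the traversal only approximates what index.mapping.total_fields.limit counts:

```python
def count_fields(mapping):
    """Roughly count mapped fields in an Elasticsearch mapping dict.
    Counts every entry under "properties" (sub-objects included) and
    every multi-field under "fields" -- an approximation of what
    index.mapping.total_fields.limit measures."""
    total = 0
    for key in ("properties", "fields"):
        for name, sub in mapping.get(key, {}).items():
            total += 1 + count_fields(sub)  # the field itself + anything nested
    return total

# Example: an object field plus a text field with a keyword sub-field.
mapping = {
    "properties": {
        "user": {"properties": {"name": {"type": "text"}, "age": {"type": "integer"}}},
        "message": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    }
}
print(count_fields(mapping))  # -> 5 (user, user.name, user.age, message, message.keyword)
```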




Still, the advice is to use this with caution, so keep that in mind.


https://discuss.elastic.co/t/limit-of-total-fields-1000-in-index-has-been-exceeded/87280/2


->(Elastic Team Member) Just understand that more fields is more overhead and that sparse fields cause trouble. So raise it with caution. Bumping into the limit is likely a sign that you are doing something that isn't going to work well with Elasticsearch in the future.


Posted by '김용환'


Reposted from:


https://www.elastic.co/blog/hot-warm-architecture




Master nodes

We recommend running 3 dedicated master nodes per cluster. Having dedicated master nodes that run in their own JVM increases stability and resilience, as they are not affected by the garbage collection that can affect other types of nodes. These nodes do not handle requests and do not hold any data, and therefore require fewer resources (such as CPU, RAM and disk).

Hot data nodes

Hot data nodes are designed to perform all indexing within the cluster and will also hold the most recent daily indices that generally tend to be queried most frequently. As indexing is a very IO intensive operation, these servers need to be powerful and backed by attached SSD storage.


Warm data nodes

Warm data nodes are designed to handle a large amount of read-only indices that are not queried frequently. As these indices are read-only, warm nodes tend to utilize large attached disks (usually spinning disks) instead of SSDs.



PUT /logs_2015-08-31
{
  "settings": {
    "index.routing.allocation.require.box_type": "hot"
  }
}





Posted by '김용환'




Elasticsearch automatically maps the data type of an index field when you do not specify one. Dynamic templates are a feature for pinning a field's type dynamically, so that a data error or Elasticsearch's internal behavior does not automatically remap it to a broader type.



For example, a field of integer values that Elasticsearch has been treating internally as integer can flip entirely to double when bad data such as 1.1 arrives. This feature lets you prevent that automatic type change.



https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html#dynamic-templates


 "dynamic_templates": [
    {
      "my_template_name": {
        ...  match conditions ...
        "mapping": { ... }
      }
    }
  ]
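As a concrete instance of the skeleton quoted above, a template that pins dynamically-detected longs to integer might look like the following. This is a sketch shown as a Python dict, such as you would pass to a client's index-creation call; the template name is made up, and depending on the Elasticsearch version the mappings may need an extra type level.

```python
# Hypothetical index body pinning dynamically-detected longs to integer,
# following the dynamic_templates skeleton quoted above.
index_body = {
    "mappings": {
        "dynamic_templates": [
            {
                "integers_stay_integers": {          # made-up template name
                    "match_mapping_type": "long",    # match condition
                    "mapping": {"type": "integer"},  # forced mapping
                }
            }
        ]
    }
}
# With an elasticsearch-py client this would be passed as, e.g.:
# es.indices.create(index="my_index", body=index_body)
```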


Posted by '김용환'


You can bind elasticsearch to hadoop and use them together.


https://www.elastic.co/guide/en/elasticsearch/hadoop/current/reference.html



The requirements are as follows. Notably, backward compatibility is maintained from 1.x up to 5.5.


https://www.elastic.co/guide/en/elasticsearch/hadoop/5.5/requirements.html





The main Hive-related material is here:

https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html



You can create a hive table similar to the following.


CREATE EXTERNAL TABLE IF NOT EXISTS artists (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.index.auto.create' = 'false');




The main hive table settings are as follows.


https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html




1. 'es.resource' = the target Elasticsearch resource, in <index>/<type> format (e.g. 'radio/artists')


2. 'es.input.json' = 'yes'

Enables JSON input; the default is false. Note that the documentation writes 'yes' rather than true.


es.input.json (default false)
Whether the input is already in JSON format or not (the default). Please see the appropriate section of each integration for more details about using JSON directly.



3. 'es.nodes'='${target_es_dns}'

Since elasticsearch nodes usually form a cluster, setting es.nodes directs the job to those nodes.



es.nodes (default localhost)
List of Elasticsearch nodes to connect to. When using Elasticsearch remotely, do set this option. Note that the list does not have to contain every node inside the Elasticsearch cluster; these are discovered automatically by elasticsearch-hadoop by default (see below). Each node can also have its HTTP/REST port specified individually (e.g. mynode:9600).


4. 'es.mapping.id' = did

The document field/property whose value is used as the document ID.

es.mapping.id (default none)
The document field/property name containing the document id.


5. 'es.query' = '?q=me*'

The results of a query can also be stored.



6. Batch-related settings

'es.batch.write.refresh'='false'

'es.batch.size.bytes'='10mb'

'es.batch.size.entries'='0'




es.batch.size.bytes (default 1mb)
Size (in bytes) for batch writes using Elasticsearch bulk API. Note the bulk size is allocated per taskinstance. Always multiply by the number of tasks within a Hadoop job to get the total bulk size at runtime hitting Elasticsearch.
es.batch.size.entries (default 1000)
Size (in entries) for batch writes using Elasticsearch bulk API - (0 disables it). Companion to es.batch.size.bytes, once one matches, the batch update is executed. Similar to the size, this setting is per task instance; it gets multiplied at runtime by the total number of Hadoop tasks running.
es.batch.write.refresh (default true)
Whether to invoke an index refresh or not after a bulk update has been completed. Note this is called only after the entire write (meaning multiple bulk updates) have been executed.
es.batch.write.retry.count (default 3)
Number of retries for a given batch in case Elasticsearch is overloaded and data is rejected. Note that only the rejected data is retried. If there is still data rejected after the retries have been performed, the Hadoop job is cancelled (and fails). A negative value indicates infinite retries; be careful in setting this value as it can have unwanted side effects.
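As the quoted docs stress, es.batch.size.bytes is allocated per task instance, so the aggregate write pressure on Elasticsearch is that size multiplied by the number of concurrent Hadoop tasks. A trivial sketch of the arithmetic (the task count of 64 is a made-up example):

```python
def total_bulk_bytes(per_task_bytes, num_tasks):
    """es.batch.size.bytes is per task instance; the cluster can see up to
    per_task_bytes * num_tasks of bulk data in flight at runtime."""
    return per_task_bytes * num_tasks

MB = 1024 * 1024
# e.g. the 10mb setting above with a hypothetical 64 concurrent tasks:
print(total_bulk_bytes(10 * MB, 64) // MB)  # -> 640 (MB hitting Elasticsearch)
```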





Posted by '김용환'