[hbase] Phoenix performance

nosql 2013. 6. 24. 11:36


Source: the Phoenix blog

http://phoenix-hbase.blogspot.kr/2013/05/demystifying-skip-scan-in-phoenix.html?m=1



Improving read performance is a big topic for hbase. The hbase project itself has worked on it with the essential column family feature (https://issues.apache.org/jira/browse/HBASE-5416, https://issues.apache.org/jira/browse/HBASE-8316).


Phoenix is a SQL layer that runs on top of hbase; it is an open-source project similar to Hive and has been getting a lot of recommendations lately.

Rather than ending up with a full scan because of a half-baked, misused Scanner, it can be better to use Phoenix, which improves performance with reseek (skip scan). The blog post linked above reports the performance gained through reseek, so I am sharing it here.
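As a rough illustration of how one would query through Phoenix instead of hand-rolling scans, here is a minimal JDBC sketch. The table, columns, and ZooKeeper quorum are hypothetical; only the jdbc:phoenix:&lt;quorum&gt; URL form and standard JDBC calls are assumed, with the Phoenix client jar on the classpath. A query constrained on the leading row-key columns is exactly the shape that lets Phoenix skip-scan (reseek) instead of scanning the whole table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixSkipScanSketch {
  public static void main(String[] args) throws Exception {
    // "localhost" stands in for the ZooKeeper quorum of the cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
      // METRICS(host, dt, val) is a hypothetical table whose row key starts
      // with (host, dt); the IN + range predicate lets Phoenix skip-scan to
      // the matching key ranges rather than read everything.
      PreparedStatement ps = conn.prepareStatement(
          "SELECT host, dt, val FROM METRICS WHERE host IN (?, ?) AND dt >= ?");
      ps.setString(1, "web01");
      ps.setString(2, "web02");
      ps.setDate(3, java.sql.Date.valueOf("2013-05-01"));
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " " + rs.getDate(2) + " " + rs.getLong(3));
        }
      }
    }
  }
}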


http://archive.cloudera.com/cdh4/cdh/4/hbase/apidocs/org/apache/hadoop/hbase/regionserver/KeyValueScanner.html#reseek(org.apache.hadoop.hbase.KeyValue)

reseek

boolean reseek(KeyValue key)
               throws IOException
Reseek the scanner at or after the specified KeyValue. This method is guaranteed to seek at or after the required key only if the key comes after the current position of the scanner. Should not be used to seek to a key which may come before the current position.

Parameters:
key - seek value (should be non-null)
Returns:
true if scanner has values left, false if end of scanner
Throws:
IOException





in memory test

Test            | Time
Phoenix         | 1.7 sec
Batched Gets    | 4.0 sec



disk read test

Test            | Time
Phoenix         | 37 sec
Batched Gets    | 82 sec
Range Scan      | 12 mins
Hive over HBase | 20+ mins




Posted by '김용환'



hbase used to have a ROOT catalog table. From the 0.95 line onward, however, the ROOT catalog table is no longer used; the .META table is read and written directly. The ROOT level was originally created to support multiple meta regions, but in practice hbase keeps things simple with a single meta region, so it was changed. If multiple meta regions ever come to exist, a ROOT table will probably come back.



1. HBASE-3171(https://issues.apache.org/jira/browse/HBASE-3171)



Rather than storing the ROOT region location in ZooKeeper, going to ROOT, and reading the META location, we should just store the META location directly in ZooKeeper.

The purpose of the root region from the bigtable paper was to support multiple meta regions. Currently, we explicitly only support a single meta region, so the translation from our current code of a single root location to a single meta location will be very simple. Long-term, it seems reasonable that we could store several meta region locations in ZK. There's been some discussion in HBASE-1755 about actually moving META into ZK, but I think this jira is a good step towards taking some of the complexity out of how we have to deal with catalog tables everywhere.

As-is, a new client already requires ZK to get the root location, so this would not change those requirements in any way.

The primary motivation for this is to simplify things like CatalogTracker. The way we can handle root in that class is really simple but the tracking of meta is difficulty and a bit hacky. This hack on tracking of the meta location is what caused one of the bugs over in HBASE-3159.
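In line with that, a client needs nothing more than the ZooKeeper quorum to bootstrap; a minimal sketch with the 0.94-era client API (the quorum hosts, table, and row key are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // the only location a client has to know
HTable table = new HTable(conf, "myTable");         // .META. location is resolved via ZooKeeper
Result r = table.get(new Get(Bytes.toBytes("row1")));
table.close();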



2. The current hbase documentation still contains the ROOT material, but it is expected to be removed soon.


http://hbase.apache.org/book/arch.catalog.html


9.2. Catalog Tables

The catalog tables -ROOT- and .META. exist as HBase tables. They are filtered out of the HBase shell's list command, but they are in fact tables just like any other.

9.2.1. ROOT

-ROOT- keeps track of where the .META. table is. The -ROOT- table structure is as follows:

Key:

  • .META. region key (.META.,,1)

Values:

  • info:regioninfo (serialized HRegionInfo instance of .META.)
  • info:server (server:port of the RegionServer holding .META.)
  • info:serverstartcode (start-time of the RegionServer process holding .META.)




Posted by '김용환'

Hbase supports transactions only in a limited way, at the level of single-row operations (one single row at a time on a single region server). Below is what I studied on the topic.


1. Using MultiRowMutationProtocol.mutateRows()



Through HBASE-5304 (https://issues.apache.org/jira/browse/HBASE-5304), HBASE-5368 (https://issues.apache.org/jira/browse/HBASE-5368), and HBASE-5229 (https://issues.apache.org/jira/browse/HBASE-5229), rows that share the same row-key prefix can be co-located in a single region, and atomic operations can then be performed across them.





The implementation example below is copied verbatim from http://hadoop-hbase.blogspot.kr/2012_02_01_archive.html.


An example can be setup as follows:

1. add this to hbase-site.xml:
  <property>
    <name>hbase.coprocessor.user.region.classes</name>
    <value>org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint</value>
  </property>

This loads the necessary coprocessor endpoint into all regions of all user tables.

2. Create a table that uses KeyPrefixRegionSplitPolicy:
HTableDescriptor myHtd = new HTableDescriptor("myTable");
myHtd.setValue(HTableDescriptor.SPLIT_POLICY,
               KeyPrefixRegionSplitPolicy.class.getName()); 

// set the prefix to 3 in this example
myHtd.setValue(KeyPrefixRegionSplitPolicy.PREFIX_LENGTH_KEY, 
               String.valueOf(3));
HColumnDescriptor hcd = new HColumnDescriptor(...);
myHtd.addFamily(hcd);
HBaseAdmin admin = ...;
admin.createTable(myHtd);

Regions of "myTable" are now split according to KeyPrefixRegionSplitPolicy with a prefix of 3.


3. Execute an atomic multirow transaction
List<Mutation> ms = new ArrayList<Mutation>();
Put p = new Put(Bytes.toBytes("xxxabc"));
...
ms.add(p);
Put p1 = new Put(
Bytes.toBytes("xxx123"));
...
ms.add(p1);
Delete d = new Delete(Bytes.toBytes("xxxzzz"));
...
ms.add(d);
// get a proxy for MultiRowMutationEndpoint
// Note that the passed row is used to locate the 
// region. Any of the other row keys could have
// been used as well, as long as they identify 
// the same region.
MultiRowMutationProtocol mr = t.coprocessorProxy(
   MultiRowMutationProtocol.class,
   Bytes.toBytes("xxxabc"));
// perform the atomic operation 
mr.mutateRows(ms);



2. HTable.mutateRow(RowMutations)

Performs a group of put/delete operations on a single row atomically, in the order they were added (guaranteeing increment as well was requested at one point, but that was dropped; increment must always be used on its own).


HBASE-3584 (https://issues.apache.org/jira/browse/HBASE-3584) and HBASE-5203 (https://issues.apache.org/jira/browse/HBASE-5203)


http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/RowMutations.html

Performs multiple mutations atomically on a single row. Currently Put and Delete are supported. The mutations are performed in the order in which they were added.

We compare and equate mutations based off their row so be careful putting RowMutations into Sets or using them as keys in Maps.



public void testColumnFamilyDeleteRM() throws Exception {
    HTableInterface table = hTablePool.getTable(tableName);
    try {
        RowMutations rm = new RowMutations(row);

        Delete delete = new Delete(row);
        delete.deleteFamily(cf1);
        rm.add(delete);
        System.out.println("Added delete of cf1 column family to row mutation");

        Put put = new Put(row);
        put.add(cf1, Bytes.toBytes("c1"), Bytes.toBytes("new_v1"));
        put.add(cf1, Bytes.toBytes("c11"), Bytes.toBytes("new_v11"));
        rm.add(put);
        System.out.println("Added puts of cf1 column family to row mutation");

        table.mutateRow(rm);
        System.out.println("Mutated row");

        Result result = table.get(new Get(row));
        NavigableMap<byte[], byte[]> familyMap = result.getFamilyMap(cf1);

        Assert.assertNotNull(familyMap);
        Assert.assertEquals(2, familyMap.size());
    } finally {
        table.close();
    }
}


Posted by '김용환'


Starting with versions 0.95.2 and 0.94.9, a major compaction command can be issued from the shell/command line via CompactionTool.


Usage: java org.apache.hadoop.hbase.regionserver.CompactionTool \

[-compactOnce] [-major] [-mapred] [-D<property=value>]* files...
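For example, an invocation might look like the following (the HDFS path to the table directory is hypothetical; the tool compacts the store directories it is pointed at):

$ bin/hbase org.apache.hadoop.hbase.regionserver.CompactionTool \
    -compactOnce -major hdfs://namenode:8020/hbase/myTable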


Key code

public class CompactionTool extends Configured implements Tool {

    /**
     * Execute the actual compaction job.
     * If the compact once flag is not specified, execute the compaction until
     * no more compactions are needed. Uses the Configuration settings provided.
     */
    private void compactStoreFiles(final Path tableDir, final HTableDescriptor htd,
        final HRegionInfo hri, final String familyName, final boolean compactOnce,
        final boolean major) throws IOException {
      HStore store = getStore(conf, fs, tableDir, htd, hri, familyName, tmpDir);
      LOG.info("Compact table=" + htd.getNameAsString() +
        " region=" + hri.getRegionNameAsString() +
        " family=" + familyName);
      if (major) {
        store.triggerMajorCompaction();
      }
      do {
        CompactionContext compaction = store.requestCompaction(Store.PRIORITY_USER, null);
        if (compaction == null) break;
        List<StoreFile> storeFiles = store.compact(compaction);
        if (storeFiles != null && !storeFiles.isEmpty()) {
          if (keepCompactedFiles && deleteCompacted) {
            for (StoreFile storeFile: storeFiles) {
              fs.delete(storeFile.getPath(), false);
            }
          }
        }
      } while (store.needsCompaction() && !compactOnce);
    }


Looking deeper, the internal source calls CompactionContext's select method.



Source:

Add major compaction support in CompactionTool

https://issues.apache.org/jira/browse/HBASE-8683


Posted by '김용환'




MSLAB (MemStore-Local Allocation Buffers, http://knight76.tistory.com/entry/hbase-MSLAB-MemStoreLocal-Allocation-Buffer-공부) is implemented with byte[] buffers on top of the MemStore, Hbase's write buffer.

On the read side, read performance can be maximized by making good use of the various implementations of the block cache, Hbase's read buffer.

LruBlockCache is the representative implementation, but there are also BucketCache, SlabCache (SingleSizeCache), SimpleBlockCache, CombinedBlockCache (which combines BucketCache and LruBlockCache), and DoubleBlockCache (which combines LruBlockCache and SlabCache), so various combinations can be tried.


LruBlockCache uses the Java heap, so GC is a significant issue and tuning for it is essential. By default it is sized at 25% of the heap. To reduce fragmentation, SlabCache, which uses fixed-size memory, can be used instead, but slab memory has its own drawback (a memory copy on get/cache).
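As a minimal sketch, the cache fraction can be changed through the hfile.block.cache.size key quoted in item 2 below (the 0.4 value is only an illustrative choice for a read-heavy workload):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// Raise the LruBlockCache from the default 25% of heap to 40% of heap.
conf.setFloat("hfile.block.cache.size", 0.4f);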

BucketCache manages the cache by dividing the cache space into buckets that can be labeled and given a size.


There is no documentation for deciding which block cache to use, so I had to read the source. See the CacheConfig class; items 13 and 14 below cover it.


This was written on 2013-06-07 based on the links referenced here. Configuration-related code moves quickly, so the documentation seems to lag behind.



0. Block overview

Blocks come in three kinds: DataBlock (64K, stores key-values), BloomBlock (128K, stores Bloom filter data), and IndexBlock (128K, stores index data). BloomBlock and IndexBlock are used nearly 100% of the time, so they are called MetaBlocks. Cached blocks make up the block cache, which exists for read performance.



1. read & write buffer


http://nosql.mypopescu.com/post/18943894052/what-hbase-learned-from-the-hypertable-vs-hbase



2. http://hbase.apache.org/book/config.files.html


hfile.block.cache.size

Percentage of maximum heap (-Xmx setting) to allocate to block cache used by HFile/StoreFile. Default of 0.25 means allocate 25%. Set to 0 to disable but it's not recommended.

Default: 0.25


hbase.rs.cacheblocksonwrite

Whether an HFile block should be added to the block cache when the block is finished.

Default: false


hfile.block.index.cacheonwrite

This allows to put non-root multi-level index blocks into the block cache at the time the index is being written.

Default: false


hfile.block.bloom.cacheonwrite

Enables cache-on-write for inline blocks of a compound Bloom filter.

Default: false



3. http://hbase.apache.org/book/important_configurations.html


2.5.3.2. Disabling Blockcache

Do not turn off block cache (You'd do it by setting hbase.block.cache.size to zero). Currently we do not do well if you do this because the regionserver will spend all its time loading hfile indices over and over again. If your working set it such that block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea on the size you need by surveying regionserver UIs; you'll see index block size accounted near the top of the webpage).





4.http://hbase.apache.org/book/regionserver.arch.html

9.6.4. Block Cache

9.6.4.1. Design

The Block Cache is an LRU cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:

  • Single access priority: The first time a block is loaded from HDFS it normally has this priority and it will be part of the first group to be considered during evictions. The advantage is that scanned blocks are more likely to get evicted than blocks that are getting more usage.
  • Multi access priority: If a block in the previous priority group is accessed again, it upgrades to this priority. It is thus part of the second group considered during evictions.
  • In-memory access priority: If the block's family was configured to be "in-memory", it will be part of this priority disregarding the number of times it was accessed. Catalog tables are configured like this. This group is the last one considered during evictions.

For more information, see the LruBlockCache source

9.6.4.2. Usage

Block caching is enabled by default for all the user tables which means that any read operation will load the LRU cache. This might be good for a large number of use cases, but further tunings are usually required in order to achieve better performance. An important concept is the working set size, or WSS, which is: "the amount of memory needed to compute the answer to a problem". For a website, this would be the data that's needed to answer the queries over a short amount of time.

The way to calculate how much memory is available in HBase for caching is:

            number of region servers * heap size * hfile.block.cache.size * 0.85
        

The default value for the block cache is 0.25 which represents 25% of the available heap. The last value (85%) is the default acceptable loading factor in the LRU cache after which eviction is started. The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks. Here are some examples:

  • One region server with the default heap size (1GB) and the default block cache size will have 217MB of block cache available.
  • 20 region servers with the heap size set to 8GB and a default block cache size will have 34GB of block cache.
  • 100 region servers with the heap size set to 24GB and a block cache size of 0.5 will have about 1TB of block cache.

Your data isn't the only resident of the block cache, here are others that you may have to take into account:

  • Catalog tables: The -ROOT- and .META. tables are forced into the block cache and have the in-memory priority which means that they are harder to evict. The former never uses more than a few hundreds of bytes while the latter can occupy a few MBs (depending on the number of regions).
  • HFiles indexes: HFile is the file format that HBase uses to store data in HDFS and it contains a multi-layered index in order seek to the data without having to read the whole file. The size of those indexes is a factor of the block size (64KB by default), the size of your keys and the amount of data you are storing. For big data sets it's not unusual to see numbers around 1GB per region server, although not all of it will be in cache because the LRU will evict indexes that aren't used.
  • Keys: Taking into account only the values that are being stored is missing half the picture since every value is stored along with its keys (row key, family, qualifier, and timestamp). See Section 6.3.2, “Try to minimize row and column sizes”.
  • Bloom filters: Just like the HFile indexes, those data structures (when enabled) are stored in the LRU.

Currently the recommended way to measure HFile indexes and bloom filters sizes is to look at the region server web UI and checkout the relevant metrics. For keys, sampling can be done by using the HFile command line tool and look for the average key size metric.

It's generally bad to use block caching when the WSS doesn't fit in memory. This is the case when you have for example 40GB available across all your region servers' block caches but you need to process 1TB of data. One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily. Here are two use cases:

  • Fully random reading pattern: This is a case where you almost never access the same row twice within a short amount of time such that the chance of hitting a cached block is close to 0. Setting block caching on such a table is a waste of memory and CPU cycles, more so that it will generate more garbage to pick up by the JVM. For more information on monitoring GC, see Section 12.2.3, “JVM Garbage Collection Logs”.
  • Mapping a table: In a typical MapReduce job that takes a table in input, every row will be read only once so there's no need to put them into the block cache. The Scan object has the option of turning this off via the setCaching method (set it to false). You can still keep block caching turned on on this table if you need fast random read access. An example would be counting the number of rows in a table that serves live traffic, caching every block of that table would create massive churn and would surely evict data that's currently in use.



5. http://hbase.apache.org/book.html#perf.hbase.client.blockcache


11.9.5. Block Cache

Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks method. For input Scans to MapReduce jobs, this should be false. For frequently accessed rows, it is advisable to use the block cache.
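A minimal sketch of the setting mentioned above, together with scanner caching (the value 500 is only an example):

import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
scan.setCaching(500);        // rows returned per RPC from the RegionServer
scan.setCacheBlocks(false);  // recommended for MapReduce input scans so they don't churn the block cache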



6. https://issues.apache.org/jira/browse/HBASE-4027


The JIRA title is "Enable direct byte buffers LruBlockCache"; applied from 0.92.0.

Setting -XX:MaxDirectMemorySize in hbase-env.sh enables this feature. The file already has a line you can uncomment and you need to set the size of the direct memory (your total memory - size allocated to memstores - size allocated to the normal block cache - some head room for the other functionalities).


Description

Java offers the creation of direct byte buffers which are allocated outside of the heap.

They need to be manually free'd, which can be accomplished using an documented clean method.

The feature will be optional. After implementing, we can benchmark for differences in speed and garbage collection observances.


7. http://hbase.apache.org/book.html#trouble.client.oome.directmemory.leak

12.5.6. Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing)

You are likely running into the issue that is described and worked through in the mail thread HBase, mail # user - Suspected memory leak and continued over in HBase, mail # dev - FeedbackRe: Suspected memory leak. A workaround is passing your client-side JVM a reasonable value for -XX:MaxDirectMemorySize. By default, the MaxDirectMemorySize is equal to your -Xmx max heapsize setting (if -Xmx is set). Try seting it to something smaller (for example, one user had success setting it to 1g when they had a client-side heap of 12g). If you set it too small, it will bring on FullGCs so keep it a bit hefty. You want to make this setting client-side only especially if you are running the new experiemental server-side off-heap cache since this feature depends on being able to use big direct buffers (You may have to keep separate client-side and server-side config dirs).




8. https://issues.apache.org/jira/browse/HBASE-6312

Make BlockCache eviction thresholds configurable

Applied in 0.95.0.


Release Note:

From now on, the block cache will use all the memory it's given as its upper bound was raised from 85% to 99%. The lower bound for evictions, called "minimum factor", was raised from 75% to 95% and is now configurable via "hbase.lru.blockcache.min.factor". This means that 4% of the block cache is evicted at a time instead of 10%, so evictions may run more often but each will be less disruptive. 


Description

Some of our customers found that tuning the BlockCache eviction thresholds made test results different in their test environment. However, those thresholds are not configurable in the current implementation. The only way to change those values is to re-compile the HBase source code. We wonder if it is possible to make them configurable.
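Assuming the property name from the release note above, the eviction lower bound could be tuned like this (0.90 is an illustrative value, not a recommendation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// Evict down to 90% instead of the 95% default, i.e. evict roughly 9% of the cache per pass.
conf.setFloat("hbase.lru.blockcache.min.factor", 0.90f);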



9. https://issues.apache.org/jira/browse/HBASE-7404


Bucket Cache: A solution about CMS, Heap Fragment and Big Cache on HBASE — applied in 0.95.0




  • Release Note:
    BucketCache is another implementation of BlockCache which supports big block cache for high performance and would greatly decrease CMS and heap fragmentation in JVM caused by read activities. 




Description

First, thanks @neil from Fusion-IO share the source code.

Usage:

1.Use bucket cache as main memory cache, configured as the following:
–"hbase.bucketcache.ioengine" "heap"
–"hbase.bucketcache.size" 0.4 (size for bucket cache, 0.4 is a percentage of max heap size)

2.Use bucket cache as a secondary cache, configured as the following:
–"hbase.bucketcache.ioengine" "file:/disk1/hbase/cache.data"(The file path where to store the block data)
–"hbase.bucketcache.size" 1024 (size for bucket cache, unit is MB, so 1024 means 1GB)
–"hbase.bucketcache.combinedcache.enabled" false (default value being true)

See more configurations from org.apache.hadoop.hbase.io.hfile.CacheConfig and org.apache.hadoop.hbase.io.hfile.bucket.BucketCache

What's Bucket Cache? 
It could greatly decrease CMS and heap fragment by GC
It support a large cache space for High Read Performance by using high speed disk like Fusion-io

1.An implementation of block cache like LruBlockCache
2.Self manage blocks' storage position through Bucket Allocator
3.The cached blocks could be stored in the memory or file system
4.Bucket Cache could be used as a mainly block cache(see CombinedBlockCache), combined with LruBlockCache to decrease CMS and fragment by GC.
5.BucketCache also could be used as a secondary cache(e.g. using Fusionio to store block) to enlarge cache space

How about SlabCache?
We have studied and test SlabCache first, but the result is bad, because:
1.SlabCache use SingleSizeCache, its use ratio of memory is low because kinds of block size, especially using DataBlockEncoding
2.SlabCache is uesd in DoubleBlockCache, block is cached both in SlabCache and LruBlockCache, put the block to LruBlockCache again if hit in SlabCache , it causes CMS and heap fragment don't get any better
3.Direct heap performance is not good as heap, and maybe cause OOM, so we recommend using "heap" engine

See more in the attachment and in the patch
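Putting the configuration quoted above into code, a hedged sketch for the file-backed secondary-cache case (the cache file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.bucketcache.ioengine", "file:/disk1/hbase/cache.data"); // where block data is stored
conf.setFloat("hbase.bucketcache.size", 1024f);  // a value >= 1 is taken as megabytes, so 1024 means 1GB
conf.setBoolean("hbase.bucketcache.combinedcache.enabled", false);      // use it as a secondary/victim cache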



10. http://zoomq.qiniudn.com/ZQScrapBook/ZqFLOSS/data/20130319094323/index.html

Written by a Chinese author, so it is hard to read, but the English translation is just about readable.



11. http://www.venturesquare.net/514286

The App Between service allocates hfile.block.cache.size = 0.5.


12. http://www.marshut.com/kikq/does-hbase-regionserver-benefit-from-os-page-cache.html

Some interesting related discussion.


13. Part of hbase's CacheConfig.java source

The source shows, as the block cache is initialized, which settings determine the block cache policy.


 instantiateBlockCache



  /**
   * Returns the block cache or <code>null</code> in case none should be used.
   *
   * @param conf  The current configuration.
   * @return The block cache or <code>null</code>.
   */
  private static synchronized BlockCache instantiateBlockCache(Configuration conf) {
    if (globalBlockCache != null) return globalBlockCache;
    if (blockCacheDisabled) return null;

    float cachePercentage = conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
      HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
    if (cachePercentage == 0L) {
      blockCacheDisabled = true;
      return null;
    }
    if (cachePercentage > 1.0) {
      throw new IllegalArgumentException(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY +
        " must be between 0.0 and 1.0, and not > 1.0");
    }

    // Calculate the amount of heap to give the heap.
    MemoryUsage mu = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    long lruCacheSize = (long) (mu.getMax() * cachePercentage);
    int blockSize = conf.getInt("hbase.offheapcache.minblocksize", HConstants.DEFAULT_BLOCKSIZE);
    long offHeapCacheSize =
      (long) (conf.getFloat("hbase.offheapcache.percentage", (float) 0) *
          DirectMemoryUtils.getDirectMemorySize());
    if (offHeapCacheSize <= 0) {
      String bucketCacheIOEngineName = conf.get(BUCKET_CACHE_IOENGINE_KEY, null);
      float bucketCachePercentage = conf.getFloat(BUCKET_CACHE_SIZE_KEY, 0F);
      // A percentage of max heap size or a absolute value with unit megabytes
      long bucketCacheSize = (long) (bucketCachePercentage < 1 ? mu.getMax()
          * bucketCachePercentage : bucketCachePercentage * 1024 * 1024);

      boolean combinedWithLru = conf.getBoolean(BUCKET_CACHE_COMBINED_KEY,
          DEFAULT_BUCKET_CACHE_COMBINED);
      BucketCache bucketCache = null;
      if (bucketCacheIOEngineName != null && bucketCacheSize > 0) {
        int writerThreads = conf.getInt(BUCKET_CACHE_WRITER_THREADS_KEY,
            DEFAULT_BUCKET_CACHE_WRITER_THREADS);
        int writerQueueLen = conf.getInt(BUCKET_CACHE_WRITER_QUEUE_KEY,
            DEFAULT_BUCKET_CACHE_WRITER_QUEUE);
        String persistentPath = conf.get(BUCKET_CACHE_PERSISTENT_PATH_KEY);
        float combinedPercentage = conf.getFloat(
            BUCKET_CACHE_COMBINED_PERCENTAGE_KEY,
            DEFAULT_BUCKET_CACHE_COMBINED_PERCENTAGE);
        if (combinedWithLru) {
          lruCacheSize = (long) ((1 - combinedPercentage) * bucketCacheSize);
          bucketCacheSize = (long) (combinedPercentage * bucketCacheSize);
        }
        try {
          int ioErrorsTolerationDuration = conf.getInt(
              "hbase.bucketcache.ioengine.errors.tolerated.duration",
              BucketCache.DEFAULT_ERROR_TOLERATION_DURATION);
          bucketCache = new BucketCache(bucketCacheIOEngineName,
              bucketCacheSize, writerThreads, writerQueueLen, persistentPath,
              ioErrorsTolerationDuration);
        } catch (IOException ioex) {
          LOG.error("Can't instantiate bucket cache", ioex);
          throw new RuntimeException(ioex);
        }
      }
      LOG.info("Allocating LruBlockCache with maximum size " +
        StringUtils.humanReadableInt(lruCacheSize));
      LruBlockCache lruCache = new LruBlockCache(lruCacheSize, StoreFile.DEFAULT_BLOCKSIZE_SMALL);
      lruCache.setVictimCache(bucketCache);
      if (bucketCache != null && combinedWithLru) {
        globalBlockCache = new CombinedBlockCache(lruCache, bucketCache);
      } else {
        globalBlockCache = lruCache;
      }
    } else {
      globalBlockCache = new DoubleBlockCache(lruCacheSize, offHeapCacheSize,
          StoreFile.DEFAULT_BLOCKSIZE_SMALL, blockSize, conf);
    }
    return globalBlockCache;
  }
}


14. hbase configuration settings

These cannot yet be found on the hbase configuration page, but settings exist in the source so that appropriate values can be maintained.


* LruBlockCache

hbase.lru.blockcache.min.factor

hbase.lru.blockcache.acceptable.factor


* SlabCache

hbase.offheapcache.slab.proportions

hbase.offheapcache.slab.sizes


* BucketCache

hbase.bucketcache.ioengine

hbase.bucketcache.size

hbase.bucketcache.persistent.path

hbase.bucketcache.combinedcache.enabled

hbase.bucketcache.percentage.in.combinedcache

hbase.bucketcache.writer.threads

hbase.bucketcache.writer.queuelength


* Off-heap cache: configuration variables for the off-heap area

hbase.offheapcache.minblocksize

hbase.offheapcache.percentage


* compressed

 hbase.rs.blockcache.cachedatacompressed : Configuration key to cache data blocks in compressed format.


hbase.rs.evictblocksonclose : Configuration key to evict all blocks of a given file from the block cache when the file is closed.



15. DirectMemory

http://coders.talend.com/sites/default/files/heapoff-wtf_OlivierLamy.pdf

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.92.1/org/apache/hadoop/hbase/util/DirectMemoryUtils.java

public class DirectMemoryUtils {

  /**
   * @return the setting of -XX:MaxDirectMemorySize as a long.
   *         Returns 0 if -XX:MaxDirectMemorySize is not set.
   */
  public static long getDirectMemorySize() {
    RuntimeMXBean RuntimemxBean = ManagementFactory.getRuntimeMXBean();
    List<String> arguments = RuntimemxBean.getInputArguments();
    long multiplier = 1; // for the byte case.
    for (String s : arguments) {
      if (s.contains("-XX:MaxDirectMemorySize=")) {
        String memSize = s.toLowerCase()
            .replace("-xx:maxdirectmemorysize=", "").trim();

        if (memSize.contains("k")) {
          multiplier = 1024;
        }
        else if (memSize.contains("m")) {
          multiplier = 1048576;
        }
        else if (memSize.contains("g")) {
          multiplier = 1073741824;
        }
        memSize = memSize.replaceAll("[^\\d]", "");

        long retValue = Long.parseLong(memSize);
        return retValue * multiplier;
      }
    }
    return 0;
  }


  

  /**
   * DirectByteBuffers are garbage collected by using a phantom reference and a
   * reference queue. Every once a while, the JVM checks the reference queue and
   * cleans the DirectByteBuffers. However, as this doesn't happen immediately
   * after discarding all references to a DirectByteBuffer, it's easy to
   * OutOfMemoryError yourself using DirectByteBuffers. This function explicitly
   * calls the Cleaner method of a DirectByteBuffer.
   *
   * @param toBeDestroyed The DirectByteBuffer that will be "cleaned". Utilizes reflection.
   */
  public static void destroyDirectByteBuffer(ByteBuffer toBeDestroyed)
      throws IllegalArgumentException, IllegalAccessException,
      InvocationTargetException, SecurityException, NoSuchMethodException {

    Preconditions.checkArgument(toBeDestroyed.isDirect(),
        "toBeDestroyed isn't direct!");

    Method cleanerMethod = toBeDestroyed.getClass().getMethod("cleaner");
    cleanerMethod.setAccessible(true);
    Object cleaner = cleanerMethod.invoke(toBeDestroyed);
    Method cleanMethod = cleaner.getClass().getMethod("clean");
    cleanMethod.setAccessible(true);
    cleanMethod.invoke(cleaner);
  }
}
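A minimal usage sketch of the two utilities above (the buffer size is arbitrary; the reflective exceptions are simply propagated here):

import java.nio.ByteBuffer;
import org.apache.hadoop.hbase.util.DirectMemoryUtils;

public class DirectBufferDemo {
  public static void main(String[] args) throws Exception {
    // 0 means -XX:MaxDirectMemorySize was not passed to this JVM.
    System.out.println("MaxDirectMemorySize = " + DirectMemoryUtils.getDirectMemorySize());

    ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024);
    // ... read/write the off-heap buffer ...
    DirectMemoryUtils.destroyDirectByteBuffer(buf); // free it eagerly instead of waiting for the GC
  }
}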


16. Related source files

src/main/java/org/apache/hadoop/hbase/io/hfile 

src/main/java/org/apache/hadoop/hbase/io/hfile/slab/

src/main/java/org/apache/hadoop/hbase/util/DirectMemoryUtils.java

src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java

src/test/java/org/apache/hadoop/hbase/io/hfile/

src/test/java/org/apache/hadoop/hbase/io/hfile/slab/

Posted by '김용환'

Found an hbase use case

nosql 2013. 6. 7. 10:58


The chat service for couples, http://appbetween.us/ko/ , uses hbase.


http://www.venturesquare.net/48056
http://www.venturesquare.net/514286



Posted by '김용환'



1. http://www.slideshare.net/cloudera/hbase-hug-presentation

HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers from Cloudera, Inc.


2. http://hbase.apache.org/book.html#hbase.hregion.memstore.mslab.enabled

hbase.hregion.memstore.mslab.enabled

Enables the MemStore-Local Allocation Buffer, a feature which works to prevent heap fragmentation under heavy write loads. This can reduce the frequency of stop-the-world GC pauses on large heaps.

Default: true




3. http://hbase.apache.org/book/upgrade0.92.html


3.4.2. MSLAB is ON by default

In 0.92.0, the hbase.hregion.memstore.mslab.enabled flag is set to true (See Section 11.3.1.1, “Long GC pauses”). In 0.90.x it was false. When it is enabled, memstores will step allocate memory in MSLAB 2MB chunks even if the memstore has zero or just a few small elements. This is fine usually but if you had lots of regions per regionserver in a 0.90.x cluster (and MSLAB was off), you may find yourself OOME'ing on upgrade because the thousands of regions * number of column families * 2MB MSLAB (at a minimum) puts your heap over the top. Set hbase.hregion.memstore.mslab.enabled to false or set the MSLAB size down from 2MB by setting hbase.hregion.memstore.mslab.chunksize to something less.
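To put numbers on that: 1,000 regions × 3 column families × 2MB already comes to roughly 6GB of MSLAB chunks at a minimum. A hedged sketch of the two knobs mentioned above (the 1MB chunk size is only an example, and you would pick one of the two options, not both):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// Either turn MSLAB off entirely ...
conf.setBoolean("hbase.hregion.memstore.mslab.enabled", false);
// ... or keep it on and shrink the per-MemStore chunk from the 2MB default.
conf.setInt("hbase.hregion.memstore.mslab.chunksize", 1024 * 1024);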



4. http://hbase.apache.org/book/jvm.html#mslab

11.3.1. The Garbage Collector and Apache HBase

11.3.1.1. Long GC pauses

In his presentation, Avoiding Full GCs with MemStore-Local Allocation Buffers, Todd Lipcon describes two cases of stop-the-world garbage collections common in HBase, especially during loading; CMS failure modes and old generation heap fragmentation brought. To address the first, start the CMS earlier than default by adding -XX:CMSInitiatingOccupancyFraction and setting it down from defaults. Start at 60 or 70 percent (The lower you bring down the threshold, the more GCing is done, the more CPU used). To address the second fragmentation issue, Todd added an experimental facility, MSLAB, that must be explicitly enabled in Apache HBase 0.90.x (Its defaulted to be on in Apache 0.92.x HBase). Set hbase.hregion.memstore.mslab.enabled to true in your Configuration. See the cited slides for background and detail[25]. Be aware that when enabled, each MemStore instance will occupy at least an MSLAB instance of memory. If you have thousands of regions or lots of regions each with many column families, this allocation of MSLAB may be responsible for a good portion of your heap allocation and in an extreme case cause you to OOME. Disable MSLAB in this case, or lower the amount of memory it uses or float less regions per server.



5.

http://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/

http://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-2/

http://blog.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/



The Best News of All

After producing the above graph, I let the insert workload run overnight, and then continued for several days. In all of this time, there was not a single GC pause that lasted longer than a second. The fragmentation problem was completely solved for this workload!

How to try it

The MSLAB allocation scheme is available in Apache HBase 0.90.1, and part of CDH3 Beta 4 released last week. Since it is relatively new, it is not yet enabled by default, but it can be configured using the following flags:

Configuration                               | Description
hbase.hregion.memstore.mslab.enabled        | Set to true to enable this feature
hbase.hregion.memstore.mslab.chunksize      | The size of the chunks allocated by MSLAB, in bytes (default 2MB)
hbase.hregion.memstore.mslab.max.allocation | The maximum size byte array that should come from the MSLAB, in bytes (default 256KB)


6. http://inking007.tistory.com/entry/Performance-Tuning-Hbase


7. Information from AppBetween
http://www.venturesquare.net/514286
They report that the GC impact was kept very small (0.2 seconds).

8. API & 소스
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/MemStoreLAB.html
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.94.0/org/apache/hadoop/hbase/regionserver/MemStoreLAB.java


Posted by '김용환'


I found an open-source tool (one of the Apache top-level projects) that imports/exports existing RDB data into hadoop (hbase, hive).


The tool is called Sqoop: http://sqoop.apache.org/


1. Import into HDFS store




2. export to hdfs 



Apache Sqoop: A Data Transfer Tool for Hadoop from Cloudera, Inc.


References

1. https://blog.cloudera.com/blog/2011/10/apache-sqoop-overview/ 

2. https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

3. http://www.slideshare.net/cloudera/apache-sqoop-a-data-transfer-tool-for-hadoop


Posted by '김용환'


For web servers, nproc and ulimit -n are sometimes adjusted to allow more sockets.

hbase, however, does real work against files, so these limits need attention.


An hbase column family uses at least one store file, and under load it can use up to six. If a column family uses three store files on average and those store files are spread across 100 regions, about 300 (1 * 3 * 100) file descriptors will be open.


If three column families are in use at the same time, under load hbase can hold about 1800 (3 * 6 * 100) file descriptors open concurrently.



 /etc/security/limits.conf

hadoop  -       nofile  32768

hadoop  soft    nproc   32000
hadoop  hard    nproc   32000



* Reference
http://hbase.apache.org/book.html


Posted by '김용환'



Intel's post "Hadoop and HBase Optimization for Read Intensive Search Applications" describes how a company called Bizosys improved read performance; here is a summary.


1. Used SSDs

2. Compiled the zlib used by hadoop/hbase from source, optimized for Intel parallel processing, to optimize preview and index compression

3. Changed the hadoop & hbase configuration as follows


Configuration File | Property | Description | Value
hdfs-site.xml | dfs.block.size | Lower value offers parallelism. | 33554432
hdfs-site.xml | dfs.datanode.handler.count | Number of handlers dedicated to serve data block requests in hadoop DataNodes. | 100
core-site.xml | io.file.buffer.size | This is the read and write buffer size. By setting limit to 16KB it allows continuous streaming. | 16384
hbase-site.xml | hbase.regionserver.handler.count | RPC Server instances spun up on HBase RegionServers | 100
hbase-site.xml | hfile.min.blocksize.size | Small size increases the index but reduces the lesser fetch on a random access. | 65536
default.xml | SCAN_IPC_CACHE_LIMIT | Number of rows cached in Bizosys search engine for each scanner next call over the wire. It reduces the network round trip by 300 times caching 300 rows in each trip. | 300
default.xml | LOCAL_JOB_HANDLER_COUNT | Number of parallel queries executed at one go. Query requests above than this limit gets queued up. | 100


4. JVM options
1) heap : 4G
2) gc : "-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError" 

5. Gained further benefit by using a customized index on Hbase

<Results>
1. IPC optimization improved performance by 33%.
2. With IPC optimization applied, JVM GC tuning added another 16%.
3. After GC tuning, applying the custom index improved things by 62.5%.
4. After the custom index, applying SSDs improved things by 66%.

For cache eviction, the block cache was set to 20% (800M) of the total heap (4G).


Posted by '김용환'