MSLAB (MemStore-Local Allocation Buffers, http://knight76.tistory.com/entry/hbase-MSLAB-MemStoreLocal-Allocation-Buffer-공부) is built with byte[] on top of the MemStore, which is HBase's write buffer.

By contrast, the block cache, HBase's read buffer, comes in several implementations, and choosing among them well is how you get the best read performance.

LruBlockCache is the best-known one, but there are also BucketCache, SlabCache (SingleSizeCache), SimpleBlockCache, CombinedBlockCache (which combines BucketCache with LruBlockCache), and DoubleBlockCache (which combines LruBlockCache with SlabCache), so there are several options to try.


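LruBlockCache lives on the Java heap, so its main drawback is GC pressure, and tuning for that is essential. By default it is given 25% of the heap. To reduce fragmentation you can use SlabCache, which works with fixed-size memory, but slab memory has its own drawback (a memory copy on every get and cache operation).

BucketCache can label its cache space and manages the cache by dividing it into buckets whose sizes can be configured.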


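There was no documentation to help decide which BlockCache to use, so I had to read the source, specifically the CacheConfig class. See items 13 and 14 below.


This post was written on 2013.6.7 and is based on the links below as of that date. The code around these settings moves quickly, so the documentation seems to lag behind.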



0. Introduction to blocks

Blocks come in three kinds: DataBlock (64K, stores key-values), BloomBlock (128K, stores Bloom filter data), and IndexBlock (128K, stores index data). BloomBlock and IndexBlock are used nearly 100% of the time, so they are called MetaBlocks. The cached blocks make up the Block Cache, which exists for read performance.



1. read & write buffer


http://nosql.mypopescu.com/post/18943894052/what-hbase-learned-from-the-hypertable-vs-hbase



2. http://hbase.apache.org/book/config.files.html


hfile.block.cache.size

Percentage of maximum heap (-Xmx setting) to allocate to block cache used by HFile/StoreFile. Default of 0.25 means allocate 25%. Set to 0 to disable but it's not recommended.

Default: 0.25


hbase.rs.cacheblocksonwrite

Whether an HFile block should be added to the block cache when the block is finished.

Default: false


hfile.block.index.cacheonwrite

This allows to put non-root multi-level index blocks into the block cache at the time the index is being written.

Default: false


hfile.block.bloom.cacheonwrite

Enables cache-on-write for inline blocks of a compound Bloom filter.

Default: false



3. http://hbase.apache.org/book/important_configurations.html


2.5.3.2. Disabling Blockcache

Do not turn off block cache (You'd do it by setting hfile.block.cache.size to zero). Currently we do not do well if you do this because the regionserver will spend all its time loading hfile indices over and over again. If your working set is such that block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea on the size you need by surveying regionserver UIs; you'll see index block size accounted near the top of the webpage).





4. http://hbase.apache.org/book/regionserver.arch.html

9.6.4. Block Cache

9.6.4.1. Design

The Block Cache is an LRU cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:

  • Single access priority: The first time a block is loaded from HDFS it normally has this priority and it will be part of the first group to be considered during evictions. The advantage is that scanned blocks are more likely to get evicted than blocks that are getting more usage.
  • Multi access priority: If a block in the previous priority group is accessed again, it upgrades to this priority. It is thus part of the second group considered during evictions.
  • In-memory access priority: If the block's family was configured to be "in-memory", it will be part of this priority disregarding the number of times it was accessed. Catalog tables are configured like this. This group is the last one considered during evictions.

For more information, see the LruBlockCache source
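
The three priorities can be modeled in a few lines. What follows is a deliberately simplified sketch, not the LruBlockCache implementation: the class PriorityLruSketch and its methods are hypothetical, capacity is counted in blocks rather than bytes, and eviction strictly empties the single-access group before touching the next one.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the three block priorities (hypothetical names throughout).
final class PriorityLruSketch {
    // Declared in the order the groups are considered during eviction.
    enum Priority { SINGLE, MULTI, MEMORY }

    private final int capacity;
    // accessOrder=true keeps the least recently used block first in iteration order.
    private final LinkedHashMap<String, Priority> blocks =
        new LinkedHashMap<>(16, 0.75f, true);

    PriorityLruSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    // A newly loaded block starts as SINGLE unless its family is "in-memory".
    void cacheBlock(String name, boolean inMemoryFamily) {
        blocks.put(name, inMemoryFamily ? Priority.MEMORY : Priority.SINGLE);
        while (blocks.size() > capacity) {
            evictOne();
        }
    }

    // A second access promotes a SINGLE block to MULTI.
    boolean getBlock(String name) {
        Priority p = blocks.get(name); // counts as an access
        if (p == null) {
            return false;
        }
        if (p == Priority.SINGLE) {
            blocks.put(name, Priority.MULTI);
        }
        return true;
    }

    boolean contains(String name) {
        return blocks.containsKey(name);
    }

    // Remove the least recently used block of the lowest non-empty priority.
    private void evictOne() {
        for (Priority p : Priority.values()) {
            Iterator<Map.Entry<String, Priority>> it = blocks.entrySet().iterator();
            while (it.hasNext()) {
                if (it.next().getValue() == p) {
                    it.remove();
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        PriorityLruSketch cache = new PriorityLruSketch(2);
        cache.cacheBlock("a", false);
        cache.cacheBlock("b", false);
        cache.getBlock("a");          // "a" is now MULTI
        cache.cacheBlock("c", false); // over capacity, so one block goes
        System.out.println("a cached: " + cache.contains("a"));
        System.out.println("b cached: " + cache.contains("b"));
        System.out.println("c cached: " + cache.contains("c"));
    }
}
```

In this run "b" is the block that gets dropped: it was read only once, which is the scan-resistance idea in miniature.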

9.6.4.2. Usage

Block caching is enabled by default for all the user tables which means that any read operation will load the LRU cache. This might be good for a large number of use cases, but further tunings are usually required in order to achieve better performance. An important concept is the working set size, or WSS, which is: "the amount of memory needed to compute the answer to a problem". For a website, this would be the data that's needed to answer the queries over a short amount of time.

The way to calculate how much memory is available in HBase for caching is:

            number of region servers * heap size * hfile.block.cache.size * 0.85
        

The default value for the block cache is 0.25 which represents 25% of the available heap. The last value (85%) is the default acceptable loading factor in the LRU cache after which eviction is started. The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks. Here are some examples:

  • One region server with the default heap size (1GB) and the default block cache size will have 217MB of block cache available.
  • 20 region servers with the heap size set to 8GB and a default block cache size will have 34GB of block cache.
  • 100 region servers with the heap size set to 24GB and a block cache size of 0.5 will have about 1TB of block cache.
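
The three examples above can be checked with a few lines of arithmetic. This is only a sketch of the formula as quoted; the class and method names are hypothetical, and the 0.85 factor is passed in as a parameter because it is a default that can differ between versions.

```java
// Sketch of: number of region servers * heap size * hfile.block.cache.size * 0.85
final class BlockCacheCapacity {
    // Returns the usable block cache in the same unit as heapSize.
    static double usable(int regionServers, double heapSize,
                         double blockCacheFraction, double acceptableFactor) {
        return regionServers * heapSize * blockCacheFraction * acceptableFactor;
    }

    public static void main(String[] args) {
        // One region server, 1GB heap (1024MB), default 0.25
        System.out.println("MB: " + usable(1, 1024, 0.25, 0.85));
        // 20 region servers, 8GB heap each, default 0.25
        System.out.println("GB: " + usable(20, 8, 0.25, 0.85));
        // 100 region servers, 24GB heap each, block cache size 0.5
        System.out.println("GB: " + usable(100, 24, 0.5, 0.85));
    }
}
```

The last case comes out to 1020GB, which is where the "about 1TB" figure comes from.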

Your data isn't the only resident of the block cache, here are others that you may have to take into account:

  • Catalog tables: The -ROOT- and .META. tables are forced into the block cache and have the in-memory priority which means that they are harder to evict. The former never uses more than a few hundreds of bytes while the latter can occupy a few MBs (depending on the number of regions).
  • HFiles indexes: HFile is the file format that HBase uses to store data in HDFS and it contains a multi-layered index in order seek to the data without having to read the whole file. The size of those indexes is a factor of the block size (64KB by default), the size of your keys and the amount of data you are storing. For big data sets it's not unusual to see numbers around 1GB per region server, although not all of it will be in cache because the LRU will evict indexes that aren't used.
  • Keys: Taking into account only the values that are being stored is missing half the picture since every value is stored along with its keys (row key, family, qualifier, and timestamp). See Section 6.3.2, “Try to minimize row and column sizes”.
  • Bloom filters: Just like the HFile indexes, those data structures (when enabled) are stored in the LRU.

Currently the recommended way to measure HFile indexes and bloom filters sizes is to look at the region server web UI and checkout the relevant metrics. For keys, sampling can be done by using the HFile command line tool and look for the average key size metric.

It's generally bad to use block caching when the WSS doesn't fit in memory. This is the case when you have for example 40GB available across all your region servers' block caches but you need to process 1TB of data. One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily. Here are two use cases:

  • Fully random reading pattern: This is a case where you almost never access the same row twice within a short amount of time such that the chance of hitting a cached block is close to 0. Setting block caching on such a table is a waste of memory and CPU cycles, more so that it will generate more garbage to pick up by the JVM. For more information on monitoring GC, see Section 12.2.3, “JVM Garbage Collection Logs”.
  • Mapping a table: In a typical MapReduce job that takes a table in input, every row will be read only once so there's no need to put them into the block cache. The Scan object has the option of turning this off via the setCacheBlocks method (set it to false). You can still keep block caching turned on on this table if you need fast random read access. An example would be counting the number of rows in a table that serves live traffic, caching every block of that table would create massive churn and would surely evict data that's currently in use.



5. http://hbase.apache.org/book.html#perf.hbase.client.blockcache


11.9.5. Block Cache

Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks method. For input Scans to MapReduce jobs, this should be false. For frequently accessed rows, it is advisable to use the block cache.



6. https://issues.apache.org/jira/browse/HBASE-4027


The issue is titled "Enable direct byte buffers LruBlockCache" and applies from 0.92.0 onward.

Setting -XX:MaxDirectMemorySize in hbase-env.sh enables this feature. The file already has a line you can uncomment and you need to set the size of the direct memory (your total memory - size allocated to memstores - size allocated to the normal block cache - some head room for the other functionalities).


Description

Java offers the creation of direct byte buffers which are allocated outside of the heap.

They need to be manually free'd, which can be accomplished using an undocumented clean method.

The feature will be optional. After implementing, we can benchmark for differences in speed and garbage collection observances.
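
As a quick illustration of what "direct byte buffers" means at the JDK level, the sketch below allocates one buffer on the heap and one outside it using the standard java.nio.ByteBuffer API. The class name DirectBufferDemo and its helper are hypothetical; nothing here is HBase code.

```java
import java.nio.ByteBuffer;

// Heap buffer vs. direct (off-heap) buffer, using only the JDK.
final class DirectBufferDemo {
    // offHeap=true asks the JVM for memory outside the Java heap.
    static ByteBuffer allocate(int bytes, boolean offHeap) {
        return offHeap ? ByteBuffer.allocateDirect(bytes) : ByteBuffer.allocate(bytes);
    }

    public static void main(String[] args) {
        ByteBuffer onHeap = allocate(1024, false);
        ByteBuffer direct = allocate(1024, true);
        System.out.println("heap buffer isDirect:   " + onHeap.isDirect());
        System.out.println("direct buffer isDirect: " + direct.isDirect());
    }
}
```

Direct allocations count against -XX:MaxDirectMemorySize rather than -Xmx, which is why the setting above has to be sized separately.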


7. http://hbase.apache.org/book.html#trouble.client.oome.directmemory.leak

12.5.6. Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing)

You are likely running into the issue that is described and worked through in the mail thread HBase, mail # user - Suspected memory leak and continued over in HBase, mail # dev - FeedbackRe: Suspected memory leak. A workaround is passing your client-side JVM a reasonable value for -XX:MaxDirectMemorySize. By default, the MaxDirectMemorySize is equal to your -Xmx max heapsize setting (if -Xmx is set). Try setting it to something smaller (for example, one user had success setting it to 1g when they had a client-side heap of 12g). If you set it too small, it will bring on FullGCs so keep it a bit hefty. You want to make this setting client-side only especially if you are running the new experimental server-side off-heap cache since this feature depends on being able to use big direct buffers (You may have to keep separate client-side and server-side config dirs).




8. https://issues.apache.org/jira/browse/HBASE-6312

Make BlockCache eviction thresholds configurable

Applied in 0.95.0


Release Note:

From now on, the block cache will use all the memory it's given as its upper bound was raised from 85% to 99%. The lower bound for evictions, called "minimum factor", was raised from 75% to 95% and is now configurable via "hbase.lru.blockcache.min.factor". This means that 4% of the block cache is evicted at a time instead of 10%, so evictions may run more often but each will be less disruptive. 
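
The 10% and 4% figures follow directly from the gap between the two thresholds. A minimal sketch of that arithmetic, with a hypothetical class name and a made-up cache size of 1000 units:

```java
// How much is freed per eviction run: (acceptable factor - minimum factor) * cache size.
final class EvictionThresholds {
    // Size at which eviction starts.
    static long evictionStart(long cacheSize, double acceptableFactor) {
        return Math.round(cacheSize * acceptableFactor);
    }

    // Size that an eviction run frees down to.
    static long evictionTarget(long cacheSize, double minFactor) {
        return Math.round(cacheSize * minFactor);
    }

    static long freedPerEviction(long cacheSize, double acceptableFactor, double minFactor) {
        return evictionStart(cacheSize, acceptableFactor) - evictionTarget(cacheSize, minFactor);
    }

    public static void main(String[] args) {
        long cacheSize = 1000; // arbitrary units, chosen for easy percentages
        System.out.println("old (0.85 / 0.75): " + freedPerEviction(cacheSize, 0.85, 0.75));
        System.out.println("new (0.99 / 0.95): " + freedPerEviction(cacheSize, 0.99, 0.95));
    }
}
```

With the newer defaults the cache also fills to 99% before evicting, so the 0.85 in the capacity formula from item 4 would become 0.99 on versions that include this change.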


Description

Some of our customers found that tuning the BlockCache eviction thresholds made test results different in their test environment. However, those thresholds are not configurable in the current implementation. The only way to change those values is to re-compile the HBase source code. We wonder if it is possible to make them configurable.



9. https://issues.apache.org/jira/browse/HBASE-7404


Bucket Cache:A solution about CMS,Heap Fragment and Big Cache on HBASE (applied in 0.95.0)




  • Release Note:
    BucketCache is another implementation of BlockCache which supports big block cache for high performance and would greatly decrease CMS and heap fragmentation in JVM caused by read activities. 


    Usage: 

    1.Use bucket cache as main memory cache, configured as the following: 
    –"hbase.bucketcache.ioengine" "heap" 
    –"hbase.bucketcache.size" 0.4 (size for bucket cache, 0.4 is a percentage of max heap size) 

    2.Use bucket cache as a secondary cache, configured as the following: 
    –"hbase.bucketcache.ioengine" "file:/disk1/hbase/cache.data"(The file path where to store the block data) 
    –"hbase.bucketcache.size" 1024 (size for bucket cache, unit is MB, so 1024 means 1GB) 
    –"hbase.bucketcache.combinedcache.enabled" false (default value being true) 


Description

First, thanks @neil from Fusion-IO share the source code.

See more configurations from org.apache.hadoop.hbase.io.hfile.CacheConfig and org.apache.hadoop.hbase.io.hfile.bucket.BucketCache

What's Bucket Cache? 
It could greatly decrease CMS and heap fragment by GC
It support a large cache space for High Read Performance by using high speed disk like Fusion-io

1.An implementation of block cache like LruBlockCache
2.Self manage blocks' storage position through Bucket Allocator
3.The cached blocks could be stored in the memory or file system
4.Bucket Cache could be used as a mainly block cache(see CombinedBlockCache), combined with LruBlockCache to decrease CMS and fragment by GC.
5.BucketCache also could be used as a secondary cache(e.g. using Fusionio to store block) to enlarge cache space

How about SlabCache?
We have studied and test SlabCache first, but the result is bad, because:
1.SlabCache use SingleSizeCache, its use ratio of memory is low because kinds of block size, especially using DataBlockEncoding
2.SlabCache is used in DoubleBlockCache, block is cached both in SlabCache and LruBlockCache, put the block to LruBlockCache again if hit in SlabCache , it causes CMS and heap fragment don't get any better
3.Direct heap performance is not good as heap, and maybe cause OOM, so we recommend using "heap" engine

See more in the attachment and in the patch



10. http://zoomq.qiniudn.com/ZQScrapBook/ZqFLOSS/data/20130319094323/index.html

The material is written in Chinese, so it is hard to follow, but the version translated into English is just about readable.



11. http://www.venturesquare.net/514286

The Between app runs with hfile.block.cache.size set to 0.5.


12. http://www.marshut.com/kikq/does-hbase-regionserver-benefit-from-os-page-cache.html

An interesting related discussion.


13. Excerpt from HBase's CacheConfig.java

The source shows which settings decide the block cache policy when the block cache is initialized.


 instantiateBlockCache



  /**
   * Returns the block cache or <code>null</code> in case none should be used.
   *
   * @param conf  The current configuration.
   * @return The block cache or <code>null</code>.
   */
  private static synchronized BlockCache instantiateBlockCache(Configuration conf) {
    if (globalBlockCache != null) return globalBlockCache;
    if (blockCacheDisabled) return null;

    float cachePercentage = conf.getFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY,
      HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT);
    if (cachePercentage == 0L) {
      blockCacheDisabled = true;
      return null;
    }
    if (cachePercentage > 1.0) {
      throw new IllegalArgumentException(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY +
        " must be between 0.0 and 1.0, and not > 1.0");
    }

    // Calculate the amount of heap to give the heap.
    MemoryUsage mu = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    long lruCacheSize = (long) (mu.getMax() * cachePercentage);
    int blockSize = conf.getInt("hbase.offheapcache.minblocksize", HConstants.DEFAULT_BLOCKSIZE);
    long offHeapCacheSize =
      (long) (conf.getFloat("hbase.offheapcache.percentage", (float) 0) *
          DirectMemoryUtils.getDirectMemorySize());
    if (offHeapCacheSize <= 0) {
      String bucketCacheIOEngineName = conf.get(BUCKET_CACHE_IOENGINE_KEY, null);
      float bucketCachePercentage = conf.getFloat(BUCKET_CACHE_SIZE_KEY, 0F);
      // A percentage of max heap size or a absolute value with unit megabytes
      long bucketCacheSize = (long) (bucketCachePercentage < 1 ? mu.getMax()
          * bucketCachePercentage : bucketCachePercentage * 1024 * 1024);

      boolean combinedWithLru = conf.getBoolean(BUCKET_CACHE_COMBINED_KEY,
          DEFAULT_BUCKET_CACHE_COMBINED);
      BucketCache bucketCache = null;
      if (bucketCacheIOEngineName != null && bucketCacheSize > 0) {
        int writerThreads = conf.getInt(BUCKET_CACHE_WRITER_THREADS_KEY,
            DEFAULT_BUCKET_CACHE_WRITER_THREADS);
        int writerQueueLen = conf.getInt(BUCKET_CACHE_WRITER_QUEUE_KEY,
            DEFAULT_BUCKET_CACHE_WRITER_QUEUE);
        String persistentPath = conf.get(BUCKET_CACHE_PERSISTENT_PATH_KEY);
        float combinedPercentage = conf.getFloat(
            BUCKET_CACHE_COMBINED_PERCENTAGE_KEY,
            DEFAULT_BUCKET_CACHE_COMBINED_PERCENTAGE);
        if (combinedWithLru) {
          lruCacheSize = (long) ((1 - combinedPercentage) * bucketCacheSize);
          bucketCacheSize = (long) (combinedPercentage * bucketCacheSize);
        }
        try {
          int ioErrorsTolerationDuration = conf.getInt(
              "hbase.bucketcache.ioengine.errors.tolerated.duration",
              BucketCache.DEFAULT_ERROR_TOLERATION_DURATION);
          bucketCache = new BucketCache(bucketCacheIOEngineName,
              bucketCacheSize, writerThreads, writerQueueLen, persistentPath,
              ioErrorsTolerationDuration);
        } catch (IOException ioex) {
          LOG.error("Can't instantiate bucket cache", ioex);
          throw new RuntimeException(ioex);
        }
      }
      LOG.info("Allocating LruBlockCache with maximum size " +
        StringUtils.humanReadableInt(lruCacheSize));
      LruBlockCache lruCache = new LruBlockCache(lruCacheSize, StoreFile.DEFAULT_BLOCKSIZE_SMALL);
      lruCache.setVictimCache(bucketCache);
      if (bucketCache != null && combinedWithLru) {
        globalBlockCache = new CombinedBlockCache(lruCache, bucketCache);
      } else {
        globalBlockCache = lruCache;
      }
    } else {
      globalBlockCache = new DoubleBlockCache(lruCacheSize, offHeapCacheSize,
          StoreFile.DEFAULT_BLOCKSIZE_SMALL, blockSize, conf);
    }
    return globalBlockCache;
  }
}
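
The sizing arithmetic in the excerpt is easier to follow once it is pulled out on its own. The sketch below is a stand-alone reconstruction for illustration only: the class CacheSizingSketch and its method names are hypothetical, and the combined percentage is passed in as a parameter because its default value is not shown in the excerpt.

```java
// Stand-alone reconstruction of the size calculations in instantiateBlockCache.
final class CacheSizingSketch {
    // hbase.bucketcache.size: a value below 1 is a fraction of max heap,
    // anything else is an absolute size in megabytes.
    static long bucketCacheBytes(long maxHeapBytes, float configuredSize) {
        return (long) (configuredSize < 1
            ? maxHeapBytes * configuredSize
            : configuredSize * 1024 * 1024);
    }

    // With the combined cache enabled, the configured bucket cache size is split:
    // index 0 is the LruBlockCache share, index 1 is the BucketCache share.
    static long[] splitCombined(long bucketCacheBytes, float combinedPercentage) {
        long lru = (long) ((1 - combinedPercentage) * bucketCacheBytes);
        long bucket = (long) (combinedPercentage * bucketCacheBytes);
        return new long[] { lru, bucket };
    }

    public static void main(String[] args) {
        long maxHeap = 8L * 1024 * 1024 * 1024; // assumed 8GB heap, for illustration
        long asFraction = bucketCacheBytes(maxHeap, 0.5f);
        long asMegabytes = bucketCacheBytes(maxHeap, 1024f);
        System.out.println("size=0.5  -> " + asFraction + " bytes");
        System.out.println("size=1024 -> " + asMegabytes + " bytes");

        long[] split = splitCombined(asMegabytes, 0.5f); // 0.5 is an arbitrary example value
        System.out.println("lru=" + split[0] + " bucket=" + split[1]);
    }
}
```

One thing the excerpt makes visible: when a bucket cache is configured and the combined cache is enabled, lruCacheSize is overwritten with a share of the bucket cache size, so hfile.block.cache.size no longer decides how large the LruBlockCache is.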


14. HBase settings

These do not appear on the HBase configuration page, but the source defines settings that let you tune each cache to suitable values.


* LruBlockCache

hbase.lru.blockcache.min.factor

hbase.lru.blockcache.acceptable.factor


* SlabCache

hbase.offheapcache.slab.proportions

hbase.offheapcache.slab.sizes


* BucketCache

hbase.bucketcache.ioengine

hbase.bucketcache.size

hbase.bucketcache.persistent.path

hbase.bucketcache.combinedcache.enabled

hbase.bucketcache.percentage.in.combinedcache

hbase.bucketcache.writer.threads

hbase.bucketcache.writer.queuelength


* Off heap cache : settings related to the off-heap area

hbase.offheapcache.minblocksize

hbase.offheapcache.percentage


* compressed

 hbase.rs.blockcache.cachedatacompressed : Configuration key to cache data blocks in compressed format.


hbase.rs.evictblocksonclose : Configuration key to evict all blocks of a given file from the block cache when the file is closed.



15. DirectMemory

http://coders.talend.com/sites/default/files/heapoff-wtf_OlivierLamy.pdf

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.92.1/org/apache/hadoop/hbase/util/DirectMemoryUtils.java

import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.util.List;

import com.google.common.base.Preconditions;

public class DirectMemoryUtils {

  /**
   * @return the setting of -XX:MaxDirectMemorySize as a long. Returns 0 if
   *         -XX:MaxDirectMemorySize is not set.
   */
  public static long getDirectMemorySize() {
    RuntimeMXBean RuntimemxBean = ManagementFactory.getRuntimeMXBean();
    List<String> arguments = RuntimemxBean.getInputArguments();
    long multiplier = 1; //for the byte case.
    for (String s : arguments) {
      if (s.contains("-XX:MaxDirectMemorySize=")) {
        String memSize = s.toLowerCase()
            .replace("-xx:maxdirectmemorysize=", "").trim();

        // A k/m/g suffix selects the unit; no suffix means bytes.
        if (memSize.contains("k")) {
          multiplier = 1024;
        }
        else if (memSize.contains("m")) {
          multiplier = 1048576;
        }
        else if (memSize.contains("g")) {
          multiplier = 1073741824;
        }
        memSize = memSize.replaceAll("[^\\d]", "");

        long retValue = Long.parseLong(memSize);
        return retValue * multiplier;
      }
    }
    return 0;
  }

  /**
   * DirectByteBuffers are garbage collected by using a phantom reference and a
   * reference queue. Every once in a while, the JVM checks the reference queue and
   * cleans the DirectByteBuffers. However, as this doesn't happen immediately
   * after discarding all references to a DirectByteBuffer, it's easy to
   * OutOfMemoryError yourself using DirectByteBuffers. This function explicitly
   * calls the Cleaner method of a DirectByteBuffer.
   *
   * @param toBeDestroyed The DirectByteBuffer that will be "cleaned". Utilizes reflection.
   */
  public static void destroyDirectByteBuffer(ByteBuffer toBeDestroyed)
      throws IllegalArgumentException, IllegalAccessException,
      InvocationTargetException, SecurityException, NoSuchMethodException {

    Preconditions.checkArgument(toBeDestroyed.isDirect(),
        "toBeDestroyed isn't direct!");

    Method cleanerMethod = toBeDestroyed.getClass().getMethod("cleaner");
    cleanerMethod.setAccessible(true);
    Object cleaner = cleanerMethod.invoke(toBeDestroyed);
    Method cleanMethod = cleaner.getClass().getMethod("clean");
    cleanMethod.setAccessible(true);
    cleanMethod.invoke(cleaner);
  }
}


16. Related source

src/main/java/org/apache/hadoop/hbase/io/hfile 

src/main/java/org/apache/hadoop/hbase/io/hfile/slab/

src/main/java/org/apache/hadoop/hbase/util/DirectMemoryUtils.java

src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java

src/test/java/org/apache/hadoop/hbase/io/hfile/

src/test/java/org/apache/hadoop/hbase/io/hfile/slab/

Posted by '김용환'