To speed up heavy DB dump jobs in Sqoop, consider the options below. They definitely help: in my case a 10-minute batch dropped into the one-minute range.
1) Give each mapper plenty of memory
-Dmapreduce.map.memory.mb=(large, but within reason) -Dmapreduce.map.java.opts=-Xmx(large, but within reason)
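For example (a minimal sketch; the connection string and table name are hypothetical): the generic Hadoop -D options must come immediately after the tool name, before any Sqoop-specific arguments, and the -Xmx heap should be somewhat smaller than the container size:

$ sqoop import \
    -Dmapreduce.map.memory.mb=4096 \
    -Dmapreduce.map.java.opts=-Xmx3686m \
    --connect jdbc:mysql://db.example.com/mydb \
    --table orders \
    --target-dir /user/foo/orders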
2) Use plenty of mappers
--num-mappers (large, but within reason)
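For example (hypothetical values): each mapper opens its own connection to the source database, so keep the count within what the DB can comfortably serve:

$ sqoop import \
    --connect jdbc:mysql://db.example.com/mydb \
    --table orders \
    --num-mappers 8 \
    --target-dir /user/foo/orders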
3) --split-by and $CONDITIONS
--split-by id : the column used to split the table across mappers
$CONDITIONS : a placeholder token in the query's WHERE clause that Sqoop replaces with a unique range condition for each map task
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
For example:
$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
    --split-by a.id \
    --target-dir /user/foo/joinresults
Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:
$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
    -m 1 \
    --target-dir /user/foo/joinresults
4) --boundary-query <SQL statement>
By default Sqoop runs SELECT MIN(split-col), MAX(split-col) against the table to compute split ranges; --boundary-query lets you supply a cheaper or more selective query for those bounds.
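A minimal sketch (table, column, and predicate are hypothetical); the boundary query must return exactly two values, which Sqoop uses as the lower and upper bounds of the splits:

$ sqoop import \
    --connect jdbc:mysql://db.example.com/mydb \
    --table orders \
    --split-by id \
    --boundary-query 'SELECT MIN(id), MAX(id) FROM orders WHERE created_at >= CURDATE()' \
    --target-dir /user/foo/orders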