To speed up heavy DB dump jobs in sqoop, consider the following options. They make a real difference: a 10-minute batch can come down to around 1 minute.
1) Give each mapper plenty of memory
-Dmapreduce.map.memory.mb=(large, but reasonable) -Dmapreduce.map.java.opts=-Xmx(large, but reasonable, and smaller than the container size)
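One way to keep the two memory values consistent is to derive the heap from the container size. The sketch below assumes the common rule of thumb of giving the JVM heap about 80% of the container; the ratio, the function name sqoop_mem_opts, and the 4096 MB figure are illustrative assumptions, not values from the Sqoop documentation. Hadoop generic options such as -D have to come right after the tool name (sqoop import -D...), before any Sqoop-specific arguments.

```shell
# Print the two -D options for a given mapper container size in MB.
# Assumption: heap = 80% of the container, leaving room for non-heap JVM memory.
sqoop_mem_opts() {
  local container_mb="$1"
  local heap_mb=$(( container_mb * 80 / 100 ))   # integer arithmetic, rounds down
  printf -- '-Dmapreduce.map.memory.mb=%d -Dmapreduce.map.java.opts=-Xmx%dm\n' \
    "$container_mb" "$heap_mb"
}

sqoop_mem_opts 4096
```

The printed options can be pasted directly after "sqoop import". If mappers are killed for exceeding their container limit, lowering the ratio is usually a safer first step than raising -Xmx.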
2) Use many mappers
--num-mappers (large, but reasonable; each mapper opens its own connection to the database)
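3) split-by and $CONDITIONS
--split-by id : name of the column used to split the work across mappers
$CONDITIONS : placeholder token in the query that Sqoop replaces with each mapper's range condition on the split column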
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
For example:
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults
Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults
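To make the role of $CONDITIONS concrete, the sketch below prints the kind of range predicate each mapper might receive for an integer split column. It only approximates the idea and does not reproduce Sqoop's own splitter: the function name show_splits, the column a.id, and the bounds 1 to 1000 are made up for illustration, and the boundaries Sqoop actually picks may differ.

```shell
# Print one WHERE fragment per mapper for an integer split column.
# Usage: show_splits <column> <min> <max> <num_mappers>
show_splits() {
  local col="$1"
  local min="$2"
  local max="$3"
  local n="$4"
  local step=$(( (max - min + n) / n ))   # ceil((max - min + 1) / n)
  local lo="$min"
  local i=1
  while [ "$i" -lt "$n" ]; do
    # Every range except the last is half-open: [lo, lo + step)
    echo "mapper $i: $col >= $lo AND $col < $(( lo + step ))"
    lo=$(( lo + step ))
    i=$(( i + 1 ))
  done
  # The last range is closed so that the max value itself is included
  echo "mapper $n: $col >= $lo AND $col <= $max"
}

show_splits a.id 1 1000 4
```

With these made-up bounds the four mappers cover ids 1-250, 251-500, 501-750, and 751-1000. Because the ranges are equal in width rather than equal in row count, a split column with unevenly distributed values leaves some mappers with far more rows than others.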
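4) --boundary-query <SQL statement> : a query returning the min and max of the split column, used instead of Sqoop's default MIN/MAX lookup when computing the splits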
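The boundary query should return two numeric values, the lower and upper bound of the split column; supplying a cheap one saves Sqoop from working out the range itself, which can be slow for a free-form query. The sketch below combines the four options in one dry-run wrapper that prints the command instead of running it. The JDBC URL, the memory sizes, the mapper count, and the helper names run and import_joined are placeholders chosen for illustration.

```shell
# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' "$@"   # shell quoting is not reproduced in this preview
    echo
  else
    "$@"
  fi
}

# Hypothetical job: the JDBC URL, sizes, and mapper count are placeholders.
import_joined() {
  run sqoop import \
    -Dmapreduce.map.memory.mb=4096 \
    -Dmapreduce.map.java.opts=-Xmx3276m \
    --connect jdbc:mysql://db.example.com/corp \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
    --split-by a.id \
    --boundary-query 'SELECT MIN(id), MAX(id) FROM a' \
    --num-mappers 8 \
    --target-dir /user/foo/joinresults
}

import_joined
```

The query is kept in single quotes so the shell passes $CONDITIONS through untouched; inside double quotes it would have to be written as \$CONDITIONS.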