Importing Hive Data into HBase

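This article introduces three ways to import Hive data into HBase and focuses on an efficient one that uses MapReduce to generate HFiles and then loads them. It has six steps: 1. range partitioning, 2. creating the storage table, 3. generating the HFiles, 4. creating the HBase table, 5. loading the HFiles, 6. testing the data. It also covers some common problems and how to fix them.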

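There are roughly three ways to import Hive data into HBase:

Create a table in Hive whose data is stored in HBase. This approach is simple, but once the data volume exceeds ten million rows, data skew becomes noticeable and throughput is poor.
Define a Hive UDF that writes the data to HBase. If the HBase table's regions are pre-split, this direct put approach performs acceptably.
Generate HFiles directly with MapReduce and then load them into HBase. This approach involves many steps, but it is efficient and easily reaches 30 million rows per minute.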

The rest of this article walks through the third approach, generating HFiles with MapReduce and then loading them into HBase, in six steps:

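1. Generate the range partitioning

This step determines how many reducers are used later when generating the HFiles. For example, the SQL below returns 137556197/100/22000 ≈ 62 rows, each of which is a split key. Those 62 keys divide the key space into 63 ranges, so 63 reducers are used when generating the HFiles.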

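    -- row_sequence() is not a built-in function; it ships in hive-contrib and must be registered first:
    -- CREATE TEMPORARY FUNCTION row_sequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
    select list_no, row_sequence() from (
        select
            list_no,
            row_sequence() as row
        from (
            select
                list_no
            from user_info tablesample(bucket 1 out of 100 on list_no) s order by list_no
        ) t order by list_no
    ) x where (row % 22000) = 0 order by list_no;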
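The reducer count follows from the sampling arithmetic: roughly one row in 100 is sampled, and every 22000th sampled row is kept as a split key. As a minimal sketch, the same calculation can be checked in HiveQL, assuming a Hive version that allows SELECT without a FROM clause (the column aliases are only illustrative):

```sql
-- 137556197 total rows, 1-in-100 bucket sample, one split key per 22000 sampled rows.
-- N split keys cut the sorted key space into N + 1 ranges, one per reducer.
SELECT
    floor(137556197 / 100 / 22000)     AS split_keys,  -- 62
    floor(137556197 / 100 / 22000) + 1 AS reducers;    -- 63
```

For a table of a different size, the sampling ratio and the 22000 stride would need to be adjusted so that the resulting number of ranges matches the number of reducers (and HFiles) you want.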

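2. Create the table that stores the HFile data
This table holds the output written by HiveHFileOutputFormat. The hfile.family.path property is the directory the HFiles are written to; its last path component (f here) is the column family name.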

    CREATE EXTERNAL TABLE IF NOT EXISTS hbase_user_info(
        list_no string,
        asset_id double,
        ...
    )
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
    TBLPROPERTIES('hfile.family.path' = '/tmp-data/user/hbase/hbase_hfiles/user_info/f');

3. Generate the HFiles

    INSERT OVERWRITE TABLE hbase_user_info
    SELECT * FROM user_info CLUSTER BY list_no;
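For the INSERT above to produce one sorted, non-overlapping HFile per key range, the session usually has to be configured before the statement runs. The following is only a sketch, assuming the split keys from step 1 have already been written to a partition file: the path /tmp/hb_range_key_list is hypothetical, and the property and class names follow the older Hive/Hadoop naming, so they may differ in your version (newer Hadoop releases name the path property mapreduce.totalorderpartitioner.path).

```sql
-- Assumption: 63 reducers, one per key range defined by the 62 split keys from step 1.
SET mapred.reduce.tasks=63;

-- Route each row to the reducer that owns its key range, rather than hashing the key.
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

-- Hypothetical location of the partition file holding the split keys.
SET total.order.partitioner.path=/tmp/hb_range_key_list;
```

With a total-order partitioner in place, CLUSTER BY list_no sorts rows within each reducer, and the reducers together cover the key space in order, which is what HBase requires of HFiles that are loaded in bulk.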

4.