SQL Interview Questions Sava
1. How to create an empty table in hive from another table without copying data?
use retail;
create table paymentscopy as select * from payments where 1=2;
or
Efficient way - print the table's DDL and rerun it under the new table name, so no select query has to run:
show create table payments;
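If the goal is only to clone the table definition, Hive's create table ... like statement is another option. A minimal sketch, assuming the copy is named paymentscopy2 (a hypothetical name, not from the original script):

```sql
use retail;
-- copy only the definition of payments (columns, storage format); no rows are copied
-- paymentscopy2 is a hypothetical table name used for illustration
create table if not exists paymentscopy2 like payments;
-- confirm that the new table is empty
select count(1) from paymentscopy2;
```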
3. Can we insert data into a Hive table, multiple records at a time, without using the load command?
Yes, insert into ... values accepts several rows in one statement.
Duplicate handling:
Option 1:
select distinct * from payments; -- 274 records
5. How to identify which customer table records are duplicated and how many duplicates there are?
When do you use where and when do you use having for filtering data?
Ans: where filters individual rows before grouping, whereas having filters on aggregated results after group by.
select - from - where - group by - having - order by - limit;
select customernumber, customername, contactlastname, contactfirstname, phone, addressline1,
addressline2, city, state, postalcode, country, salesrepemployeenumber, creditlimit,count(1)
from customers
group by customernumber, customername, contactlastname, contactfirstname, phone, addressline1,
addressline2, city, state, postalcode, country, salesrepemployeenumber, creditlimit
having count(1)>1;
6. How to remove/delete duplicate payments and retain only the de-duplicated payment information in a
table with or without partitions?
How do you delete duplicate data from a Hive table? Or can you delete data from a Hive table?
--don’t run this insert - insert overwrite table payments select distinct * from payments;
select * from payments_dup;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table payments_part partition (paymentdate) select
customernumber,checknumber,amount,paymentdate from payments;
select customernumber,checknumber,amount,paymentdate,count(1)
from payments_part
group by customernumber,checknumber,amount,paymentdate
having count(1)>1;
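For the partitioned table, one way to retain only the de-duplicated rows is to overwrite the partitions with a distinct select over the same table. This is a sketch under the assumption that rewriting every partition of payments_part is acceptable; try it on a copy first.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- rewrite each partition with only its distinct rows
-- the partition column (paymentdate) must come last in the select list
insert overwrite table payments_part partition (paymentdate)
select distinct customernumber, checknumber, amount, paymentdate
from payments_part;
```

Rerunning the having count(1)>1 check afterwards should return no rows.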
7. Show the info of the customer who made the very first payment to our company, i.e. the very first
customer of the company?
select * from payments_part
where paymentdate in (select min(paymentdate) from payments_part );
Show the first payment made by a given customer to our company?
select * from payments where customernumber=496;
select * from payments where customernumber=496 and paymentdate in (select min(paymentdate) from
payments_part where customernumber=496);
Show the last (most recent) payment made by the same customer?
select * from payments where customernumber=496 and paymentdate in (select max(paymentdate) from
payments_part where customernumber=496);
9. Show the last-but-one (second most recent) payment made by customer 496?
A nested subquery is one way to achieve the result, but it is not the right way: Hive does not support
more than one level of nested subquery.
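A windowing function avoids the nested subquery altogether. The sketch below numbers the payments of customer 496 from newest to oldest and keeps the second one; the alias names rn and ranked are illustrative.

```sql
select customernumber, checknumber, amount, paymentdate
from (
  select p.*,
         -- 1 = most recent payment, 2 = the one before it, and so on
         row_number() over (partition by customernumber order by paymentdate desc) as rn
  from payments_part p
  where customernumber = 496
) ranked
where rn = 2;
```

If two payments can share the same paymentdate, dense_rank() in place of row_number() returns every payment made on the second most recent date.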
10. Show the lowest payment, highest, number of payment made by the customer 496?
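A minimal sketch for this question using plain aggregates; the output column aliases are illustrative.

```sql
select customernumber,
       min(amount) as lowest_payment,
       max(amount) as highest_payment,
       count(1)    as number_of_payments
from payments_part
where customernumber = 496
group by customernumber;
```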
insert into `payments_part` partition(paymentdate='2016-10-30') values (496,'HQ336436',52166.01);
Windowing Functions:
Cume_dist
Returns the cumulative distribution of a value within the window: the number of rows with a value less than or equal to the current row's value,
divided by the total number of rows. The result is greater than 0 and at most 1. For example, with 10 rows and no ties, the 1st row gets 1/10,
the 2nd row 2/10, and so on up to 10/10.
Rank
Returns the rank of each value within the result set of the over clause. If two values are the same, both get the same rank, and the
following rank is skipped for the next value.
Row_number
Returns a continuous sequence of numbers for all the rows in the result set of the over clause.
Dense_rank
Same as rank(), except that no rank is skipped after duplicate values: each unique value gets the next rank in sequence.
set hive.cli.print.header=true;
select customernumber, paymentdate, amount,
  rank() over(partition by customernumber order by amount desc) as rnk,
  dense_rank() over(partition by customernumber order by amount desc) as d_rank,
  row_number() over(partition by customernumber order by amount desc) as rownum,
  cume_dist() over(partition by customernumber order by paymentdate) as cumulative_dist
from payments_part
where customernumber=496;
customernumber  checknumber  amount    paymentdate
114             MA765515     82261.22  2016-10-15   1
114             GG31455      45864.03  2016-10-20   2
114             NR27552      44894.74  2016-10-10   3
114             NP603840     7565.08   2016-10-31   4

customernumber  checknumber  amount    paymentdate  R_N  D_R  Rank
496             MN89921      52166.01  2016-10-30   1    1    1
496             HQ336436     52166.01  2016-10-31   2    1    1
496             MB342426     32077.44  2016-10-16   3    2    3
496             EU531600     30253.75  2016-10-25   4    3    4
12. Show whether the customer's purchase rate is growing or declining relative to the largest and smallest
payment made?
13. Show whether the customer's purchase rate is growing or declining relative to the immediately previous
and next payment made?
Analytical Functions:
--case when condition then exec when cond2 then exe else exec end as alias
select customernumber, paymentdate, amount,
  case when amount_paid_previous_day is null then "no prior payment"
       when amount_paid_previous_day > amount then "prior payment is high"
       when amount_paid_previous_day = amount then "prior payment is same"
       else "prior payment is low" end as lag,
  case when amount_paid_next_day is null then "no next payment"
       when amount_paid_next_day > amount then "next day payment is high"
       when amount_paid_next_day = amount then "next day payment is same"
       else "next day payment is low" end as lead
from (select customernumber, paymentdate, amount,
        lag(amount) over(partition by customernumber order by paymentdate) as amount_paid_previous_day,
        lead(amount) over(partition by customernumber order by paymentdate) as amount_paid_next_day
      from payments_part where customernumber in (496)) as temp
order by customernumber, paymentdate;
16. How to create version number for the payments made by the customers?
current data
cid ver amt
10 1 1000
10 2 1100
10 3 3000
10 4 2000
10 5 3050
next day
cid amt
10 2000
10 3050
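The version number in the sample above can be generated with row_number(). The sketch below is an assumption-heavy illustration: payments_versioned (cid, ver, amt) and payments_nextday (cid, amt, paymentdate) are hypothetical table names, and the next-day rows are assumed to carry a paymentdate to order them by.

```sql
-- initial load: number each customer's payments in date order
select customernumber as cid,
       row_number() over (partition by customernumber order by paymentdate) as ver,
       amount as amt
from payments_part;

-- next day: continue numbering after the highest version already stored
-- payments_versioned and payments_nextday are hypothetical tables
select n.cid,
       coalesce(m.max_ver, 0) + n.rn as ver,
       n.amt
from (
  select cid, amt,
         row_number() over (partition by cid order by paymentdate) as rn
  from payments_nextday
) n
left outer join (
  select cid, max(ver) as max_ver
  from payments_versioned
  group by cid
) m on n.cid = m.cid;
```

The left outer join with coalesce lets a brand-new customer start at version 1.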
Joins: Inner, Outer (left, right, full), semi, anti, self, cross
insert into `payments` values
(1000,'HQ336336','2016-10-19','6066.78'),(1003,'JM555205','2016-10-05','14571.44');
17. Show me only the customers who made payments and what the payment amount is (this is possible
only with a join).
Join, subquery or exists can be used to find the customers, but a subquery cannot compare multiple
columns (e.g. customernumber and customerphonenumber), and when we also need some columns from the
payments table we can only use a join.
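A minimal sketch of the inner join for this question, returning one row per payment along with the customer details:

```sql
select c.customernumber, c.customername, p.checknumber, p.amount
from customers c
inner join payments p
  on c.customernumber = p.customernumber;
```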
19. Show me the payments made anonymously, i.e. by customers with no details in the customers table
select p.customernumber,c.customername,p.amount from customers c right outer join payments p on
c.customernumber=p.customernumber where c.customernumber is null;
20. show the customers and payments info in both the above cases
select c.customernumber,c.customername,p.amount from customers c full outer join payments p on
c.customernumber=p.customernumber;
select c.* from customers c left semi join payments p on (c.customernumber=p.customernumber and
c.customernumber=496);
23. show me the combined result of two different tables picking 1 column from table1 and 2 columns from
table2.
select p.customernumber,p.amount,'NA' as COUNTRY from payments as p where P.amount between 6000
and 10000
union all select customernumber,0 as AMOUNT,country from customers where customernumber>400;
24. Customers who made a payment (intersect), and who are the anonymous customers (minus), as in the
use cases above?
Hive versions before 2.3.0 don't support intersect or minus; there we need to use an inner join or a
left outer join (keeping only the rows with no match) respectively.
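The two join-based equivalents can be sketched as follows, comparing on customernumber only:

```sql
-- intersect: customer numbers present in both customers and payments
select distinct c.customernumber
from customers c
inner join payments p
  on c.customernumber = p.customernumber;

-- minus: customer numbers in payments that have no row in customers
select distinct p.customernumber
from payments p
left outer join customers c
  on p.customernumber = c.customernumber
where c.customernumber is null;
```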
25. Show me the customers who have more than 1 phone number and more than 1 address?
insert into `customers` values (103,'Atelier graphique','Schmitt','Carine ','908-199-0411','55, rue
Royale',NULL,'Nantes',NULL,'44000','France',1370,'21000.00');
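With the extra row for customer 103 inserted, one possible query groups by customer and counts distinct values. This sketch assumes phone and addressline1 are the columns that identify a phone number and an address.

```sql
select customernumber,
       count(distinct phone)        as phone_count,
       count(distinct addressline1) as address_count
from customers
group by customernumber
having count(distinct phone) > 1
   and count(distinct addressline1) > 1;
```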
Complex types:
26. How do you create structure data from set of columns in hive, how do you group the list of columns and
form complex type like structure or json objects?
select named_struct("id",customernumber,"name",customername) from customers;
27. How to collect the columns as array or how to group or convert column to rows in Hive? Eg. Group all
phone numbers of the given customer? or how do you pivot a column into row?
select customernumber,collect_list(phone) as grouped_phone from customers where customernumber in
(103,112) group by customernumber;
28. How to collect the array as rows or Eg. UnGroup all webpage visited numbers of the given customer? or
how do you unpivot an array column into row of values?
select explode(pagenavigation) from orderpages ;
29. Show me the position of the array elements in an array column in Hive? or show the order in which
customer navigated in the webpage which is stored as array?
select posexplode(pagenavigation) from orderpages ;
30. Display the customer number, comments and the pages navigated from the array data ?
A lateral view is used in conjunction with a UDTF (user-defined table generating function) such as explode().
The lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows
to the input rows to form a virtual table with the supplied table alias.
select customernumber,comments,pgnavigation_column
from orderpages lateral view explode(pagenavigation) exploded_tbl as pgnavigation_column;
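The same lateral view pattern works with posexplode() when the position of each page is needed next to the other columns. In this sketch the aliases pos and page are illustrative; pos starts at 0.

```sql
select customernumber, comments, pos, page
from orderpages
lateral view posexplode(pagenavigation) exploded_tbl as pos, page;
```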
Ordering:
31. Why is order by in Hive more costly? How can you avoid using it?
Order by guarantees a total order, so all rows go through a single reducer. It can be avoided with a
multi-reducer operation using distribute by and sort by, which sorts only within each reducer.
set mapred.reduce.tasks=3;
select * from payments order by customernumber,amount ;
select * from payments distribute by customernumber sort by customernumber,amount ;
select * from payments cluster by customernumber,amount;
or
select * from payments order by customernumber,amount limit 100;
34. Is it possible to delete and update in a Hive table? Have you used it in your project?
Yes, we did it partition-wise, or using ACID (transactional) tables, or using insert overwrite.
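A rough sketch of the ACID route. The table name payments_acid, the column types and the bucket count are assumptions for illustration, and the exact settings required depend on the Hive version and cluster configuration.

```sql
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- payments_acid is a hypothetical transactional copy of payments; column types are assumed
create table payments_acid (
  customernumber int,
  checknumber    string,
  paymentdate    string,
  amount         double
)
clustered by (customernumber) into 4 buckets
stored as orc
tblproperties ('transactional'='true');

insert into table payments_acid
select customernumber, checknumber, paymentdate, amount from payments;

-- row-level changes are allowed on the transactional table
update payments_acid set amount = 0 where checknumber = 'HQ336336';
delete from payments_acid where customernumber = 1000;
```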
36. Table A has 2000 records and table B has 1000 records, and 500 records match between them. What
row counts do you get for inner join, left join, right join, full outer join and cross join?
Assuming the join key is unique in both tables (each matched record pairs with exactly one record on the other side):
inner join - 500 (only the matched records)
left outer - 2000 (all of A)
right outer - 1000 (all of B)
full outer - 2000 + 1000 - 500 = 2500
cross join - 2000 x 1000 = 2000000
40. Where do you see the SQL pivot kind of feature in Hive/Spark?
It is used to convert rows to columns and vice versa.
explode, lateral view or pivot options can be used.
46. Write a query to fetch the latest record, or the most recent transaction?
-- note: row_number() over () has no order by, so this relies on the order in which rows are read, which is not guaranteed
select * from (select row_number() over () as rno,t.* from txnrecords as t) as temp
where rno in (select count(1) from txnrecords);
select txnno,custno,amount,category,product,city,state,spendby,
from_unixtime(unix_timestamp(txndate,'MM-dd-yyyy'),'yyyy-MM-dd'),
from_unixtime(unix_timestamp(txndate,'MM-dd-yyyy'))
from txnrecords;
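With txndate converted to yyyy-MM-dd as above, the most recent transaction can also be picked by sorting on the converted date. In this sketch the alias txn_dt is illustrative; if several transactions share the latest date, only one of them is returned.

```sql
select txnno, custno, amount, txn_dt
from (
  select txnno, custno, amount,
         -- yyyy-MM-dd strings sort in chronological order
         from_unixtime(unix_timestamp(txndate,'MM-dd-yyyy'),'yyyy-MM-dd') as txn_dt
  from txnrecords
) t
order by txn_dt desc
limit 1;
```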
Map Join
This is used in the star schema data model in DWH applications: the small tables (dimension tables)
are loaded into memory in all of the mappers, and the big table (fact table) is joined against them in the mapper.
This avoids the shuffling cost that is inherent in a common join. A hash table is created for each
small table (dimension table), using the join key as the hash table key.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
-- default 10mb
set hive.auto.convert.join.noconditionaltask.size = 10000000;
-- default 25mb
set hive.mapjoin.smalltable.filesize=100;
explain select /*+ MAPJOIN(a) */ a.* from customers a inner join payments b on
a.customernumber=b.customernumber;
Bucket Map Join
When the tables are large and all the tables used in the join are bucketed on the join columns, we use the Hive
Bucket Map Join feature. In this type of join, the number of buckets in one table should be a multiple of
the number of buckets in the other table.
Use it when:
All tables are large.
All tables are bucketed on the join columns.
The tables are sorted on the join columns.
The number of buckets is the same in all tables (or a multiple, as noted above).
set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing = true ;
explain select /*+ MAPJOIN(a) */ a.* from customersbuck a inner join paymentsbuck b on
a.customernumber=b.customernumber;
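The query above needs both tables to be bucketed on customernumber. A sketch of how paymentsbuck might be created and loaded; the column types and the bucket count of 4 are assumptions, and customersbuck would be built the same way with a bucket count that is a multiple of it (for example 8).

```sql
set hive.enforce.bucketing = true;

-- column types are assumed; adjust them to match the payments table
create table paymentsbuck (
  customernumber int,
  checknumber    string,
  paymentdate    string,
  amount         double
)
clustered by (customernumber) into 4 buckets;

-- loading through insert lets Hive hash the rows into the buckets
insert overwrite table paymentsbuck
select customernumber, checknumber, paymentdate, amount from payments;
```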
Skew Join
When a table has skewed data in the joining column, we use the skew join feature. A skewed table is one
in which a few values of the column occur far more often than the other values.
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;