1. Are there any scenarios where something could be done only through MapReduce and not Hive?
Answer). For large unstructured data sets, MapReduce is the better fit. Hive works best on structured data and can handle unstructured data only to a limited extent.
2. What are the advanced MapReduce functionalities?
1. Chaining MapReduce jobs.
2. Joining data from different sources.
3. Creating a Bloom filter.
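The Bloom filter item above can be illustrated with a small sketch. This is not Hadoop's own Bloom filter class, only a plain Python illustration of the idea; the names BloomFilter, might_contain and filter_records, and the sizes 1024 and 3, are assumptions chosen for the example. The filter is built from the keys of the small data set and then used on the map side to drop records that cannot match.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: never a false negative, sometimes a false positive."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, kept simple for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


def filter_records(records, bloom):
    """Map-side filter: keep only records whose key may be in the small set."""
    return [(key, value) for key, value in records if bloom.might_contain(key)]
```

Records that pass the filter still need an exact check later (for example in the reducer), because a Bloom filter can report a key that was never added.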
3. What is the difference between the reduce method and the reduce instance?
Answer). Reduce instance: the TaskTracker starts a separate JVM for each reduce task, and that JVM holds a single instance of the Reducer class. This instance is called the reduce instance.
Reduce method: the reduce() method that the reduce instance contains. It is called once for each key group.
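The distinction can be sketched in plain Python (not the Hadoop API): one object stands in for the reduce instance, and its reduce() method is called once per key. SumReducer and run_reduce_task are hypothetical names used only for this illustration, and the input is assumed to be already sorted by key, as it would be after the shuffle.

```python
from itertools import groupby
from operator import itemgetter


class SumReducer:
    """Stands in for a Reducer class; one object of it is one reduce instance."""

    def __init__(self):
        self.calls = 0  # how many times reduce() was invoked on this instance

    def reduce(self, key, values):
        self.calls += 1
        return key, sum(values)


def run_reduce_task(sorted_pairs):
    """Simulate one reduce task: a single instance, one reduce() call per key."""
    reducer = SumReducer()
    output = [
        reducer.reduce(key, [value for _, value in group])
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    ]
    return output, reducer.calls
```

For the input [("a", 1), ("a", 2), ("b", 5)] the single instance receives two reduce() calls, one for "a" and one for "b".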
4. Can we modify the data format and force all the data to go to one mapper? In that case, can the job still have multiple reducers?
5. What is the optimal number of reducers? Is Hadoop smart enough to decide how many reducers are required?
Answer 1). The optimal number of reducers is related to the total number of reducer slots available in the cluster. The total number of slots is found by multiplying the number of slots per node by the number of nodes in the cluster. The number of reduce slots per node is determined by the property ‘mapred.tasktracker.reduce.tasks.maximum’.
Answer 2). Hadoop does not decide this on its own. By default there is one reducer.
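The slot calculation in Answer 1 is simple arithmetic and can be written out as below. The function name total_reduce_slots and the cluster figures in the usage note are hypothetical; slots_per_node corresponds to the value of ‘mapred.tasktracker.reduce.tasks.maximum’.

```python
def total_reduce_slots(num_nodes, slots_per_node):
    """Total reducer slots = nodes in the cluster x reduce slots per node."""
    if num_nodes < 0 or slots_per_node < 0:
        raise ValueError("node and slot counts must be non-negative")
    return num_nodes * slots_per_node
```

For example, an assumed cluster of 10 nodes with 2 reduce slots each gives total_reduce_slots(10, 2) == 20, which is the upper bound on reducers that can run at the same time.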
6. For a given volume of data, how do you decide how many reducers to set?
Answer). This again depends on the cluster size and on the memory and processor capacity of the nodes in the cluster.
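7. Can two reducers talk to each other?
Answer). Not directly. Reducers in the same job run in isolation from one another. Data can only be passed from one reduce stage to the next by chaining jobs, so that the output of one job's reducers becomes the input of the following job.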
8. What is a group comparator?
Answer). When the reduce method is called on the reduce instance, it receives the list of map output values grouped by the part of the key that the group comparator compares. If no group comparator is set, grouping uses the entire map output key.
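The effect of a group comparator can be simulated in plain Python. This is a simplified illustration, not the Hadoop comparator API: reduce_calls and group_key are hypothetical names, keys are (natural key, secondary field) tuples, and each group is labelled with the grouping value rather than with the first full key as Hadoop would do.

```python
from itertools import groupby


def reduce_calls(map_output, group_key=None):
    """Return one (group, values) entry per reduce() call.

    map_output is a list of (key, value) pairs. group_key picks the part of
    the key used for grouping; by default the entire key is used.
    """
    if group_key is None:
        group_key = lambda key: key  # default: group on the whole key
    # Sort on the full key first, as the shuffle does before grouping.
    ordered = sorted(map_output, key=lambda pair: pair[0])
    return [
        (group, [value for _, value in pairs])
        for group, pairs in groupby(ordered, key=lambda pair: group_key(pair[0]))
    ]
```

With map output [(("nyc", 2), "b"), (("nyc", 1), "a"), (("sf", 1), "c")], the default grouping yields three reduce calls, while group_key=lambda k: k[0] yields two, and the "nyc" call sees its values in secondary-sorted order ["a", "b"].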
9. How do you set the number of reducers for a MapReduce job at run time?
Answer). Using the “hadoop pipes” command.
How do you set it from the command line?
Answer). With the “-D mapred.reduce.tasks=<number>” parameter.
How do you set it in a Hive job, and how do you set it in MapReduce?
Answer). In Hive, run “set mapred.reduce.tasks=<number>;” before the query. In MapReduce this can be done in two ways.
1. Using the property “mapred.reduce.tasks” in mapred-site.xml.
2. By calling setNumReduceTasks on the JobConf object.
10. Apart from the number of reducers, how do you set parameters in MapReduce from the command prompt?
Answer). Using the “hadoop job” command.
11. In a map-side join, how big can the small table be?
12. In a map-side join, if there is a memory problem, how do you overcome it?
Answer). By using the distributed cache.
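A map-side join can be sketched as follows, assuming the small table has been shipped to every mapper (for instance through the distributed cache) as lines of "key,value" text. This is a plain Python illustration rather than Hadoop code; load_small_table and map_side_join are hypothetical names, and the sketch assumes the small table fits in the mapper's memory.

```python
def load_small_table(lines):
    """Parse 'key,value' lines (as read from the cached file) into a dict."""
    table = {}
    for line in lines:
        key, value = line.strip().split(",", 1)
        table[key] = value
    return table


def map_side_join(records, small_table):
    """Join each (key, value) record against the in-memory table.

    Records without a match are dropped, which makes this an inner join.
    """
    for key, value in records:
        if key in small_table:
            yield key, value, small_table[key]
```

Because the lookup happens inside the mapper, no shuffle or reduce phase is needed for the join itself.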
13. Is it recommended to compress text files?
Answer). Yes! Compressed files are transferred over the network faster than uncompressed files.
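The size reduction behind this answer can be checked with a small sketch. It uses gzip from the Python standard library purely as an illustration; compression_ratio is a hypothetical helper, and the codec actually used on a cluster may differ.

```python
import gzip


def compression_ratio(text):
    """Return compressed size divided by original size for a text string."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress
    packed = gzip.compress(raw)
    return len(packed) / len(raw)
```

Repetitive text such as log lines compresses very well, so far fewer bytes have to cross the network; the actual ratio depends on the data and on the codec.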
14. Is LZO compression on a text file splittable?
15. Have you written a UDF?