Forum

    Paresh sahare

    file name :- Big data and Hadoop Questions1

1. Difference between Mapper and MapTask?
Ans :-
-Mapper – Mapper is one of the classes used in a MapReduce program (a minimal Mapper sketch is given below).
The main task of the Mapper class is to read data from the input location and, based on the input format, generate key-value pairs
that form the intermediate output on the local machine.
-MapTask – It is one of the entities the JobTracker creates to process the data. It processes one split (by default one block) at a time,
so the number of MapTask instances equals the number of splits.
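
A minimal Mapper sketch, word-count style; the class name and logic are illustrative, not from the original post:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Mapper: reads one line at a time and emits (word, 1) pairs,
// which become the intermediate output written on the local machine.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(token);
            context.write(word, ONE);   // one key-value pair per word
        }
    }
}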

    ================================================================================================================================================

2. Anatomy of a MapReduce job run?
Ans :-
How it works on the back end:
1) The MapReduce program runs the job from the client JVM.
2) The client sends a request to the JobTracker and gets a new job ID from it.
3) The job client copies whatever resources the job needs (job JAR, configuration, computed input splits) to HDFS.
4) The client submits the job to the JobTracker.
5) (Initialize job) The JobTracker initializes the job and assigns tasks to the TaskTrackers; it creates 2 kinds of entities, namely MapTask and ReduceTask.
A MapTask executes only one split (by default one block) at a time, hence we need multiple instances of the MapTask, and their number is defined
by the logical split size.
6) The JobTracker retrieves the input splits from HDFS.
7) A TaskTracker node has two parts:
1) TaskTracker :- the TaskTracker monitors the tasks and collects their progress reports.
2) Child JVM :- runs the task and sends its progress report to the TaskTracker, which then sends it on to the JobTracker.
8) After completion of the job, the JobTracker sends a notification to the client.
(A minimal client-side sketch of steps 1-4 follows below.)
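
A minimal sketch of the client-side driver behind steps 1-4, assuming a word-count-style job; the class names (WordCountDriver, WordMapper) are illustrative, not from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: this is the part that runs in the client JVM, asks the
// JobTracker for a job ID, ships the JAR and configuration to HDFS, and then
// submits the job and waits for it to complete.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // steps 1-2: client JVM + job ID
        job.setJarByClass(WordCountDriver.class);     // step 3: resources copied to HDFS
        job.setMapperClass(WordMapper.class);         // WordMapper from the sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);   // step 4 onward: submit and wait
    }
}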

    ================================================================================================================================================

3. Where do we define a split size greater than the block size?
Ans :-
The split size is computed as max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size)), so to get a split larger than the block size
you set mapred.min.split.size to a value bigger than the block size; mapred.max.split.size only caps the split size from above.
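
A small driver-side sketch of these settings, assuming the old-style property names used above (newer releases expose the same knobs as mapreduce.input.fileinputformat.split.minsize / split.maxsize, and FileInputFormat has matching helpers); the values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for splits of at least 256 MB even when the block size is 128 MB.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);   // lower bound on split size
        conf.setLong("mapred.max.split.size", 512L * 1024 * 1024);   // upper bound on split size

        Job job = new Job(conf, "big-split job");
        // The new API helpers have the same effect:
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
    }
}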

    ================================================================================================================================================

4. Difference between the old API and the new API?
Ans :-
Old API (org.apache.hadoop.mapred) :-
1) We can set the number of map tasks by using JobConf.setNumMapTasks(int).
2) The map/reduce output is submitted through collector.collect(key, value).
New API (org.apache.hadoop.mapreduce) :-
1) We don't have such a method; the number of map tasks is driven by the input splits.
2) Here we emit output with context.write(key, value).
(A short side-by-side sketch follows below.)
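
A minimal illustrative sketch of the same trivial map step in both APIs (two separate source files; the class names are made up for this example), just to show collect() versus write():

// --- OldApiMapper.java (old API, org.apache.hadoop.mapred) ---
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Old API: results are emitted through the OutputCollector.
        output.collect(new Text(value.toString()), new IntWritable(1));
    }
}

// --- NewApiMapper.java (new API, org.apache.hadoop.mapreduce) ---
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // New API: results are emitted through the Context.
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}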

    ================================================================================================================================================

5. What are shuffling and sorting?
Ans :- Shuffling :- Shuffling is the process of transferring data from the mappers to the reducers. It is necessary for the reducers:
without it they would not have any input (or would not receive input from every mapper).
Sorting :- Sorting saves time for the reducer, helping it easily distinguish when a new reduce call should start. To put it simply, it
starts a new reduce call when the next key in the sorted input data is different from the previous one.

    ================================================================================================================================================

    file name :-Big data and Hadoop Questions2

1. How is a Hadoop transaction different from an Oracle transaction, and how is the read/write anatomy of HDFS different from Oracle databases?
Ans :-
In an Oracle transaction
------------------------
Extended by:
HTTP, FTP, SMTP
Managed by:
subclasses, tie classes, servlets and JavaServer Pages
Stored by:
Repository :- the repository is a single mechanism that performs the task of transforming the contents of database rows and tables into objects
such as files, folders, users, and groups. The Java API is used to access the database.

In a Hadoop transaction
-----------------------
Work happens with the help of the Hadoop daemons.
There are five daemons, and each of these daemons runs in its own JVM.
3 Hadoop daemons run on the master nodes:
-NameNode – this daemon stores and maintains the metadata for Hadoop HDFS.
-Secondary NameNode – performs housekeeping functions for the NameNode.
-JobTracker – manages MapReduce jobs, distributes individual tasks to the machines running the TaskTracker.
2 Hadoop daemons run on the slave nodes:
-DataNode – stores the actual HDFS data blocks.
-TaskTracker – runs the individual map and reduce tasks assigned by the JobTracker.

    ======================================================================================================================================================

2) How is data transferred from one DataNode to another DataNode?
Ans :-
Every DataNode keeps sending a heartbeat to the NameNode every 3 seconds to inform it that it is alive
(the interval is configured inside hdfs-default.xml; 3 seconds is the default). Whenever a DataNode stops working, it no longer sends heartbeats,
and the NameNode considers that DataNode to be dead. On DataNode death, to maintain the replication factor, the blocks that were on that node are
copied to other active DataNodes in the cluster.

Cases in which this happens:
1) Data disk failure
2) Cluster rebalancing
3) Heartbeat reports

    ======================================================================================================================================================

3) How do nodes communicate with each other?
Ans :-
1) NameNode and DataNode communicate in 3 ways:
1) Periodic heartbeat :- the DataNode sends heartbeat signals to the NameNode every 3 seconds.
2) Periodic block report :- every 10th heartbeat (depending on configuration), the DataNode sends a block report to the NameNode.
3) Completion of a replica write :- when the DataNode has successfully written a replica, it reports this event through an immediate blockReceived() message.

2) DataNode and DataNode communicate with each other:
DataNodes communicate with each other to rebalance data, to move and copy blocks around, and to keep the replication high.

    ======================================================================================================================================================

4) What is a logical split?
Ans :-
1) A logical split is a temporary, logical division of the input data.
2) The client defines the logical split.
3) It represents the data that is to be processed by MapReduce.

    ======================================================================================================================================================

5) If a DataNode goes down, how are the blocks of that DataNode shifted to another active DataNode?
Ans :-
Say we have DataNode1, DataNode2, DataNode3, DataNode4 and DataNode5, and
each DataNode sends a heartbeat to the NameNode every 3 seconds.
If DataNode1 goes down,
DataNode1 no longer sends heartbeat messages to the NameNode, so the NameNode marks it as dead.
A DataNode death may drop blocks below the replication factor (which we set in the configuration, inside hdfs-site.xml / hdfs-default.xml),
so re-replication kicks in: whatever data was present on DataNode1 is copied, block by block, to other active DataNodes.

    ======================================================================================================================================================

6. What are the major components of the algorithm used by the NameNode to allocate the locations of blocks on different DataNodes?
Ans :-
1) The NameNode allocates the configured number of block replicas on DataNodes and stores that information using the BlockManager.
2) The client uses the ClientProtocol call getBlockLocations; FSNameSystem.java uses INodeDirectory.java to locate the block numbers based on the file name,
the offset, and the length. BlockManager.java locates the DataNodes on which the blocks are replicated, and a list of DataNodes is returned to the client.

    ======================================================================================================================================================

7) How do we scale the cluster configuration for particular data?
Ans :-
1. General configuration tips
-----------------------------
a) Network setup
b) Backup:
-FileSystem image data is managed by the NameNode (dfs.name.dir : latest image + edit logs)
and by the secondary NameNode (fs.checkpoint.dir : latest image).
c) Replication:
-Inside hdfs-site.xml, change the dfs.replication property.
d) Block size:
-Inside hdfs-site.xml, change the dfs.block.size property.
e) Memory and task calculation:
Setting the task count is essential for a well-functioning cluster.
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum together specify how many tasks can run on one node at a time.

2. Required configuration
-------------------------
(set inside the Hadoop site configuration files)
conf/core-site.xml
------------------
-hadoop.tmp.dir
-fs.default.name
-fs.checkpoint.dir
conf/mapred-site.xml
--------------------
-mapred.tasktracker.reduce.tasks.maximum
-mapred.tasktracker.map.tasks.maximum
-mapred.system.dir
-mapred.local.dir
-mapred.job.tracker
conf/hdfs-site.xml
------------------
-dfs.data.dir
-dfs.name.dir
-dfs.datanode.du.reserved

3. Optional configuration
-------------------------
For example:
-dfs.replication
-fs.trash.interval
-mapred.map.tasks.speculative.execution
-mapred.tasktracker.dns.interface
-dfs.datanode.dns.interface
-dfs.permissions

4. Tuning configuration
-----------------------
For example:
-mapred.job.reuse.jvm.num.tasks
-io.sort.mb
-io.sort.factor
-dfs.block.size
-dfs.datanode.max.xcievers
-io.file.buffer.size
(A small client-side sketch of a few of these settings follows below.)
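
A hedged sketch of overriding a few of these per job from client code (the values are illustrative; in practice most of them live in the XML files listed above):

import org.apache.hadoop.conf.Configuration;

public class TuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Illustrative values only - sensible numbers depend on the data and the hardware.
        conf.setInt("dfs.replication", 3);                     // number of block replicas
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);    // 128 MB blocks
        conf.setInt("io.sort.mb", 200);                        // map-side sort buffer in MB
        conf.setInt("io.sort.factor", 50);                     // streams merged at once
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);     // reuse child JVMs without limit
        System.out.println("dfs.block.size = " + conf.get("dfs.block.size"));
    }
}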

    ======================================================================================================================================================

9) Write the algorithm by which the NameNode decides the location of the DataNodes that will store the blocks of a file.
Ans :-
The client sends a request to the NameNode, and the NameNode decides which DataNodes will store the blocks of the file.
The client uses the ClientProtocol call getBlockLocations; FSNameSystem.java uses INodeDirectory.java to locate the block numbers based on the file name,
the offset, and the length. BlockManager.java locates the DataNodes on which the blocks are replicated. Returned to the client is a list of DataNode
locations for storing the blocks of the file.

    ======================================================================================================================================================

10) What is the job of OutputCollector?
Ans:-
OutputCollector (old API) collects the <key, value> pairs output by the Mapper or the Reducer; its two type parameters define the output key and value types.
Syntax –
collect(key, value)
key – the key to collect.
value – the value to collect.

    ======================================================================================================================================================

11) Where do we use a split size greater than the block size?
Ans:-
Set mapred.min.split.size (and, if needed, mapred.max.split.size) to a value larger than the block size.
Alternatively, try editing hdfs-site.xml and changing the dfs.block.size property itself.

    ======================================================================================================================================================-

12) Write down the difference between Mapper and MapTask.
Ans :-
-Mapper – Mapper is one of the classes used in a MapReduce program.
The main task of the Mapper class is to read data from the input location and, based on the input format, generate key-value pairs
that form the intermediate output on the local machine.
-MapTask – It is one of the entities the JobTracker creates to process the data. It processes one split (by default one block) at a time.

    ======================================================================================================================================================

13) What is the job of the Combiner?
Ans :-
The job of a Combiner is to summarize the output records of the Mapper that share the same key. A Combiner is just like
a small reducer which runs on every node: it takes the inputs from the Mapper and thereafter passes the summarized output key-value pairs to the Reducer.
Combiner-related tuning is of two types:
1) Performance tuning – performed by the developer
2) Configuration tuning – performed by the admin
Both are used for faster processing and better performance.

    ======================================================================================================================================================
14) Write a MapReduce program to introduce a Combiner class.
    Ans :-
The mapper emits (unit, "salary:1") pairs; the combiner merges the partial sums and counts locally, and the reducer computes the final average, so the job gives the same result whether or not the combiner runs.

package Combiner;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AverageSalary {

    // Mapper: reads "name,unit,salary" lines and emits (unit, "salary:1").
    // The value carries a partial sum and a count, so the Combiner can be
    // applied any number of times without changing the final average.
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] empDetails = value.toString().split(",");
            Text unitKey = new Text(empDetails[1]);
            Text salaryValue = new Text(empDetails[2] + ":1");
            context.write(unitKey, salaryValue);
        }
    }

    // Combiner: merges the partial "sum:count" pairs produced on one node.
    // Its input and output types match the Mapper output, as Hadoop requires.
    public static class Combiner extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (Text value : values) {
                String[] parts = value.toString().split(":");
                sum += Double.parseDouble(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            context.write(key, new Text(sum + ":" + count));
        }
    }

    // Reducer: merges the remaining partial sums and counts and emits the average.
    public static class Reduce extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (Text value : values) {
                String[] parts = value.toString().split(":");
                sum += Double.parseDouble(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            context.write(key, new DoubleWritable(sum / count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: AverageSalary <in> <out>");
            System.exit(-1);
        }
        Job job = new Job(conf, "Average salary");
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setJarByClass(AverageSalary.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Combiner.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : -1);
    }
}

    ======================================================================================================================================================

15) Write an algorithm for a Partitioner class over (name, testscore, mode (weekend and weekdays)):
1. Which student scored highest on weekend and on weekdays?
2. Which student, fresher or experienced, from weekdays and weekend scored highest?

Ans :-

Something like this (a Java sketch of the partitioner itself follows at the end):
1)
Name testscore mode profile

ajx 50 weekend fresher
sudu 60 weekdays experienced
lalu 40 weekend experienced
kuku 50 weekdays fresher

Take one record, for example

ajx 50 weekend fresher

Input
-----
put the fields in an array

ajx 50 weekend fresher
[0] [1] [2] [3]

Output collector
----------------
//the mapper emits a key, value pair where the key is the testscore and the value is the other information (name, mode, profile)
[1], [0][2][3]

Partitioning, shuffling and sorting
-----------------------------------
Partition 0
(testscore < 80)

Reducer
-------
(mode == weekdays)
k1 list[v1]
k2 list[v2]

Output
------

2)
Name testscore mode profile

ajx 50 weekend fresher
sudu 60 weekdays experienced
lalu 40 weekend experienced
kuku 50 weekdays fresher

Take one record, for example

ajx 50 weekend fresher

Input
-----
put the fields in an array

ajx 50 weekend fresher
[0] [1] [2] [3]

Output collector
----------------
//the mapper emits a key, value pair where the key is the testscore and the value is the other information (name, mode, profile)
[1], [0][2][3]

Partitioning, shuffling and sorting
-----------------------------------
Partition 0
(testscore < 80)

Reducer
-------
(mode == weekdays && profile == experienced)

k1 list[v1]
k2 list[v2]

Output
------
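
A minimal Java sketch of such a partitioner, assuming the mapper emits the test score as an IntWritable key and "name,mode,profile" as a comma-separated Text value (the class name and field positions are illustrative): it routes weekend records to reducer 0 and weekdays records to reducer 1, so each reducer can then pick its highest score and, for question 2, split it further by profile.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: partition 0 = weekend records, partition 1 = weekdays records.
// Assumes the mapper emitted (testscore, "name,mode,profile").
public class ModePartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable testScore, Text value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0;                   // single reducer: everything goes to partition 0
        }
        String mode = value.toString().split(",")[1];
        return "weekend".equalsIgnoreCase(mode) ? 0 : 1;
    }
}

The driver would register it with job.setPartitionerClass(ModePartitioner.class) and job.setNumReduceTasks(2).
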
    ===========================================================================================================================================

    file name :-hadoop questions

1) What is HDFS ?

-HDFS (the Hadoop Distributed File System) is Hadoop's storage layer: a framework for storing and managing large amounts of data as blocks replicated across the cluster, so that it can be processed quickly.

2) Which is the single point of failure in a Hadoop cluster ?

-In Hadoop, the JobTracker and the NameNode are the single points of failure in a Hadoop cluster.

In Hadoop 1.x, this means that if the JobTracker or the NameNode dies, then all the applications in a running
state will be lost.

3) Which component stores the metadata of the actual data stored ?

-The NAME NODE stores the metadata information and the edit log. The metadata contains the addresses of the
block locations on the DataNodes; this information is used by file read and write operations to access
the blocks in an HDFS cluster.

4) What component is responsible for data storage ?

- The DataNodes are responsible for data storage. The client sends a request to the NameNode, which tells it on which DataNodes to store the data.

5) TaskTracker and DataNode - are they slaves or masters ?

-The TaskTracker and the DataNode are slave nodes.
Slave nodes are where Hadoop data is stored and where data processing takes place.

6) The client writes to all the DataNodes - true or false ?

-False. The client writes each block to only one DataNode directly. That DataNode then forwards the block to the next DataNode
in the pipeline. This cycle repeats for each block.

7) The Secondary NameNode acts as a backup of the NameNode - true or false ?
-The Secondary NameNode acts as a checkpointing helper for the metadata, not as a backup of the NameNode. Hence false.

8) What is the difference between standalone mode and pseudo-distributed mode?

-In standalone mode, Hadoop is configured to run in a non-distributed way, as a single Java process, whereas in pseudo-distributed mode Hadoop
runs on a single node with each Hadoop daemon running in a separate Java process.
-Standalone mode is good for developing and debugging MapReduce programs, whereas pseudo-distributed mode is good for a testing environment.

9) What is the difference between pseudo-distributed mode and fully distributed mode?

-In pseudo-distributed mode, Hadoop runs on a single node (one node acts as Master Node / Data Node / Job
Tracker / Task Tracker), with each Hadoop daemon running in a separate Java process, whereas in fully distributed mode the data is
distributed across many nodes and different nodes are used as Master Node / Data Node / Job Tracker / Task Tracker.
-Pseudo-distributed mode is good for a testing environment; fully distributed mode is good for production.

    file name :- HDFS DUMP QUESTIONS

    HDFS:-
1. Name the most common InputFormats defined in Hadoop. Which one is the default?
Ans :-
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
TextInputFormat is the Hadoop default.

    ==================================================================================================================

3. What are some typical functions of the JobTracker?
Ans :-
1) It accepts jobs from clients.
2) It communicates with the NameNode to determine the location of the data.
3) The JobTracker gives tasks to the TaskTrackers.

    ==================================================================================================================

4. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
Ans :- It will restart the task, preferably on a different TaskTracker; the job fails only if the same task fails more than the configured number of attempts (4 by default).

    ==================================================================================================================

5. What is Hadoop Streaming?
Ans :-
Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any
programming language that can accept standard input and produce standard output. It could be Perl, Python or Ruby
and does not necessarily have to be Java. However, deeper customization of MapReduce can only be done using Java and not any other
programming language.

    ==================================================================================================================

    6. What is Distributed Cache in Hadoop?
    Ans :-
    DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.)
    needed by applications.

    ===================================================================================================================

7. Is it possible to have Hadoop job output in multiple directories? If yes, how?
Ans:- Yes.
The MultipleOutputs class simplifies writing output data to multiple outputs (a short MultipleOutputs sketch follows below).

For the input side, the counterpart is MultipleInputs, which lets one job read from multiple input paths with different mappers:

MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, MyMapper.class);
MultipleInputs.addInputPath(job, inputPath2, TextInputFormat.class, MyOtherMapper.class);
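
A hedged sketch of the MultipleOutputs side (new API; the class and directory names are illustrative): the reducer writes each record under a sub-directory derived from the key, so the job output ends up spread over multiple directories beneath the job's output path.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Illustrative reducer: sums the values per key and writes each result under
// a sub-directory named after the key, e.g. <output>/weekend/part-r-00000.
public class SplitByKeyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // The third argument is a base output path, relative to the job output directory.
        mos.write(key, new IntWritable(sum), key.toString() + "/part");
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}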

    ===================================================================================================================

8. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it overwrite it?
Ans:- No, it will not overwrite it: the Hadoop job will throw an exception complaining that the output directory already exists, and exit.

    ===================================================================================================================

10. How did you debug your Hadoop code?
Ans:-
With the help of the web interface (the JobTracker / TaskTracker web UI), together with the task logs (stdout, stderr, syslog) and the job counters.

    ===================================================================================================================

    file name :- MAPREDUCE Q & A

1. What is Hadoop MapReduce?
Ans :-
MapReduce is a framework for processing huge amounts of data in parallel, typically written in Java.
The MapReduce algorithm has two important tasks:
1) The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
2) The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.

    ==============================================================================================================================================

2. Explain what a Combiner is and when you should use a Combiner in a MapReduce job.
    Ans :-
    A ‘Combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the
    output to the reducer. Combiners help in enhancing the efficiency of MapReduce by reducing the quantum of data that is required to be sent to
    the reducers.

    ==============================================================================================================================================

3. What happens when a DataNode fails?

Ans :-
Every DataNode keeps sending a heartbeat to the NameNode every 3 seconds to inform it that it is alive
(the interval is configured inside hdfs-default.xml; 3 seconds is the default). Whenever a DataNode stops working, it no longer sends heartbeats,
and the NameNode considers that DataNode to be dead. On DataNode death, to maintain the replication factor, the blocks that were on that node are
copied to other active DataNodes in the cluster.

    ==============================================================================================================================================

4. Explain what Speculative Execution is.
Ans :-
-One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest
of the program. Speculative execution launches duplicate copies of such slow tasks on other nodes and uses whichever copy finishes first.

-Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the
mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution options to false (a one-line sketch follows below).
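
A minimal sketch of turning it off from the driver, assuming the old-style property names above (newer releases call them mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.hadoop.conf.Configuration;

public class DisableSpeculation {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Turn speculative execution off for both the map and the reduce phase.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        System.out.println("map speculation = " + conf.get("mapred.map.tasks.speculative.execution"));
    }
}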

    ==============================================================================================================================================

5. Explain what the basic parameters of a Mapper are.
Ans :- The four type parameters: input key, input value, output key and output value, e.g. Mapper<LongWritable, Text, Text, IntWritable>.

    ==============================================================================================================================================

6. Explain what the function of the MapReduce Partitioner is.
Ans :-
-We use
int getPartition(K key, V value, int numReduceTasks)
-This function is responsible for returning the partition number for a record; the number of reducers, fixed when starting the job, is available in the
numReduceTasks variable, as seen in the HashPartitioner.
-If you do not set a Partitioner class, HashPartitioner is used by default.
(A sketch of what the default HashPartitioner does follows below.)
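
For reference, the default HashPartitioner essentially does the following (a minimal re-implementation sketch, not the library source):

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default behaviour: spread keys over the reducers by hash code.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the modulus.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}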

    ==============================================================================================================================================

7. Explain the difference between an input split and an HDFS block.
Ans :-

1) The client-defined, logical division of the data is known as a split.
2) The physical division of the data is known as an HDFS block.

    ==============================================================================================================================================

8. Explain what happens in TextInputFormat.
Ans :-

In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line.
For instance, key: LongWritable, value: Text.

    ==============================================================================================================================================

9. Mention the main configuration parameters that the user needs to specify to run a MapReduce job.
    Ans :-
    The job’s input location(s) in the distributed file system.
    The job’s output location in the distributed file system.
    The input format.
    The output format.
    The class containing the map function.
The class containing the reduce function (optional).
    The JAR file containing the mapper and reducer classes and driver classes.

    ==============================================================================================================================================
10. Explain what conf.setMapperClass does.

Ans :-
conf.setMapperClass sets the Mapper class for the job and, with it, all the stuff related to the map phase, such as reading the data and generating a key-value pair out of
the mapper.
    ==============================================================================================================================================

    file name :- SQOOP Q & A

1) What are the different databases Sqoop can support?
Ans:- 1) MySQL
2) Oracle
3) PostgreSQL
4) HSQLDB
5) IBM Netezza
6) Teradata

    =========================================================================================================================

2) What is needed to connect Sqoop to the database?
Ans:- A connector (together with the matching JDBC driver) is needed; without a connector we cannot connect to databases like MySQL or Oracle.
Hadoop Sqoop is a data transfer tool between relational databases and Hadoop.

    =========================================================================================================================

3) What is the default extension of the files produced from a Sqoop import using the parameter --compress?
Ans:- With --compress (-z), Sqoop compresses the imported files with gzip by default, so they get the .gz extension. For example:
bin/sqoop import --connect jdbc:mysql://192.168.243.1/Softech --table Employee --username root -P --compress --target-dir /sqoopOut1 -m 1

    =========================================================================================================================

4) During a Sqoop import, records can be stored as:
Ans:- All records are stored as text data in text files, or as binary data in Avro or Sequence files.

    =========================================================================================================================

5) Why do we create a primary key?
Ans:-
1) You need it to join your table with other tables.
2) It gives each row a unique identification.
3) Sqoop also uses the primary key as the default split column to divide the import work among the mappers.

    =========================================================================================================================
