This topic contains 1 reply, has 2 voices, and was last updated by Sankhabrata Burman 4 months, 3 weeks ago.

Viewing 2 posts - 1 through 2 (of 2 total)
  • Author
  • #3037 Reply


    Blocks are a physical division of the data, while input splits are a logical division. One input split can map to multiple physical blocks.
    When Hadoop submits a job, it splits the input data logically, and each split is processed by one Mapper task.
    The number of Mappers is equal to the number of splits.
    One important thing to remember is that an InputSplit doesn't contain the actual data, only a reference (storage locations) to the data.

    A split basically has two things:

    a length in bytes and a set of storage locations, which are just hostname strings.
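    To make that concrete, here is a minimal sketch of a split as "length plus locations". This is a hypothetical class for illustration, not the real org.apache.hadoop.mapreduce.InputSplit:

```java
// Hypothetical stand-in for an input split: it holds no data,
// only a length in bytes and the hostnames where the blocks live.
public class SimpleSplit {
    private final long length;        // length of the split in bytes
    private final String[] locations; // hostnames holding the underlying blocks

    public SimpleSplit(long length, String[] locations) {
        this.length = length;
        this.locations = locations;
    }

    public long getLength() { return length; }
    public String[] getLocations() { return locations; }

    public static void main(String[] args) {
        // A 128 MB split whose blocks are replicated on three datanodes
        SimpleSplit split = new SimpleSplit(128L * 1024 * 1024,
                new String[] {"datanode1", "datanode2", "datanode3"});
        System.out.println(split.getLength() + " bytes on "
                + split.getLocations().length + " hosts");
    }
}
```

    The scheduler only needs these two fields: the length to size the work, and the hostnames to place the Mapper near the data.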

    Block size and split size are both customizable. The default block size is 64 MB (128 MB from Hadoop 2.x onward), and the default split size is equal to the block size.

    1 data set = 1…n files; each file = 1…n blocks

    1 mapper = 1 input split = 1…n blocks

    InputFormat.getSplits() is responsible for generating the input splits; each split is then used as the input for one mapper.

    By default, one input split is created for each HDFS block.
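    Since the default is one split per block, the mapper count for a file is just the ceiling of fileSize / blockSize. A small sketch of that arithmetic (a hypothetical helper, not a Hadoop API):

```java
// Illustrates the default "one split per block" rule:
// number of mappers = ceil(fileSize / blockSize).
public class SplitCount {
    static long numSplits(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) return 0;
        // Ceiling division without floating point
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64 MB default block size
        long fileSize  = 200L * 1024 * 1024; // a 200 MB input file
        // 200 MB over 64 MB blocks -> 4 splits -> 4 mapper tasks
        System.out.println(numSplits(fileSize, blockSize)); // prints 4
    }
}
```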

    #5597 Reply

    Sankhabrata Burman

    As users, we don’t need to deal with InputSplits directly, as they are created by an InputFormat
    (the InputFormat is responsible for creating the InputSplits and dividing them into records).
    FileInputFormat, by default, breaks a file into 128 MB chunks (the same size as an HDFS block). We can control this value
    by setting the mapred.min.split.size parameter in mapred-site.xml, or by overriding the parameter in the Job object used
    to submit a particular MapReduce job. We can also control how the file is broken up into splits by writing a custom InputFormat.
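    The way the minimum (and maximum) split size interacts with the block size can be sketched as follows. This mirrors the max(minSize, min(maxSize, blockSize)) rule used by FileInputFormat, but it is a standalone illustration, not the real Hadoop method:

```java
// Sketch of how the effective split size is derived from the block size
// and the configured min/max split sizes: max(minSize, min(maxSize, blockSize)).
public class ComputeSplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb);        // 128
        // Raising the minimum split size above the block size forces larger splits
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
    }
}
```

    This is why, with default settings, split size simply equals block size: the minimum is tiny and the maximum is effectively unbounded, so the block size wins.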

Reply To: 2) Which function is called for splitting the user data into blocks?