EDA Problem Creation [Data Science]

Created by DoSelect Content Devs, Modified on Wed, 23 Apr at 7:10 PM by DoSelect Content Devs

This document provides a structured explanation of all the fields required when creating a problem on the DoSelect platform, along with the purpose of and expectations for each field.

TABLE OF CONTENTS

  1. Create New Problem

First, log in to the DoSelect platform; you will land on the home page.

Go to the Library Section, as shown in the image below.

After landing on the Library page, go to the My Company section and click the Create New Problem icon in the bottom-right corner, as shown in the image below.

Now, follow the steps accordingly.

  2. Select Problem Type

Once you click the Create New Problem icon, a window pops up asking you to fill in a few details to get started with problem creation. An image is attached below to show the pop-up window.



  • Name Field: Enter a name that is relevant to the problem you are creating. The problem name must be a minimum of 6 characters and a maximum of 45 characters.

  • Problem Type: Here, you need to choose a relevant problem type. (For Exploratory Data Analysis, choose Data Science as the Problem Type from the drop-down menu.)

  • Level: Now, you need to choose the level of the problem. Available levels are Easy, Medium, and Hard.

  • Once you are done filling in the details, click on the Create button.

Once you click the Create button, as mentioned in the previous step, a ready problem skeleton will appear on the screen, as shown in the image below. Now fill in the fields one by one to complete the problem creation.


  3. Problem Name

The problem name field indicates the name of the problem. It comes pre-filled with the name you entered in the pop-up window earlier.

You can continue with the same name or edit it as per your requirement.




  4. Expected Solving Time


The next field in line is Expected Solving Time (in minutes). Enter the time within which the candidate is expected to solve the problem. The value depends on the level of the problem; the recommended times in minutes for each level are given below.


  • Hard: 90 minutes

  • Medium: 60 minutes

  • Easy: 30 minutes



  5. Problem Description

This is the most important field in the problem creation process. Here, you need to follow a particular template.

  • First, give the heading as Problem Statement. The Problem Statement should be a story-based description of your problem. It should clearly define the Exploratory Data Analysis tasks that you expect the candidate to perform.

  • Limit the problem statement to a maximum of 100 words.

  • The Problem Statement heading should be in Heading 1 and Bold format.

  • The Problem Statement description should be in Normal format.

  • The Tasks should be in Bold and Normal format.


  6. Dataset Information

Next, you have to give information about the dataset in the Problem Description itself after the Problem Statement.

  • Mark the heading as Dataset. It should be in Heading 1 and in Bold format.

  • The first bullet point is supposed to be “You can use the given dataset for the EDA.”

  • The second bullet point should give the names of the columns in the dataset. Make sure that the column names in the dataset match with the column names provided in the Dataset Information. These column names have to be in Normal Bold Italic format.

  • The third bullet point has to give the location from where the candidate can access the dataset for performing the tasks. The location “/data/training/dataset.csv” has to be put in the Code format.

  • Please follow the image below for reference. Make sure you follow the formatting in a similar way.


  • In EDA, when you are providing the input dataset, make sure that it is in CSV format.

  • Our platform deals only with CSV files.

  • If there is only one dataset, name the file dataset.csv, zip the CSV file, and upload the zip in the dataset section as mentioned in Step 16.

  • If you have multiple input datasets, name the first dataset1.csv, the second dataset2.csv, and so on. After naming, zip all the CSV files together and upload that zipped file in the dataset section.

  • While giving the locations, you will find that ‘/data/training/’ is common for all the datasets.


Refer to the below image for only one input dataset:


Refer to the below image for multiple input datasets:
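The naming and mount-path conventions above can be sketched in Python. This is a minimal illustration of the convention only; the helper function name is our own, not part of the platform.

```python
# The datasets for an EDA problem are always mounted under /data/training/.
# A single dataset is named dataset.csv; multiple datasets are numbered
# dataset1.csv, dataset2.csv, and so on.
def dataset_path(index=None):
    """Build the mount path for a dataset; index=None is the single-dataset case."""
    if index is None:
        return "/data/training/dataset.csv"
    return f"/data/training/dataset{index}.csv"

# On the platform you would load the data with pandas, for example:
# import pandas as pd
# df = pd.read_csv(dataset_path())     # single input dataset
# df1 = pd.read_csv(dataset_path(1))   # first of several input datasets
```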



  7. Output Information

Next, provide the information about the Output after the Dataset Information, in the Description itself. This part gives the locations where the candidate has to save his/her output files, which are in CSV format.

  • First, give the heading as Output, which is supposed to be in Heading 1 and Bold format.

  • Next, mention the line “Save the output file at the given location with the following data”.

  • After this, indicate the file location for every task with the respective task above it.

  • When there is only one task, there will be only one output file. This will be saved at “/code/task.csv”. When there are multiple tasks, then the first task output file will be saved at “/code/task1.csv”, the second task output file will be saved at “/code/task2.csv” and so on.

  • Make sure that the tasks are in the Bold format and the locations are in the Code format.

Refer to the below image and do the formatting accordingly.
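The saving convention above can be sketched in Python. This is a minimal illustration assuming pandas; the helper name, column names, and values are placeholders, not platform requirements.

```python
import pandas as pd

# A single task saves to /code/task.csv; multiple tasks save to
# /code/task1.csv, /code/task2.csv, and so on.
def output_path(index=None):
    """Build the save path for a task output; index=None is the single-task case."""
    return "/code/task.csv" if index is None else f"/code/task{index}.csv"

# Hypothetical result of an EDA task (placeholder columns and values).
task1_result = pd.DataFrame({"category": ["A", "B"], "count": [10, 20]})

# Save without the pandas index so the CSV matches the expected structure:
# task1_result.to_csv(output_path(1), index=False)
```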



  8. Sample Output

After Output Information, you have to give the insights about Sample Output in the Problem Description.

  • The heading Sample Output should be in Heading 1 and Bold format.

  • Next, you need to display the Task Name, which should be Bold and Bulleted.

  • After this, you need to provide an image of the Sample Output. Make sure that the image does not have the expected output, as candidates can see this.

  • Sample Output is just to indicate the Output Data format. It will showcase the expected number of columns, names of the columns, and Output Structure.

A sample output looks similar to the image attached below.


  9. Additional Note

At last, in the description section, after the Sample Output, mention an Additional Note for every Data Science EDA problem, as shown in the image below.


  10. Difficulty Level

After you fill in the Problem Description, the Difficulty Level field is next in line. The difficulty level you chose in the pop-up window at the start appears here by default; if you want to change it, you can modify it accordingly.


  11. Scoring

The next field is Scoring. Once you choose the appropriate Difficulty for the problem, the scoring value will be assigned automatically.


  12. Maximum Re-submissions Allowed

This field appears after the Scoring field and restricts the number of submissions a candidate can make. The last submission is considered for the evaluation. If you don’t want to restrict resubmissions, enter the value zero.



  13. Evaluation Time Limit

This comes after the Maximum re-submissions allowed field. Leave this field empty, as it is not a required field.


  14. Allowed Programming Languages

The next field in line is Allowed Programming Languages. For EDA, only three languages are allowed: Python 2, Python 3, and R appear in the drop-down menu. If you select one, the candidate will have to write the code in that programming language. To allow all three languages, leave this field empty.



  15. Skill Tags

The next field is Skill Tags. Skill tags define the expertise required for the problem.

  • Discovery Tags: Help platform users find problems related to a particular skill.

  • Insight Tags: Used for the candidate’s skill analysis in the Test Report Card.


  16. Datasets

Datasets are one of the most important fields. Here, the datasets are expected in zipped format, and inside the zipped file/folder the datasets have to be in CSV format.

  • If there is only one dataset, make sure that you name it as dataset.csv in your local machine.

  • Once you do that, zip the file and upload it to the Training Dataset section.

  • If you have multiple datasets, then you need to name the dataset that is required to perform the first task as dataset1.csv, then name the dataset that is required to perform the second task as dataset2.csv, and follow this pattern accordingly.

  • Once you do that, select all the datasets and zip them in a file/folder. This zipped file/folder is supposed to be uploaded in the Training Dataset Section.


The Evaluation Dataset and Validation Dataset sections play no part in an EDA problem. However, as they are marked as required fields, you can upload the same zipped file/folder that you uploaded in the Training Dataset section. Refer to the image below for better clarity.


Please do not include more than 20 to 25 features/attributes/columns in the dataset. This will give a better experience on the platform.


  17. Stubs

Next comes the Stubs field. Here, you need to provide Python code that acts as a boilerplate for beginning the EDA.

You need to click on the ADD NEW button as shown in the image below.

As soon as you click the ADD NEW button, a window with an editor pops up. Paste your boilerplate into the editor. For an example, you can refer to the image below.
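For reference, a boilerplate along the following lines could be used. This is only a sketch under the conventions described earlier; the column name and task comments are placeholders, not the exact stub shown in the image.

```python
import os

import pandas as pd

DATASET_PATH = "/data/training/dataset.csv"

# On the platform the dataset is mounted at DATASET_PATH; the guard lets this
# stub also run locally, where the file may not exist.
df = pd.read_csv(DATASET_PATH) if os.path.exists(DATASET_PATH) else pd.DataFrame()

# TODO (candidate): perform the EDA tasks from the problem statement, e.g.:
# task1 = df.groupby("some_column").size().reset_index(name="count")
# task1.to_csv("/code/task1.csv", index=False)  # save each output without the index
```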


  18. Sample Solution

Next comes the Sample Solution field. In this field, you have to provide the Python code that solves the EDA task/tasks.

First, you need to click on the ADD NEW Button. Once you click on the ADD NEW Button, a window will pop up, and you will get to see the editor. In the editor, you have to paste your code.

Refer to the images below:
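A sample solution usually extends the stub end to end: load the dataset, perform the analysis, and save the result. The snippet below is an illustrative sketch only, not the solution from the images; the `category` column and the aggregation are hypothetical.

```python
import pandas as pd

def solve(df):
    """Hypothetical task: count rows per category, sorted by count descending."""
    return (df.groupby("category")
              .size()
              .reset_index(name="count")
              .sort_values("count", ascending=False)
              .reset_index(drop=True))

# On the platform, the solution would read the mounted dataset and save the
# output at the required location:
# df = pd.read_csv("/data/training/dataset.csv")
# solve(df).to_csv("/code/task1.csv", index=False)
```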




  19. Test Cases

The Test Cases field is next in the process. Here, the output generated by the candidate and saved at the specified location is compared with the actual expected output.

  • If there are multiple tasks in the problem statement, then you need to create multiple test cases. Each test case will be dealing with a respective task.

  • If there is only one task, then there will be only one test case.

To add a test case, first, you need to click on the ADD NEW Button.



After clicking the ADD NEW button, a window with an editor pops up. In the editor, you need to read the candidate’s output from the specified location, fetch the actual output of the task from the correct location, and perform the necessary comparison.

A sample test case is shown in the images below for your reference.
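In spirit, a test case reads the candidate’s saved CSV and the expected CSV and compares them. The function below is a hedged sketch of such a comparison, assuming pandas; the actual test case code is the one shown in the platform’s images, and `expected_url` is a placeholder for the attachment link.

```python
import pandas as pd

def compare_outputs(candidate_df, expected_df):
    """Return True when the candidate's output matches the expected output,
    requiring identical columns and values but ignoring row order."""
    if list(candidate_df.columns) != list(expected_df.columns):
        return False
    cols = list(candidate_df.columns)
    cand = candidate_df.sort_values(cols).reset_index(drop=True)
    exp = expected_df.sort_values(cols).reset_index(drop=True)
    return cand.equals(exp)

# On the platform, a test case would fetch both files and report a verdict:
# candidate = pd.read_csv("/code/task1.csv")
# expected = pd.read_csv(expected_url)  # placeholder: link from the Attachments section
# print("PASS" if compare_outputs(candidate, expected) else "FAIL")
```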




  20. Attachments

At last, you will get to see the Attachments field. This field is provided to upload the actual output/outputs of the task/tasks.

  • If there are multiple tasks in the problem statement, then you need to upload multiple attachments. Each file attachment will have the actual output of the respective task.

  • The actual output files have to be in CSV format only.

  • To cross-verify the outputs in the test cases, copy the links provided in the Attachments section after uploading the files, and use those links in the test cases to fetch the data and compare. For this, you can refer to the sample test case code provided in the images of Step 19.




This completes the EDA problem creation on the DoSelect Platform.


