How to Create a Dataset with Python Faker

One of the best ways to learn and practice data analysis is to work with a real-world dataset. However, there are times when it's better to create your own, because you can choose parameters that fit your needs. The Faker library lets us create such a dataset in Python, with values localized to many languages and regions.

How to Use Faker to Create a Dataset

The core idea of the Faker library is that each of its built-in functions generates a single element, such as a name, address, zip code, or occupation. Salaries and roles can also be assigned at random. The first step is to set your locale, which determines the style of the generated names and addresses. It is based on both location and language.

# import Faker, select a locale, and save the generator as a variable
from faker import Faker
fake = Faker('en_US')
# let's generate a name
fake.name()

The output is a randomly generated name such as “Robert”, typically with a last name attached, and it differs on every call.

Let’s create a first name, last name, and job, and add them to a Python dictionary.

employee = {}
employee['first_name'] = fake.first_name()
employee['last_name'] = fake.last_name()
employee['job'] = fake.job()

Output

{'first_name': 'Jacob',
 'last_name': 'Cardenas',
 'job': 'Accommodation manager'}

To create multiple employees, we need to repeat this process. Let’s use a for loop to create ten dictionaries and append them to an empty list.

# let's create an empty list to hold our employee dictionaries
employee_list = []

# let's create 10 employee dictionaries; range(10) runs the loop ten times
for i in range(10):
    employee = {}
    employee['first_name'] = fake.first_name()
    employee['last_name'] = fake.last_name()
    employee['job'] = fake.job()
    employee_list.append(employee)

Let’s use the random_element method from the Faker library to generate the roles and departments. Additionally, let’s randomize the salary with random_int.

employee["department"] = fake.random_element(elements=("IT", "HR", "Marketing", "Finance"))
employee["role"] = fake.random_element(elements=("Manager", "Developer", "Analyst", "Associate"))
employee["salary"] = fake.random_int(min=30000, max=150000, step=1000)

Lastly, we can create a data frame, which only requires passing the list to the DataFrame function from the pandas library. Let’s pull all of this together into a function that gives us everything we need.

# import the libraries we need
from faker import Faker
import pandas as pd

# create an instance of Faker
fake = Faker(locale='en_US')

# let's create a function
def create_employees(num_employees):
    # let's create an empty list to hold our employee dictionaries
    employee_list = []
    # let's create one dictionary per employee; range(num_employees) yields exactly that many
    for i in range(num_employees):
        employee = {}
        employee['first_name'] = fake.first_name()
        employee['last_name'] = fake.last_name()
        employee['job'] = fake.job()
        employee["department"] = fake.random_element(elements=("IT", "HR", "Marketing", "Finance"))
        employee["role"] = fake.random_element(elements=("Manager", "Developer", "Analyst", "Associate"))
        employee["salary"] = fake.random_int(min=30000, max=150000, step=1000)
        employee_list.append(employee)
    # convert the list of dictionaries into a data frame
    return pd.DataFrame(employee_list)

The final result is a data frame that you can use for evaluation.

Gaelim Holland
