How to Create a Dataset with Python Faker
One of the best ways to learn and practice your analysis is to use a real-world dataset. However, there are times when it's better to create your own dataset. In that case, you can specify the parameters that fit your needs. We can use the Faker library to create a dataset in many languages and locales.
How to Use Faker to Create a Dataset
The fundamental idea of the Faker library is that we can use its native functions to create a single element such as a name, address, zip code, or occupation. Salaries and roles can also be assigned randomly. The first step is to set your locale. The locale determines the region and language that the generated names and addresses are drawn from.
# let's import Faker, select a locale, and save the instance as a variable
from faker import Faker
fake = Faker('en_US')
# let's generate a name
fake.name()
The output is a randomly generated full name (a first name such as "Robert" followed by a last name), and it changes on each run.
Let's create a first name, last name, and job, and add them to a Python dictionary.
employee = {}
employee['first_name'] = fake.first_name()
employee['last_name'] = fake.last_name()
employee['job'] = fake.job()
Output
{'first_name': 'Jacob', 'last_name': 'Cardenas', 'job': 'Accommodation manager'}
Now, to create multiple employees, we repeat the process in a loop. Let's use a for loop to create ten dictionaries and append them to an empty list.
# let's create an empty list to add our employee dictionaries
employee_list = []
# let's create 10 dictionaries of employees
for i in range(10):
    employee = {}
    employee['first_name'] = fake.first_name()
    employee['last_name'] = fake.last_name()
    employee['job'] = fake.job()
    employee_list.append(employee)
Let's use the random_element method from the Faker library to pick a department and a role for each employee. Additionally, let's randomize the salary with random_int. These lines go inside the loop, before the dictionary is appended to the list.
employee["department"] = fake.random_element(elements=("IT", "HR", "Marketing", "Finance"))
employee["role"] = fake.random_element(elements=("Manager", "Developer", "Analyst", "Associate"))
employee["salary"] = fake.random_int(min=30000, max=150000, step=1000)
Lastly, we can create a data frame by passing the list of dictionaries to the DataFrame constructor from the pandas library. Let's also pull everything together into a function that gives us everything we need.
# import the libraries we need
from faker import Faker
import pandas as pd
# create an instance of Faker
fake = Faker(locale='en_US')
# let's create a function that returns a data frame of fake employees
def create_employees(num_employees):
    # let's create an empty list to add our employee dictionaries
    employee_list = []
    # let's create one dictionary per employee
    for _ in range(num_employees):
        employee = {}
        employee['first_name'] = fake.first_name()
        employee['last_name'] = fake.last_name()
        employee['job'] = fake.job()
        employee["department"] = fake.random_element(elements=("IT", "HR", "Marketing", "Finance"))
        employee["role"] = fake.random_element(elements=("Manager", "Developer", "Analyst", "Associate"))
        employee["salary"] = fake.random_int(min=30000, max=150000, step=1000)
        employee_list.append(employee)
    return pd.DataFrame(employee_list)
The final result is a data frame that you can use to practice your analysis.