Filter Data with Python Query
I believe there a several Python functions that every data analyst should learn. VALUE_COUNTS, GROUPBY and QUERY is also on the list. Of course all these are part of the PANDAS library. So Let’s explore the query function. This function allows for filtering data frames using a concise and readable syntax.
What is the basic query syntax
- import Pandas
- Convert your data to a dataframe
- Use the query function to filter your data with dataframe.query(‘contidion’)
Why You Should Use Query
Of course, you can definitely filter data using bracket or filter notation. So here are few reasons why you should use query.
- Readability: The query syntax is more readable, especially for those familiar with SQL.
- Convenience: It can make code shorter and more concise.
- Performance: For large datasets,
query
can be more efficient than traditional methods.
Let’s say you have the following DataFrame named df
:
Name | Age | Gender | |
---|---|---|---|
0 | John | 28 | Male |
1 | Anna | 34 | Female |
2 | Peter | 29 | Male |
3 | Linda | 40 | Female |
If you want to select only the rows where the Age
is greater than 30, you can use the query
function as follows:
result = df.query('Age > 30')
This will give you:
Name | Age | Gender | |
---|---|---|---|
1 | Anna | 34 | Female |
3 | Linda | 40 | Female |
Lets Start with Some Practice Data
import pandas as pd
# Sample data
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 40],
'Gender': ['Male', 'Female', 'Male', 'Female'],
'Income': [70000, 80000, 120000, 110000]}
df = pd.DataFrame(data)
Traditional Dataset Filtering
We can you bracket notation to filter data easily. Fore example if we anted to know filter by both gender and income, this would require a code that mirrors the one in the below box.
filtered_df = df[(df['Gender'] == 'Female') & (df['Income'] > 80000)]
Let’s measure this against the simplicity of using the query function syntax which is much cleaner and readable.
filtered_df = df.query("Gender == 'Female' and Income > 80000")
Advanced Filtering
The query
function shines in more complex scenarios. For instance, you can use variables in your queries.
Using Variables in Queries and Complex Queries
Suppose you have a variable min_income
:
min_income = 80000
filtered_df = df.query("Income > @min_income")
filtered_df = df.query("Age < 35 and Gender == 'Male' | Income >= 100000")
Watch a YouTube Video on Using Query
Tips for Using the Query Function
- String Columns: For string comparisons, make sure to use single quotes within the double-quoted query string.
- Variables: Prefix variables with
@
to use them in queries. - Complex Expressions: Enclose complex expressions in parentheses for clarity.
Performance Considerations
While query
is efficient for large datasets, it might not always be the fastest option for very simple or small datasets due to the overhead of parsing the query string. Always profile your code to make the best choice.
As you become more comfortable with filtering data using the query
function, you’ll find that it lays a strong foundation for other crucial data manipulation tasks. One key area where these skills are particularly applicable is data cleaning – an essential step in preparing your data for analysis.
For those eager to delve deeper into this topic, I highly recommend exploring further through instructional videos and tutorials. In fact, here’s a great resource to get you started on your journey into data cleaning: