Filter Data with Python Query

I believe there a several Python functions that every data analyst should learn. VALUE_COUNTS, GROUPBY and QUERY is also on the list. Of course all these are part of the PANDAS library. So Let’s explore the query function. This function allows for filtering data frames using a concise and readable syntax.

What is the basic query syntax

  1. import Pandas
  2. Convert your data to a dataframe
  3. Use the query function to filter your data with dataframe.query(‘contidion’)

Why You Should Use Query

Of course, you can definitely filter data using bracket or filter notation. So here are few reasons why you should use query.

  1. Readability: The query syntax is more readable, especially for those familiar with SQL.
  2. Convenience: It can make code shorter and more concise.
  3. Performance: For large datasets, query can be more efficient than traditional methods.

Let’s say you have the following DataFrame named df:

NameAgeGender
0John28Male
1Anna34Female
2Peter29Male
3Linda40Female

If you want to select only the rows where the Age is greater than 30, you can use the query function as follows:

result = df.query('Age > 30')

This will give you:

NameAgeGender
1Anna34Female
3Linda40Female

Lets Start with Some Practice Data

import pandas as pd

# Sample data
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 40],
        'Gender': ['Male', 'Female', 'Male', 'Female'],
        'Income': [70000, 80000, 120000, 110000]}

df = pd.DataFrame(data)

Traditional Dataset Filtering

We can you bracket notation to filter data easily. Fore example if we anted to know filter by both gender and income, this would require a code that mirrors the one in the below box.

filtered_df = df[(df['Gender'] == 'Female') & (df['Income'] > 80000)]

Let’s measure this against the simplicity of using the query function syntax which is much cleaner and readable.

filtered_df = df.query("Gender == 'Female' and Income > 80000")

Advanced Filtering

The query function shines in more complex scenarios. For instance, you can use variables in your queries.

Using Variables in Queries and Complex Queries

Suppose you have a variable min_income:

min_income = 80000
filtered_df = df.query("Income > @min_income")

filtered_df = df.query("Age < 35 and Gender == 'Male' | Income >= 100000")

Watch a YouTube Video on Using Query

Tips for Using the Query Function

  • String Columns: For string comparisons, make sure to use single quotes within the double-quoted query string.
  • Variables: Prefix variables with @ to use them in queries.
  • Complex Expressions: Enclose complex expressions in parentheses for clarity.

Performance Considerations

While query is efficient for large datasets, it might not always be the fastest option for very simple or small datasets due to the overhead of parsing the query string. Always profile your code to make the best choice.

As you become more comfortable with filtering data using the query function, you’ll find that it lays a strong foundation for other crucial data manipulation tasks. One key area where these skills are particularly applicable is data cleaning – an essential step in preparing your data for analysis.

For those eager to delve deeper into this topic, I highly recommend exploring further through instructional videos and tutorials. In fact, here’s a great resource to get you started on your journey into data cleaning:

Gaelim Holland

Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments