How do I select rows from a DataFrame based on column values?

How can I select rows from a DataFrame based on values in some column in Pandas?

In SQL, I would use:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>SELECT *
FROM table
WHERE column_name = some_value
</code>
<code>SELECT * FROM table WHERE column_name = some_value </code>
SELECT *
FROM table
WHERE column_name = some_value

0

To select rows whose column value equals a scalar, some_value, use ==:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.loc[df['column_name'] == some_value]
</code>
<code>df.loc[df['column_name'] == some_value] </code>
df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.loc[df['column_name'].isin(some_values)]
</code>
<code>df.loc[df['column_name'].isin(some_values)] </code>
df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
</code>
<code>df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)] </code>
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python’s operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df['column_name'] >= A & df['column_name'] <= B
</code>
<code>df['column_name'] >= A & df['column_name'] <= B </code>
df['column_name'] >= A & df['column_name'] <= B

is parsed as

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df['column_name'] >= (A & df['column_name']) <= B
</code>
<code>df['column_name'] >= (A & df['column_name']) <= B </code>
df['column_name'] >= (A & df['column_name']) <= B

which results in a Truth value of a Series is ambiguous error.


To select rows whose column value does not equal some_value, use !=:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.loc[df['column_name'] != some_value]
</code>
<code>df.loc[df['column_name'] != some_value] </code>
df.loc[df['column_name'] != some_value]

The isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df = df.loc[~df['column_name'].isin(some_values)] # .loc is not in-place replacement
</code>
<code>df = df.loc[~df['column_name'].isin(some_values)] # .loc is not in-place replacement </code>
df = df.loc[~df['column_name'].isin(some_values)] # .loc is not in-place replacement

For example,

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
</code>
<code>import pandas as pd import numpy as np df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(), 'B': 'one one two three two two one three'.split(), 'C': np.arange(8), 'D': np.arange(8) * 2}) print(df) # A B C D # 0 foo one 0 0 # 1 bar one 1 2 # 2 foo two 2 4 # 3 bar three 3 6 # 4 foo two 4 8 # 5 bar two 5 10 # 6 foo one 6 12 # 7 foo three 7 14 print(df.loc[df['A'] == 'foo']) </code>
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code> A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>print(df.loc[df['B'].isin(['one','three'])])
</code>
<code>print(df.loc[df['B'].isin(['one','three'])]) </code>
print(df.loc[df['B'].isin(['one','three'])])

yields

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
</code>
<code> A B C D 0 foo one 0 0 1 bar one 1 2 3 bar three 3 6 6 foo one 6 12 7 foo three 7 14 </code>
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df = df.set_index(['B'])
print(df.loc['one'])
</code>
<code>df = df.set_index(['B']) print(df.loc['one']) </code>
df = df.set_index(['B'])
print(df.loc['one'])

yields

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
</code>
<code> A C D B one foo 0 0 one bar 1 2 one foo 6 12 </code>
       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.loc[df.index.isin(['one','two'])]
</code>
<code>df.loc[df.index.isin(['one','two'])] </code>
df.loc[df.index.isin(['one','two'])]

yields

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
</code>
<code> A C D B one foo 0 0 one bar 1 2 two foo 2 4 two foo 4 8 two bar 5 10 one foo 6 12 </code>
       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

3

There are several ways to select rows from a Pandas dataframe:

  1. Boolean indexing (df[df['col'] == value] )
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column 'A' == 'foo'

(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)


Setup

The first thing we’ll need is to identify a condition that will act as our criterion for selecting rows. We’ll start with the OP’s case column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>import pandas as pd, numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
</code>
<code>import pandas as pd, numpy as np df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(), 'B': 'one one two three two two one three'.split(), 'C': np.arange(8), 'D': np.arange(8) * 2}) </code>
import pandas as pd, numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

1. Boolean indexing

… Boolean indexing requires finding the true value of each row’s 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we’d name this series, an array of truth values, mask. We’ll do so here as well.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>mask = df['A'] == 'foo'
</code>
<code>mask = df['A'] == 'foo' </code>
mask = df['A'] == 'foo'

We can then use this mask to slice or index the data frame

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code>df[mask] A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn’t an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn’t one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code>mask = df['A'] == 'foo' pos = np.flatnonzero(mask) df.iloc[pos] A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.set_index('A', append=True, drop=False).xs('foo', level=1)
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code>df.set_index('A', append=True, drop=False).xs('foo', level=1) A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
df.set_index('A', append=True, drop=False).xs('foo', level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.query('A == "foo"')
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code>df.query('A == "foo"') A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
df.query('A == "foo"')

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

My preference is to use the Boolean mask

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1
Use the underlying NumPy array and forgo the overhead of creating another pd.Series

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>mask = df['A'].values == 'foo'
</code>
<code>mask = df['A'].values == 'foo' </code>
mask = df['A'].values == 'foo'

I’ll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'
5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
</code>
<code>%timeit mask = df['A'].values == 'foo' %timeit mask = df['A'] == 'foo' 5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) </code>
%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the NumPy array is ~ 30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we’ll look at the timing for slicing with one mask versus the other.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]
219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
</code>
<code>mask = df['A'].values == 'foo' %timeit df[mask] mask = df['A'] == 'foo' %timeit df[mask] 219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) </code>
mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren’t as pronounced. We’ll see if this holds up over more robust testing.


mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe—you must take care of the dtypes when doing so!

Instead of df[mask] we will do this

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
</code>
<code>pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes) </code>
pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
</code>
<code>%timeit df[m] %timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes) 216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) </code>
%timeit df[m]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
d1
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
6 8 7 6 4 7
7 6 2 6 6 5
8 2 8 7 5 8
9 4 7 6 1 5
</code>
<code>np.random.seed([3,1415]) d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE')) d1 A B C D E 0 0 2 7 3 8 1 7 0 6 8 6 2 0 2 0 4 9 3 7 3 2 4 3 4 3 6 7 7 4 5 5 3 7 5 9 6 8 7 6 4 7 7 6 2 6 6 5 8 2 8 7 5 8 9 4 7 6 1 5 </code>
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>%%timeit
mask = d1['A'].values == 7
d1[mask]
179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
</code>
<code>%%timeit mask = d1['A'].values == 7 d1[mask] 179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) </code>
%%timeit
mask = d1['A'].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)
87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
</code>
<code>%%timeit mask = d1['A'].values == 7 pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns) 87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) </code>
%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.


mask alternative 3

@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>mask = df['A'].isin(['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code>mask = df['A'].isin(['foo']) df[mask] A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
mask = df['A'].isin(['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We’ll use np.in1d

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>mask = np.in1d(df['A'].values, ['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code>mask = np.in1d(df['A'].values, ['foo']) df[mask] A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
mask = np.in1d(df['A'].values, ['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing

I’ll include other concepts mentioned in other posts as well for reference.

Code Below

Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>res.div(res.min())
10 30 100 300 1000 3000 10000 30000
mask_standard 2.156872 1.850663 2.034149 2.166312 2.164541 3.090372 2.981326 3.131151
mask_standard_loc 1.879035 1.782366 1.988823 2.338112 2.361391 3.036131 2.998112 2.990103
mask_with_values 1.010166 1.000000 1.005113 1.026363 1.028698 1.293741 1.007824 1.016919
mask_with_values_loc 1.196843 1.300228 1.000000 1.000000 1.038989 1.219233 1.037020 1.000000
query 4.997304 4.765554 5.934096 4.500559 2.997924 2.397013 1.680447 1.398190
xs_label 4.124597 4.272363 5.596152 4.295331 4.676591 5.710680 6.032809 8.950255
mask_with_isin 1.674055 1.679935 1.847972 1.724183 1.345111 1.405231 1.253554 1.264760
mask_with_in1d 1.000000 1.083807 1.220493 1.101929 1.000000 1.000000 1.000000 1.144175
</code>
<code>res.div(res.min()) 10 30 100 300 1000 3000 10000 30000 mask_standard 2.156872 1.850663 2.034149 2.166312 2.164541 3.090372 2.981326 3.131151 mask_standard_loc 1.879035 1.782366 1.988823 2.338112 2.361391 3.036131 2.998112 2.990103 mask_with_values 1.010166 1.000000 1.005113 1.026363 1.028698 1.293741 1.007824 1.016919 mask_with_values_loc 1.196843 1.300228 1.000000 1.000000 1.038989 1.219233 1.037020 1.000000 query 4.997304 4.765554 5.934096 4.500559 2.997924 2.397013 1.680447 1.398190 xs_label 4.124597 4.272363 5.596152 4.295331 4.676591 5.710680 6.032809 8.950255 mask_with_isin 1.674055 1.679935 1.847972 1.724183 1.345111 1.405231 1.253554 1.264760 mask_with_in1d 1.000000 1.083807 1.220493 1.101929 1.000000 1.000000 1.000000 1.144175 </code>
res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You’ll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>res.T.plot(loglog=True)
</code>
<code>res.T.plot(loglog=True) </code>
res.T.plot(loglog=True)

Functions

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>def mask_standard(df):
mask = df['A'] == 'foo'
return df[mask]
def mask_standard_loc(df):
mask = df['A'] == 'foo'
return df.loc[mask]
def mask_with_values(df):
mask = df['A'].values == 'foo'
return df[mask]
def mask_with_values_loc(df):
mask = df['A'].values == 'foo'
return df.loc[mask]
def query(df):
return df.query('A == "foo"')
def xs_label(df):
return df.set_index('A', append=True, drop=False).xs('foo', level=-1)
def mask_with_isin(df):
mask = df['A'].isin(['foo'])
return df[mask]
def mask_with_in1d(df):
mask = np.in1d(df['A'].values, ['foo'])
return df[mask]
</code>
<code>def mask_standard(df): mask = df['A'] == 'foo' return df[mask] def mask_standard_loc(df): mask = df['A'] == 'foo' return df.loc[mask] def mask_with_values(df): mask = df['A'].values == 'foo' return df[mask] def mask_with_values_loc(df): mask = df['A'].values == 'foo' return df.loc[mask] def query(df): return df.query('A == "foo"') def xs_label(df): return df.set_index('A', append=True, drop=False).xs('foo', level=-1) def mask_with_isin(df): mask = df['A'].isin(['foo']) return df[mask] def mask_with_in1d(df): mask = np.in1d(df['A'].values, ['foo']) return df[mask] </code>
def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

Testing

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>res = pd.DataFrame(
index=[
'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)
for j in res.columns:
d = pd.concat([df] * j, ignore_index=True)
for i in res.index:a
stmt = '{}(d)'.format(i)
setp = 'from __main__ import d, {}'.format(i)
res.at[i, j] = timeit(stmt, setp, number=50)
</code>
<code>res = pd.DataFrame( index=[ 'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc', 'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d' ], columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000], dtype=float ) for j in res.columns: d = pd.concat([df] * j, ignore_index=True) for i in res.index:a stmt = '{}(d)'.format(i) setp = 'from __main__ import d, {}'.format(i) res.at[i, j] = timeit(stmt, setp, number=50) </code>
res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:a
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing

Looking at the special case when we have a single non-object dtype for the entire data frame.

Code Below

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>spec.div(spec.min())
10 30 100 300 1000 3000 10000 30000
mask_with_values 1.009030 1.000000 1.194276 1.000000 1.236892 1.095343 1.000000 1.000000
mask_with_in1d 1.104638 1.094524 1.156930 1.072094 1.000000 1.000000 1.040043 1.027100
reconstruct 1.000000 1.142838 1.000000 1.355440 1.650270 2.222181 2.294913 3.406735
</code>
<code>spec.div(spec.min()) 10 30 100 300 1000 3000 10000 30000 mask_with_values 1.009030 1.000000 1.194276 1.000000 1.236892 1.095343 1.000000 1.000000 mask_with_in1d 1.104638 1.094524 1.156930 1.072094 1.000000 1.000000 1.040043 1.027100 reconstruct 1.000000 1.142838 1.000000 1.355440 1.650270 2.222181 2.294913 3.406735 </code>
spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn’t worth it past a few hundred rows.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>spec.T.plot(loglog=True)
</code>
<code>spec.T.plot(loglog=True) </code>
spec.T.plot(loglog=True)

Functions

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
def mask_with_values(df):
mask = df['A'].values == 'foo'
return df[mask]
def mask_with_in1d(df):
mask = np.in1d(df['A'].values, ['foo'])
return df[mask]
def reconstruct(df):
v = df.values
mask = np.in1d(df['A'].values, ['foo'])
return pd.DataFrame(v[mask], df.index[mask], df.columns)
spec = pd.DataFrame(
index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)
</code>
<code>np.random.seed([3,1415]) d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE')) def mask_with_values(df): mask = df['A'].values == 'foo' return df[mask] def mask_with_in1d(df): mask = np.in1d(df['A'].values, ['foo']) return df[mask] def reconstruct(df): v = df.values mask = np.in1d(df['A'].values, ['foo']) return pd.DataFrame(v[mask], df.index[mask], df.columns) spec = pd.DataFrame( index=['mask_with_values', 'mask_with_in1d', 'reconstruct'], columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000], dtype=float ) </code>
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>for j in spec.columns:
d = pd.concat([df] * j, ignore_index=True)
for i in spec.index:
stmt = '{}(d)'.format(i)
setp = 'from __main__ import d, {}'.format(i)
spec.at[i, j] = timeit(stmt, setp, number=50)
</code>
<code>for j in spec.columns: d = pd.concat([df] * j, ignore_index=True) for i in spec.index: stmt = '{}(d)'.format(i) setp = 'from __main__ import d, {}'.format(i) spec.at[i, j] = timeit(stmt, setp, number=50) </code>
for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

1

tl;dr

The Pandas equivalent to

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>select * from table where column_name = some_value
</code>
<code>select * from table where column_name = some_value </code>
select * from table where column_name = some_value

is

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>table[table.column_name == some_value]
</code>
<code>table[table.column_name == some_value] </code>
table[table.column_name == some_value]

Multiple conditions:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
</code>
<code>table[(table.column_name == some_value) | (table.column_name2 == some_value2)] </code>
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]

or

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>table.query('column_name == some_value | column_name2 == some_value2')
</code>
<code>table.query('column_name == some_value | column_name2 == some_value2') </code>
table.query('column_name == some_value | column_name2 == some_value2')

Code example

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
</code>
<code>import pandas as pd # Create data set d = {'foo':[100, 111, 222], 'bar':[333, 444, 555]} df = pd.DataFrame(d) # Full dataframe: df # Shows: # bar foo # 0 333 100 # 1 444 111 # 2 555 222 # Output only the row(s) in df where foo is 222: df[df.foo == 222] # Shows: # bar foo # 2 555 222 </code>
import pandas as pd

# Create data set
d = {'foo':[100, 111, 222],
     'bar':[333, 444, 555]}
df = pd.DataFrame(d)

# Full dataframe:
df

# Shows:
#    bar   foo
# 0  333   100
# 1  444   111
# 2  555   222

# Output only the row(s) in df where foo is 222:
df[df.foo == 222]

# Shows:
#    bar  foo
# 2  555  222

In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.

Multiple conditions are also possible:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
</code>
<code>df[(df.foo == 222) | (df.bar == 444)] # bar foo # 1 444 111 # 2 555 222 </code>
df[(df.foo == 222) | (df.bar == 444)]
#    bar  foo
# 1  444  111
# 2  555  222

But at that point I would recommend using the query function, since it’s less verbose and yields the same result:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.query('foo == 222 | bar == 444')
</code>
<code>df.query('foo == 222 | bar == 444') </code>
df.query('foo == 222 | bar == 444')

0

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').

Reproduced from The query() Method (Experimental):

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
Out[169]:
a b c
0 0.687704 0.582314 0.281645
1 0.250846 0.610021 0.420121
2 0.624328 0.401816 0.932146
3 0.011763 0.022921 0.244186
4 0.590198 0.325680 0.890392
5 0.598892 0.296424 0.007312
6 0.634625 0.803069 0.123872
7 0.924168 0.325076 0.303746
8 0.116822 0.364564 0.454607
9 0.986142 0.751953 0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
</code>
<code>In [167]: n = 10 In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc')) In [169]: df Out[169]: a b c 0 0.687704 0.582314 0.281645 1 0.250846 0.610021 0.420121 2 0.624328 0.401816 0.932146 3 0.011763 0.022921 0.244186 4 0.590198 0.325680 0.890392 5 0.598892 0.296424 0.007312 6 0.634625 0.803069 0.123872 7 0.924168 0.325076 0.303746 8 0.116822 0.364564 0.454607 9 0.986142 0.751953 0.561512 # pure python In [170]: df[(df.a < df.b) & (df.b < df.c)] Out[170]: a b c 3 0.011763 0.022921 0.244186 8 0.116822 0.364564 0.454607 # query In [171]: df.query('(a < b) & (b < c)') Out[171]: a b c 3 0.011763 0.022921 0.244186 8 0.116822 0.364564 0.454607 </code>
In [167]: n = 10

In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [169]: df
Out[169]:
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512

# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]:
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

You can also access variables in the environment by prepending an @.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>exclude = ('red', 'orange')
df.query('color not in @exclude')
</code>
<code>exclude = ('red', 'orange') df.query('color not in @exclude') </code>
exclude = ('red', 'orange')
df.query('color not in @exclude')

0

More flexibility using .query with pandas >= 0.25.0:

Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) – see GitHub:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code># Example dataframe
df = pd.DataFrame({'Sender email':['[email protected]', "[email protected]", "[email protected]"]})
Sender email
0 [email protected]
1 [email protected]
2 [email protected]
</code>
<code># Example dataframe df = pd.DataFrame({'Sender email':['[email protected]', "[email protected]", "[email protected]"]}) Sender email 0 [email protected] 1 [email protected] 2 [email protected] </code>
# Example dataframe
df = pd.DataFrame({'Sender email':['[email protected]', "[email protected]", "[email protected]"]})

     Sender email
0  [email protected]
1  [email protected]
2    [email protected]

Using .query with method str.endswith:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.query('`Sender email`.str.endswith("@shop.com")')
</code>
<code>df.query('`Sender email`.str.endswith("@shop.com")') </code>
df.query('`Sender email`.str.endswith("@shop.com")')

Output

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> Sender email
1 [email protected]
2 [email protected]
</code>
<code> Sender email 1 [email protected] 2 [email protected] </code>
     Sender email
1  [email protected]
2    [email protected]

Also we can use local variables by prefixing it with an @ in our query:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')
</code>
<code>domain = 'shop.com' df.query('`Sender email`.str.endswith(@domain)') </code>
domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')

Output

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> Sender email
1 [email protected]
2 [email protected]
</code>
<code> Sender email 1 [email protected] 2 [email protected] </code>
     Sender email
1  [email protected]
2    [email protected]

0

For selecting only specific columns out of multiple columns for a given value in Pandas:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>select col_name1, col_name2 from table where column_name = some_value.
</code>
<code>select col_name1, col_name2 from table where column_name = some_value. </code>
select col_name1, col_name2 from table where column_name = some_value.

Options loc:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.loc[df['column_name'] == some_value, [col_name1, col_name2]]
</code>
<code>df.loc[df['column_name'] == some_value, [col_name1, col_name2]] </code>
df.loc[df['column_name'] == some_value, [col_name1, col_name2]]

or query:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.query('column_name == some_value')[[col_name1, col_name2]]
</code>
<code>df.query('column_name == some_value')[[col_name1, col_name2]] </code>
df.query('column_name == some_value')[[col_name1, col_name2]]

0

In newer versions of Pandas, inspired by the documentation (Viewing data):

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[df["colume_name"] == some_value] #Scalar, True/False..
df[df["colume_name"] == "some_value"] #String
</code>
<code>df[df["colume_name"] == some_value] #Scalar, True/False.. df[df["colume_name"] == "some_value"] #String </code>
df[df["colume_name"] == some_value] #Scalar, True/False..

df[df["colume_name"] == "some_value"] #String

Combine multiple conditions by putting the clause in parentheses, (), and combining them with & and | (and/or). Like this:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[(df["colume_name"] == "some_value1") & (pd[pd["colume_name"] == "some_value2"])]
</code>
<code>df[(df["colume_name"] == "some_value1") & (pd[pd["colume_name"] == "some_value2"])] </code>
df[(df["colume_name"] == "some_value1") & (pd[pd["colume_name"] == "some_value2"])]

Other filters

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>pandas.notna(df["colume_name"]) == True # Not NaN
df['colume_name'].str.contains("text") # Search for "text"
df['colume_name'].str.lower().str.contains("text") # Search for "text", after converting to lowercase
</code>
<code>pandas.notna(df["colume_name"]) == True # Not NaN df['colume_name'].str.contains("text") # Search for "text" df['colume_name'].str.lower().str.contains("text") # Search for "text", after converting to lowercase </code>
pandas.notna(df["colume_name"]) == True # Not NaN
df['colume_name'].str.contains("text") # Search for "text"
df['colume_name'].str.lower().str.contains("text") # Search for "text", after converting  to lowercase

0

Faster results can be achieved using numpy.where.

For example, with unubtu’s setup –

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]:
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
</code>
<code>In [76]: df.iloc[np.where(df.A.values=='foo')] Out[76]: A B C D 0 foo one 0 0 2 foo two 2 4 4 foo two 4 8 6 foo one 6 12 7 foo three 7 14 </code>
In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]: 
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing comparisons:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest
1000 loops, best of 3: 380 µs per loop
In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop
In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop
In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop
In [74]: %timeit df.query('(A=="foo")') # slowest
1000 loops, best of 3: 1.71 ms per loop
</code>
<code>In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest 1000 loops, best of 3: 380 µs per loop In [69]: %timeit df.loc[df['A'] == 'foo'] 1000 loops, best of 3: 745 µs per loop In [71]: %timeit df.loc[df['A'].isin(['foo'])] 1000 loops, best of 3: 562 µs per loop In [72]: %timeit df[df.A=='foo'] 1000 loops, best of 3: 796 µs per loop In [74]: %timeit df.query('(A=="foo")') # slowest 1000 loops, best of 3: 1.71 ms per loop </code>
In [68]: %timeit df.iloc[np.where(df.A.values=='foo')]  # fastest
1000 loops, best of 3: 380 µs per loop

In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop

In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop

In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop

In [74]: %timeit df.query('(A=="foo")')  # slowest
1000 loops, best of 3: 1.71 ms per loop

Here is a simple example

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>from pandas import DataFrame
# Create data set
d = {'Revenue':[100,111,222],
'Cost':[333,444,555]}
df = DataFrame(d)
# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111
print mask
# Result:
# 0 False
# 1 True
# 2 False
# Name: Revenue, dtype: bool
# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
# Cost Revenue
# 1 444 111
</code>
<code>from pandas import DataFrame # Create data set d = {'Revenue':[100,111,222], 'Cost':[333,444,555]} df = DataFrame(d) # mask = Return True when the value in column "Revenue" is equal to 111 mask = df['Revenue'] == 111 print mask # Result: # 0 False # 1 True # 2 False # Name: Revenue, dtype: bool # Select * FROM df WHERE Revenue = 111 df[mask] # Result: # Cost Revenue # 1 444 111 </code>
from pandas import DataFrame

# Create data set
d = {'Revenue':[100,111,222], 
     'Cost':[333,444,555]}
df = DataFrame(d)


# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111

print mask

# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool


# Select * FROM df WHERE Revenue = 111
df[mask]

# Result:
#    Cost    Revenue
# 1  444     111

To add: You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with specified column having a particular value. E.g.,

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1)
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)
</code>
<code>import pandas as pd df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(), 'B': 'one one two three two two one three'.split()}) print("Original dataframe:") print(df) b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1) #NOTE: the final drop is to remove the extra index column returned by groupby object print('Sub dataframe where B is two:') print(b_is_two_dataframe) </code>
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)

b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1) 
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)

Running this gives:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>Original dataframe:
A B
0 foo one
1 bar one
2 foo two
3 bar three
4 foo two
5 bar two
6 foo one
7 foo three
Sub dataframe where B is two:
A B
0 foo two
1 foo two
2 bar two
</code>
<code>Original dataframe: A B 0 foo one 1 bar one 2 foo two 3 bar three 4 foo two 5 bar two 6 foo one 7 foo three Sub dataframe where B is two: A B 0 foo two 1 foo two 2 bar two </code>
Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two

1

1. Use f-strings inside query() calls

If the column name used to filter your dataframe comes from a local variable, f-strings may be useful. For example,

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>col = 'A'
df.query(f"{col} == 'foo'")
</code>
<code>col = 'A' df.query(f"{col} == 'foo'") </code>
col = 'A'
df.query(f"{col} == 'foo'")

In fact, f-strings can be used for the query variable as well (except for datetime):

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>col = 'A'
my_var = 'foo'
df.query(f"{col} == '{my_var}'") # if my_var is a string
my_num = 1
df.query(f"{col} == {my_num}") # if my_var is a number
my_date = '2022-12-10'
df.query(f"{col} == @my_date") # must use @ for datetime though
</code>
<code>col = 'A' my_var = 'foo' df.query(f"{col} == '{my_var}'") # if my_var is a string my_num = 1 df.query(f"{col} == {my_num}") # if my_var is a number my_date = '2022-12-10' df.query(f"{col} == @my_date") # must use @ for datetime though </code>
col = 'A'
my_var = 'foo'
df.query(f"{col} == '{my_var}'") # if my_var is a string

my_num = 1
df.query(f"{col} == {my_num}") # if my_var is a number

my_date = '2022-12-10'
df.query(f"{col} == @my_date") # must use @ for datetime though

2. Install numexpr to speed up query() calls

The pandas documentation recommends installing numexpr to speed up numeric calculation when using query(). Use pip install numexpr (or conda, sudo etc. depending on your environment) to install it.

For larger dataframes (where performance actually matters), df.query() with numexpr engine performs much faster than df[mask]. In particular, it performs better for the following cases.

Logical and/or comparison operators on columns of strings

If a column of strings are compared to some other string(s) and matching rows are to be selected, even for a single comparison operation, query() performs faster than df[mask]. For example, for a dataframe with 80k rows, it’s 30% faster1 and for a dataframe with 800k rows, it’s 60% faster.2

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[df.A == 'foo']
df.query("A == 'foo'") # <--- performs 30%-60% faster
</code>
<code>df[df.A == 'foo'] df.query("A == 'foo'") # <--- performs 30%-60% faster </code>
df[df.A == 'foo']
df.query("A == 'foo'")  # <--- performs 30%-60% faster

This gap increases as the number of operations increases (if 4 comparisons are chained df.query() is 2-2.3 times faster than df[mask])1,2 and/or the dataframe length increases.2

Multiple operations on numeric columns

If multiple arithmetic, logical or comparison operations need to be computed to create a boolean mask to filter df, query() performs faster. For example, for a frame with 80k rows, it’s 20% faster1 and for a frame with 800k rows, it’s 2 times faster.2

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[(df.B % 5) **2 < 0.1]
df.query("(B % 5) **2 < 0.1") # <--- performs 20%-100% faster.
</code>
<code>df[(df.B % 5) **2 < 0.1] df.query("(B % 5) **2 < 0.1") # <--- performs 20%-100% faster. </code>
df[(df.B % 5) **2 < 0.1]
df.query("(B % 5) **2 < 0.1")  # <--- performs 20%-100% faster.

This gap in performance increases as the number of operations increases and/or the dataframe length increases.2

The following plot shows how the methods perform as the dataframe length increases.3

3. Call pandas methods inside query()

Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).

For example, it doesn’t support integer division (//). However, calling the equivalent pandas method (floordiv()) works.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.query('B.floordiv(2) <= 3') # or
df.query('B.floordiv(2).le(3)')
# for pandas < 1.4, need `.values`
df.query('B.floordiv(2).values <= 3')
</code>
<code>df.query('B.floordiv(2) <= 3') # or df.query('B.floordiv(2).le(3)') # for pandas < 1.4, need `.values` df.query('B.floordiv(2).values <= 3') </code>
df.query('B.floordiv(2) <= 3')  # or 
df.query('B.floordiv(2).le(3)')

# for pandas < 1.4, need `.values`
df.query('B.floordiv(2).values <= 3')


1 Benchmark code using a frame with 80k rows

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>import numpy as np
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000,
'B': np.random.rand(80000)})
%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code>
<code>import numpy as np df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000, 'B': np.random.rand(80000)}) %timeit df[df.A == 'foo'] # 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %timeit df.query("A == 'foo'") # 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))] # 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each) %timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'") # 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each) %timeit df[(df.B % 5) **2 < 0.1] # 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %timeit df.query("(B % 5) **2 < 0.1") # 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) </code>
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000, 
                   'B': np.random.rand(80000)})

%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[(df.B % 5) **2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2 Benchmark code using a frame with 800k rows

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000,
'B': np.random.rand(800000)})
%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
</code>
<code>df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000, 'B': np.random.rand(800000)}) %timeit df[df.A == 'foo'] # 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each) %timeit df.query("A == 'foo'") # 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each) %timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))] # 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each) %timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'") # 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each) %timeit df[(df.B % 5) **2 < 0.1] # 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each) %timeit df.query("(B % 5) **2 < 0.1") # 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each) </code>
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000, 
                   'B': np.random.rand(800000)})

%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[(df.B % 5) **2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

3: Code used to produce the performance graphs of the two methods for strings and numbers.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
setup=constructor,
kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Multiple mathematical operations on numbers',
equality_check=pd.DataFrame.equals);
plot(
setup=constructor,
kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Comparison operation on strings',
equality_check=pd.DataFrame.equals);
</code>
<code>from perfplot import plot constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)}) plot( setup=constructor, kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")], labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'], n_range=[2**k for k in range(4, 24)], xlabel='Rows in DataFrame', title='Multiple mathematical operations on numbers', equality_check=pd.DataFrame.equals); plot( setup=constructor, kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")], labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""], n_range=[2**k for k in range(4, 24)], xlabel='Rows in DataFrame', title='Comparison operation on strings', equality_check=pd.DataFrame.equals); </code>
from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
    setup=constructor,
    kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
    labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Multiple mathematical operations on numbers',
    equality_check=pd.DataFrame.equals);
plot(
    setup=constructor,
    kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
    labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Comparison operation on strings',
    equality_check=pd.DataFrame.equals);

You can also use .apply:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df.apply(lambda row: row[df['B'].isin(['one','three'])])
</code>
<code>df.apply(lambda row: row[df['B'].isin(['one','three'])]) </code>
df.apply(lambda row: row[df['B'].isin(['one','three'])])

It actually works row-wise (i.e., applies the function to each row).

The output is

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
</code>
<code> A B C D 0 foo one 0 0 1 bar one 1 2 3 bar three 3 6 6 foo one 6 12 7 foo three 7 14 </code>
   A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

The results is the same as using as mentioned by @unutbu

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[[df['B'].isin(['one','three'])]]
</code>
<code>df[[df['B'].isin(['one','three'])]] </code>
df[[df['B'].isin(['one','three'])]]

If you want to make query to your dataframe repeatedly and speed is important to you, the best thing is to convert your dataframe to dictionary and then by doing this you can make query thousands of times faster.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')
</code>
<code>my_df = df.set_index(column_name) my_dict = my_df.to_dict('index') </code>
my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')

After make my_dict dictionary you can go through:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>if some_value in my_dict.keys():
my_result = my_dict[some_value]
</code>
<code>if some_value in my_dict.keys(): my_result = my_dict[some_value] </code>
if some_value in my_dict.keys():
   my_result = my_dict[some_value]

If you have duplicated values in column_name you can’t make a dictionary. but you can use:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>my_result = my_df.loc[some_value]
</code>
<code>my_result = my_df.loc[some_value] </code>
my_result = my_df.loc[some_value]

You can use loc (square brackets) with a function:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code># Series
s = pd.Series([1, 2, 3, 4])
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]
</code>
<code># Series s = pd.Series([1, 2, 3, 4]) s.loc[lambda x: x > 1] # s[lambda x: x > 1] </code>
# Series
s = pd.Series([1, 2, 3, 4]) 
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]

Output:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>1 2
2 3
3 4
dtype: int64
</code>
<code>1 2 2 3 3 4 dtype: int64 </code>
1    2
2    3
3    4
dtype: int64

or

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code># DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df[lambda x: (x['A'] != 1) & (x['B'] != 30)]
</code>
<code># DataFrame df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]}) df[lambda x: (x['A'] != 1) & (x['B'] != 30)] </code>
# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df[lambda x: (x['A'] != 1) & (x['B'] != 30)]

Output:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> A B
1 2 20
</code>
<code> A B 1 2 20 </code>
   A   B
1  2  20

1

SQL statements on DataFrames to select rows using DuckDB

With DuckDB we can query pandas DataFrames with SQL statements, in a highly performant way.

Since the question is How do I select rows from a DataFrame based on column values?, and the example in the question is a SQL query, this answer looks logical in this topic.

Example:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>In [1]: import duckdb
In [2]: import pandas as pd
In [3]: con = duckdb.connect()
In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})
In [5]: df
Out[5]:
A B
0 0 11
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 6 17
7 7 18
8 8 19
9 9 20
10 10 21
In [6]: results = con.execute("SELECT * FROM df where A > 2").df()
In [7]: results
Out[7]:
A B
0 3 14
1 4 15
2 5 16
3 6 17
4 7 18
5 8 19
6 9 20
7 10 21
</code>
<code>In [1]: import duckdb In [2]: import pandas as pd In [3]: con = duckdb.connect() In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)}) In [5]: df Out[5]: A B 0 0 11 1 1 12 2 2 13 3 3 14 4 4 15 5 5 16 6 6 17 7 7 18 8 8 19 9 9 20 10 10 21 In [6]: results = con.execute("SELECT * FROM df where A > 2").df() In [7]: results Out[7]: A B 0 3 14 1 4 15 2 5 16 3 6 17 4 7 18 5 8 19 6 9 20 7 10 21 </code>
In [1]: import duckdb

In [2]: import pandas as pd

In [3]: con = duckdb.connect()

In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})

In [5]: df
Out[5]:
     A   B
0    0  11
1    1  12
2    2  13
3    3  14
4    4  15
5    5  16
6    6  17
7    7  18
8    8  19
9    9  20
10  10  21

In [6]: results = con.execute("SELECT * FROM df where A > 2").df()

In [7]: results
Out[7]:
    A   B
0   3  14
1   4  15
2   5  16
3   6  17
4   7  18
5   8  19
6   9  20
7  10  21

Here are options using pandas built-in functions, similar to isin.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> df = pd.DataFrame({'cost': [250, 150, 100], 'revenue': [100, 250, 300]},index=['A', 'B', 'C'])
cost revenue
A 250 100
B 150 250
C 100 300
</code>
<code> df = pd.DataFrame({'cost': [250, 150, 100], 'revenue': [100, 250, 300]},index=['A', 'B', 'C']) cost revenue A 250 100 B 150 250 C 100 300 </code>
 df = pd.DataFrame({'cost': [250, 150, 100], 'revenue': [100, 250, 300]},index=['A', 'B', 'C'])

   cost  revenue
A   250      100
B   150      250
C   100      300

Compare DataFrames for equality elementwise

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code> df[df["cost"].eq(250)]
cost revenue
A 250 100
</code>
<code> df[df["cost"].eq(250)] cost revenue A 250 100 </code>
 df[df["cost"].eq(250)]

   cost  revenue
A   250      100

Compare DataFrames for greater than inequality or equality elementwise.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[df["cost"].ge(100)]
cost revenue
A 250 100
B 150 250
C 100 300
</code>
<code>df[df["cost"].ge(100)] cost revenue A 250 100 B 150 250 C 100 300 </code>
df[df["cost"].ge(100)]

   cost  revenue
A   250      100
B   150      250
C   100      300

Compare DataFrames for strictly less than inequality elementwise.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>df[df["cost"].lt(200)]
cost revenue
B 150 250
C 100 300
</code>
<code>df[df["cost"].lt(200)] cost revenue B 150 250 C 100 300 </code>
df[df["cost"].lt(200)]

   cost  revenue
B   150      250
C   100      300

Great answers. Only, when the size of the dataframe approaches million rows, many of the methods tend to take ages when using df[df['col']==val]. I wanted to have all possible values of “another_column” that correspond to specific values in “some_column” (in this case in a dictionary). This worked and fast.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>s=datetime.datetime.now()
my_dict={}
for i, my_key in enumerate(df['some_column'].values):
if i%100==0:
print(i) # to see the progress
if my_key not in my_dict.keys():
my_dict[my_key]={}
my_dict[my_key]['values']=[df.iloc[i]['another_column']]
else:
my_dict[my_key]['values'].append(df.iloc[i]['another_column'])
e=datetime.datetime.now()
print('operation took '+str(e-s)+' seconds')```
</code>
<code>s=datetime.datetime.now() my_dict={} for i, my_key in enumerate(df['some_column'].values): if i%100==0: print(i) # to see the progress if my_key not in my_dict.keys(): my_dict[my_key]={} my_dict[my_key]['values']=[df.iloc[i]['another_column']] else: my_dict[my_key]['values'].append(df.iloc[i]['another_column']) e=datetime.datetime.now() print('operation took '+str(e-s)+' seconds')``` </code>
s=datetime.datetime.now()

my_dict={}

for i, my_key in enumerate(df['some_column'].values): 
    if i%100==0:
        print(i)  # to see the progress
    if my_key not in my_dict.keys():
        my_dict[my_key]={}
        my_dict[my_key]['values']=[df.iloc[i]['another_column']]
    else:
        my_dict[my_key]['values'].append(df.iloc[i]['another_column'])
        
e=datetime.datetime.now()

print('operation took '+str(e-s)+' seconds')```

You can just use based on boolean indexing, just selecting rows on Pandas lib. SQL will be like this, for example:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>import pandas as pd
# create dataset
data = {
'ramal': [991, 990, 989, 988, 987],
'wave': ['p', 'q', 'r', 'v', 's'],
'suite': [102, 118, 301, 206, 209]
}
df = pd.DataFrame(data)
</code>
<code>import pandas as pd # create dataset data = { 'ramal': [991, 990, 989, 988, 987], 'wave': ['p', 'q', 'r', 'v', 's'], 'suite': [102, 118, 301, 206, 209] } df = pd.DataFrame(data) </code>
import pandas as pd
# create dataset
data = {
    'ramal': [991, 990, 989, 988, 987],
    'wave': ['p', 'q', 'r', 'v', 's'],
    'suite': [102, 118, 301, 206, 209]
}

df = pd.DataFrame(data)

Knowing column equals ‘p’, you might select rows which wave, is like for those example, as I show on position of some “wave” parameter.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>result = df[df['wave'] == 'p']
</code>
<code>result = df[df['wave'] == 'p'] </code>
result = df[df['wave'] == 'p']

based on the question could be use the general one

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>result = df.query('col == @val')
</code>
<code>result = df.query('col == @val') </code>
result = df.query('col == @val')

So, the indexing of syntax select rows on a column value.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>result = df[df['column_wave'] == some_value]
</code>
<code>result = df[df['column_wave'] == some_value] </code>
result = df[df['column_wave'] == some_value]

Something that I find very useful while handling messy dfs is the possibility to use a combination of both regex and the contains() function, in order to identify certain patterns, as follows:

🔍 Finding pattern X/X12 (where “X” can be anything)

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>import pandas as pd
import numpy as np
column_names = [
'id_num',
'abc/def12',
'test/test12',
'pattern/12312',
'no_match',
'other_column',
'qwerty/asdfgh12'
]
# DATAFRAME
data = np.random.rand(2, len(column_names))
df = pd.DataFrame(data, columns=column_names)
# REGEX PATTERN
pattern = r'.+/.*12$'
# SELECT COLUMNS
selected_columns = df.columns[(df.columns.str.contains(pattern)) | (df.columns == 'id_num')]
df_selected = df[selected_columns]
</code>
<code>import pandas as pd import numpy as np column_names = [ 'id_num', 'abc/def12', 'test/test12', 'pattern/12312', 'no_match', 'other_column', 'qwerty/asdfgh12' ] # DATAFRAME data = np.random.rand(2, len(column_names)) df = pd.DataFrame(data, columns=column_names) # REGEX PATTERN pattern = r'.+/.*12$' # SELECT COLUMNS selected_columns = df.columns[(df.columns.str.contains(pattern)) | (df.columns == 'id_num')] df_selected = df[selected_columns] </code>
import pandas as pd
import numpy as np

column_names = [
    'id_num',
    'abc/def12',
    'test/test12',
    'pattern/12312',
    'no_match',
    'other_column',
    'qwerty/asdfgh12'
]

# DATAFRAME
data = np.random.rand(2, len(column_names))
df = pd.DataFrame(data, columns=column_names)

# REGEX PATTERN
pattern = r'.+/.*12$'

# SELECT COLUMNS
selected_columns = df.columns[(df.columns.str.contains(pattern)) | (df.columns == 'id_num')]
df_selected = df[selected_columns]

Original DF:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>| | id_num | abc/def12 | test/test12 | pattern/12312 | no_match | other_column | qwerty/asdfgh12 |
|---:|---------:|------------:|--------------:|----------------:|-----------:|---------------:|------------------:|
| 0 | 0.344517 | 0.481556 | 0.215228 | 0.962699 | 0.393522 | 0.946116 | 0.672584 |
| 1 | 0.277832 | 0.674059 | 0.0142862 | 0.430597 | 0.00497118 | 0.948237 | 0.90443 |
</code>
<code>| | id_num | abc/def12 | test/test12 | pattern/12312 | no_match | other_column | qwerty/asdfgh12 | |---:|---------:|------------:|--------------:|----------------:|-----------:|---------------:|------------------:| | 0 | 0.344517 | 0.481556 | 0.215228 | 0.962699 | 0.393522 | 0.946116 | 0.672584 | | 1 | 0.277832 | 0.674059 | 0.0142862 | 0.430597 | 0.00497118 | 0.948237 | 0.90443 | </code>
|    |   id_num |   abc/def12 |   test/test12 |   pattern/12312 |   no_match |   other_column |   qwerty/asdfgh12 |
|---:|---------:|------------:|--------------:|----------------:|-----------:|---------------:|------------------:|
|  0 | 0.344517 |    0.481556 |     0.215228  |        0.962699 | 0.393522   |       0.946116 |          0.672584 |
|  1 | 0.277832 |    0.674059 |     0.0142862 |        0.430597 | 0.00497118 |       0.948237 |          0.90443  |

Final DF:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>| | id_num | abc/def12 | test/test12 | pattern/12312 | qwerty/asdfgh12 |
|---:|---------:|------------:|--------------:|----------------:|------------------:|
| 0 | 0.344517 | 0.481556 | 0.215228 | 0.962699 | 0.672584 |
| 1 | 0.277832 | 0.674059 | 0.0142862 | 0.430597 | 0.90443 |
</code>
<code>| | id_num | abc/def12 | test/test12 | pattern/12312 | qwerty/asdfgh12 | |---:|---------:|------------:|--------------:|----------------:|------------------:| | 0 | 0.344517 | 0.481556 | 0.215228 | 0.962699 | 0.672584 | | 1 | 0.277832 | 0.674059 | 0.0142862 | 0.430597 | 0.90443 | </code>
|    |   id_num |   abc/def12 |   test/test12 |   pattern/12312 |   qwerty/asdfgh12 |
|---:|---------:|------------:|--------------:|----------------:|------------------:|
|  0 | 0.344517 |    0.481556 |     0.215228  |        0.962699 |          0.672584 |
|  1 | 0.277832 |    0.674059 |     0.0142862 |        0.430597 |          0.90443  |

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị
Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa
Thiết kế website Thiết kế website Thiết kế website Cách kháng tài khoản quảng cáo Mua bán Fanpage Facebook Dịch vụ SEO Tổ chức sinh nhật