Why use loc in Pandas?

Why do we use loc for pandas DataFrames? It seems the following code runs at a similar speed with or without loc:

%timeit df_user1 = df.loc[df.user_id=='5561']

100 loops, best of 3: 11.9 ms per loop

%timeit df_user1_noloc = df[df.user_id=='5561']

100 loops, best of 3: 12 ms per loop

So why use loc?

Edit: This has been flagged as a duplicate question. Although “pandas iloc vs ix vs loc explanation?” does mention that

you can do column retrieval just by using the data frame’s
__getitem__:
df['time']    # equivalent to df.loc[:, 'time']

it does not say why we use loc, although it does explain lots of features of loc. But my specific question is: why not just omit loc altogether? For this question, I have accepted a very detailed answer below.

Also, in the above post the answer (which I do not think is an answer) is well hidden in the discussion. Anyone searching for what I was would find it hard to locate, and would be much better served by the answer provided to my question here.

Explicit is better than implicit.

df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:
```
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]: 
   False  True 
0      3      1
1      4      2
2      5      3
```
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:
```
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
```
Versus using loc:
```
In [231]: df.loc[[True]]
Out[231]: 
   False  True 
0      3      1
```
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as that of df above:
```
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]: 
   A  B
0  1  3
1  2  4
2  3  5

In [259]: df2[['B']]
Out[259]: 
   B
0  3
1  4
2  5
```
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know immediately that df.loc is selecting rows. In contrast, it is not clear whether df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
df.loc[row_indexer, column_indexer] can select rows and columns at the same time. df[indexer] can only select rows or columns, depending on the type of values in indexer and the type of column labels df has (again, are they boolean?).
```
In [237]: df2.loc[[True,False,True], 'B']
Out[237]: 
0    3
2    5
Name: B, dtype: int64
```
When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
```
In [239]: df2.loc[1:2]
Out[239]: 
   A  B
1  2  4
2  3  5

In [271]: df2[1:2]
Out[271]: 
   A  B
1  2  4
```
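The three points above can be collected into one small script. This is only a sketch that reuses the df2 frame from this answer; the variable names are made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# df[...] decides what to do from the type of the indexer:
rows_by_mask = df2[[True, False, True]]   # list of booleans -> row mask
cols_by_label = df2[["B"]]                # list of labels   -> column selection
rows_by_slice = df2[1:2]                  # slice            -> half-open row slice

# With .loc the axis is always spelled out: rows first, then columns.
loc_rows = df2.loc[[True, False, True]]
loc_cols = df2.loc[:, ["B"]]
loc_slice = df2.loc[1:2]                  # label slice, both end-points included

print(rows_by_mask.index.tolist())    # [0, 2]
print(cols_by_label.columns.tolist()) # ['B']
print(rows_by_slice.index.tolist())   # [1]
print(loc_slice.index.tolist())       # [1, 2]
```

The df[...] lines and the .loc lines return the same data in the first two cases; the difference is that the .loc form can be read without knowing what kind of object the indexer is.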

Performance Consideration on multiple columns “Chained Assignment” with and without using .loc

Let me supplement the already very good answers with the consideration of system performance.

The question itself compares the execution time of two pieces of code, with and without .loc. The execution times are roughly the same for the samples quoted. For other code, however, the difference with and without .loc can be considerable: several times or more!

A common case of pandas dataframe manipulation is creating a new column derived from the values of an existing column. We might try the code below to filter on a condition (based on the existing column) and set a value in the new column:

df[df['mark'] >= 50]['text_rating'] = 'Pass'

However, this kind of “Chained Assignment” does not work: df[...] with a boolean filter returns a “copy” rather than a “view”, so assigning the new column on that “copy” does not update the original dataframe.

There are two options:

1. Use .loc, or
2. Write it another way without using .loc

An example of the second option:

df['text_rating'][df['mark'] >= 50] = 'Pass'

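By placing the filter last (after selecting the new column), the assignment updated the original dataframe in the pandas version used here. Note that this form requires the text_rating column to exist already, and that it is still a chained assignment: pandas may emit a SettingWithCopyWarning for it, and under Copy-on-Write (announced as the default behaviour from pandas 3.0) it no longer updates the original dataframe.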

The solution using .loc is as follows:

df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
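To make the difference concrete, here is a minimal sketch on a made-up four-row frame (the marks are invented for illustration). It runs the chained form first, checks whether the original dataframe gained the column, and then runs the .loc form.

```python
import warnings

import pandas as pd

# Hypothetical sample data, not taken from the original post.
df = pd.DataFrame({"mark": [35, 50, 72, 48]})

# Chained assignment: df[mask] builds a new object, so the new column
# is attached to that temporary object and df itself is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning
    df[df["mark"] >= 50]["text_rating"] = "Pass"
chained_created_column = "text_rating" in df.columns

# A single .loc call selects rows and column in one operation on df itself.
df.loc[df["mark"] >= 50, "text_rating"] = "Pass"

print(chained_created_column)  # False
print(df)
```

After the .loc assignment the rows with mark >= 50 hold 'Pass' and the remaining rows hold a missing value, since the column did not exist before.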

Now, let’s see their execution time:

Without using .loc:

%%timeit 
df['text_rating'][df['mark'] >= 50] = 'Pass'

2.01 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using .loc:

%%timeit 
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'

577 µs ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As we can see, using .loc is more than 3 times faster!
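The figures above come from IPython's %%timeit on my own data. For anyone who wants to repeat the comparison in a plain script, a rough sketch with the standard timeit module could look like the following. The frame size, the random marks and the function names are assumptions chosen for illustration, and the absolute numbers (and the ratio) will vary with the pandas version and the machine.

```python
import timeit
import warnings

import numpy as np
import pandas as pd

# The chained form may emit a warning on every call; silence it for timing.
warnings.simplefilter("ignore")

# Hypothetical data: 10,000 random marks between 0 and 100.
rng = np.random.default_rng(0)
df = pd.DataFrame({"mark": rng.integers(0, 101, size=10_000)})
df["text_rating"] = "Fail"  # the chained form needs the column to exist


def chained():
    # Two indexing steps: select the column, then assign through a mask.
    df["text_rating"][df["mark"] >= 50] = "Pass"


def with_loc():
    # One indexing step on the dataframe itself.
    df.loc[df["mark"] >= 50, "text_rating"] = "Pass"


t_chained = timeit.timeit(chained, number=200)
t_loc = timeit.timeit(with_loc, number=200)
print(f"chained: {t_chained / 200 * 1e6:.0f} µs per call")
print(f".loc:    {t_loc / 200 * 1e6:.0f} µs per call")
```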

For a more detailed explanation of “Chained Assignment”, see the related post “How to deal with SettingWithCopyWarning in pandas?”, in particular the answer by cs95. That post is excellent at explaining the functional differences of using .loc. Here I am only adding the difference in execution time.

In addition to what has already been said (the problems with True and False as column names when not using loc, the ability to select rows and columns together with loc, and the ability to slice both rows and columns), another big difference is that you can use loc to assign values to specific rows and columns. If you select a subset of the dataframe using a boolean series and then try to change a value in that subset, you will likely get the SettingWithCopyWarning.

Let’s say you’re trying to change the “upper management” column for all the rows whose salary is bigger than 60000.

This:

mask = df["salary"] > 60000
df[mask]["upper management"] = True

throws the warning “A value is trying to be set on a copy of a slice from a DataFrame” and won’t work, because df[mask] creates a copy and updating “upper management” on that copy has no effect on the original df.

But this succeeds:

mask = df["salary"] > 60000
df.loc[mask,"upper management"] = True

Note that in both cases you can write the condition inline, as df[df["salary"] > 60000] or df.loc[df["salary"] > 60000], but I think storing the boolean condition in a variable first is cleaner.

