Pandas is undoubtedly one of the most important Python libraries in data science and machine learning. Many practitioners in the field are using it on a daily basis. But even as an experienced user, you might have been puzzled by this warning: SettingWithCopyWarning. A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.
It can be tempting to just ignore it, especially when the code does what you wanted. But taking the warning seriously not only helps prevent bugs in your data pipeline, it is also a good way to understand the internals of Pandas better. In this post, I will look at two different data transformations that both raise the warning, but for different reasons.
Let us assume we have the following dataframe:
import numpy as np
import pandas as pd

d = {'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, np.nan]}
df = pd.DataFrame(data=d)
print(df)
# col1 col2
# 0 1.0 4.0
# 1 2.0 5.0
# 2 3.0 NaN
Causes of the warning
Example 1
The first cause of the warning occurs like this: we might want to drop the missing values and call dropna() on our dataframe.
df_no_na = df.dropna()
print(df_no_na)
# col1 col2
# 0 1.0 4.0
# 1 2.0 5.0
We got a new dataframe, df_no_na. Pandas removed the row with missing values. Good job! Let's continue our transformation by creating a third column, which is the sum of columns one and two.
df_no_na['new_column'] = df_no_na['col1'] + df_no_na['col2']
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.
Oh no! We got SettingWithCopyWarning. Maybe we should just follow the instructions and use .loc[] indexing?
df_no_na.loc[:,'new_column'] = df_no_na['col1'] + df_no_na['col2']
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.
The assignment to df_no_na raised the warning again, even though we used a different indexing method. This is somewhat puzzling because SettingWithCopyWarning suggests using .loc[row_indexer,col_indexer] = value instead, and that is pretty much what we just tried above.
Example 2
Our first attempt to transform our dataframe was unsuccessful, so let's try another approach: instead of dropping all rows with missing values, we now want to replace the missing value in column 'col2' in the row where df.col1 == 3.0 with the value 0.0. We do this by indexing the row and then the column:
df[df.col1 == 3]['col2'] = 0
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.
Indexing in Pandas
The first thing we have to understand is that the expression df[df.col1 == 3]['col2'] is so-called syntactic sugar: it makes an operation easier to express, but can make the underlying process more difficult to understand.
# expression with syntactic sugar
df[df.col1 == 3.0]['col2'] = 0.0
# executed code
df.__getitem__(df.__getitem__('col1') == 3.0).__setitem__('col2', 0.0)
Both versions produce identical results and both cause the warning. The first one is certainly easier to read, but the second one reveals an important aspect of the operation: what we perform is a __setitem__ operation on the result of a __getitem__ operation. The inner __getitem__ expression indexes the row, and the subsequent __setitem__ changes the value in column 'col2' of that row. This chaining of a get operation and a set operation is what actually causes SettingWithCopyWarning.
# the 'chaining' cause of the warning in pseudo code
df.__getitem__('some location').__setitem__('some new value')
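To make the call sequence visible without involving Pandas at all, here is a minimal sketch using a hypothetical Tracer class (not part of Pandas, invented purely for illustration). It records every indexing call and, like a dataframe that returns a copy, hands back a new object from __getitem__.

```python
class Tracer:
    """Hypothetical stand-in for a dataframe that records indexing calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append(f"{self.name}.__getitem__({key!r})")
        # Return a *new* object, similar to a dataframe returning a copy.
        return Tracer(f"{self.name}[{key!r}]", self.log)

    def __setitem__(self, key, value):
        self.log.append(f"{self.name}.__setitem__({key!r}, {value!r})")


log = []
obj = Tracer("obj", log)

obj["rows"]["col2"] = 0.0  # chained: a get, then a set on the returned object
obj["col2"] = 0.0          # direct: a single set on obj itself

for entry in log:
    print(entry)
```

In the chained case the set lands on the temporary object returned by the get, not on obj. Only the direct assignment calls __setitem__ on obj itself.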
Pandas implements the warning because, when a __setitem__ is chained to a __getitem__ operation, the new values are applied to whatever object the __getitem__ returned. That object may be a copy rather than a view of the original data, in which case the original dataframe is left unchanged, and there is a chance that this is not what we want. It might be our intention, but Pandas simply does not know, so it warns us that we are potentially setting a value only on a copy of our dataframe.
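This can be checked directly. The following sketch rebuilds the example dataframe, runs the chained assignment from Example 2 and then inspects the original. Silencing warnings here is only for the sake of a clean demonstration, and which warning class is emitted may depend on your Pandas version.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, np.nan]})

with warnings.catch_warnings():
    # Silence warnings for this demonstration only.
    warnings.simplefilter("ignore")
    # Boolean indexing returns a copy, so the assignment lands on a temporary.
    df[df.col1 == 3.0]['col2'] = 0.0

# The value in the original dataframe is still missing.
still_missing = bool(df['col2'].isna().iloc[2])
print(still_missing)
```

The assignment ran without an error, yet the original df was not modified, which is exactly the kind of silent bug the warning is meant to point out.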
Solutions
A recommended solution for Example 2 is using .loc[] indexing. This is actually what the warning message proposes, and it is probably the main case the Pandas developers had in mind. With .loc[] there is no chaining, just a single __setitem__ operation on the original dataframe.
# Our code
df.loc[df.col1 == 3.0, 'col2'] = 0.0
# Code executed
df.loc.__setitem__((df.__getitem__('col1') == 3.0, 'col2'), 0.0)
print(df)
# col1 col2
# 0 1.0 4.0
# 1 2.0 5.0
# 2 3.0 0.0
If the filtering operation becomes longer and more complex, it is recommended to build a mask:
# mask
filter_mask = ((df.col1 == 3.0) & (df.col2 < 100))
df.loc[filter_mask]
col1 col2
2 3.0 0.0
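A mask can be used for assignment as well as for selection. The sketch below starts again from the original dataframe with the missing value. The mask name needs_fill and its condition are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, np.nan]})

# Build the mask once from several conditions (hypothetical example condition).
needs_fill = (df['col1'] >= 2.0) & df['col2'].isna()

# One .loc call selects rows and column and assigns in a single __setitem__.
df.loc[needs_fill, 'col2'] = 0.0

print(df['col2'].tolist())
```

Because the row selection and the column are passed to .loc[] together, the assignment is applied to df itself and no chaining takes place.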
But what about the first example? The issue is a little bit different here. If we call dropna() on the dataframe, Pandas actually returns a new dataframe, but the new dataframe keeps a trace of the old one in its private attribute _is_copy, which is usually None but becomes a weakref in this case:
df_no_na._is_copy
<weakref at 0x1122c80e8; to 'DataFrame' at 0x109586828>
There are three potential solutions that can be applied here:
I. Modifying the original dataframe
# inplace
df.dropna(inplace=True)
print(df)
col1 col2
0 1.0 4.0
1 2.0 5.0
The argument inplace can be found in many Pandas operations and controls whether a method returns a new object or modifies the existing one. The default value for inplace is False. Once set to True, the method returns None and applies the operation to the object it was called upon. In this case, the object df itself is modified: the row with missing values is dropped from the original df.
II. Returning a deep copy of the dataframe
# copy
df_no_na = df.dropna().copy()
This is possibly the easiest but also the brute force approach to the problem. Calling copy on the dataframe returns a deep copy of the original object and no strings are attached to the old one anymore.
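As a quick check, the following sketch records any warnings raised while adding the new column to the copied dataframe. Filtering by the warning's class name is an assumption made here to keep the check independent of where the class lives in your Pandas version.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, np.nan]})

# Explicit copy: the result no longer refers back to df.
df_no_na = df.dropna().copy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df_no_na['new_column'] = df_no_na['col1'] + df_no_na['col2']

# Keep only warnings whose class name mentions SettingWithCopy.
copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]

print(len(copy_warnings))
print(df_no_na['new_column'].tolist())
```

The new column exists only on df_no_na, and the original df keeps its two columns.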
III. Resetting the _is_copy property
# resetting the _is_copy property
df_no_na = df.dropna()
df_no_na._is_copy = None
Configuring SettingWithCopyWarning
It is also possible to change the severity of SettingWithCopyWarning from the default 'warn' to 'raise', which turns the warning into an exception and stops your code, or to None, which suppresses it entirely. It may be useful to adjust this to your particular use of the library: for instance, it can be a wise decision to raise an error when code is running in production, whereas for prototyping and exploration a warning may suffice.
# set level of severity to 'raise', 'warn' or None
pd.set_option('mode.chained_assignment', 'warn')
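If you only want to tighten the setting for a particular piece of code, pd.option_context can scope it to a with block. The sketch below assumes a Pandas version in which the mode.chained_assignment option exists and has an effect. The broad except clause is a deliberate simplification, because the exact exception class may differ between versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, np.nan]})
df_no_na = df.dropna()

outcome = "no error"
try:
    # Escalate only inside this block; the previous setting is restored afterwards.
    with pd.option_context('mode.chained_assignment', 'raise'):
        df_no_na['new_column'] = df_no_na['col1'] + df_no_na['col2']
except Exception as exc:  # broad on purpose: the class differs across versions
    outcome = type(exc).__name__

print(outcome)
```

On versions that emit SettingWithCopyWarning for Example 1, this should print the name of the raised exception instead of "no error". Outside the with block the global setting is untouched.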
Concluding remarks
What is most puzzling about SettingWithCopyWarning to me is that it occurs in seemingly unrelated contexts (just compare the examples above) and requires completely different types of solutions. The warning message itself certainly covers the case of chained assignment and provides a potential solution (.loc[] indexing), but it is rather misleading when it arises in the first example.
However, there is an underlying concept that SettingWithCopyWarning is all about: it warns you that it is unclear whether the operation is applied to a view of the original dataframe or to a copy of it, and it wants to let you know that you should resolve this ambiguity in your code. In both examples there is a potential risk of assigning values to the wrong object. In this sense, SettingWithCopyWarning is a reminder to write clear and explicit data transformations and, if taken seriously, will help you make your data pipelines more robust.