I’ve been exploring how to optimize my code and ran across
.at method. Per the documentation
Fast label-based scalar accessor
Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.
So I ran some samples:
import pandas as pd import numpy as np from string import letters, lowercase, uppercase lt = list(letters) lc = list(lowercase) uc = list(uppercase) def gdf(rows, cols, seed=None): """rows and cols are what you'd pass to pd.MultiIndex.from_product()""" gmi = pd.MultiIndex.from_product df = pd.DataFrame(index=gmi(rows), columns=gmi(cols)) np.random.seed(seed) df.iloc[:, :] = np.random.rand(*df.shape) return df seed = [3, 1415] df = gdf([lc, uc], [lc, uc], seed) print df.head().T.head().T
df looks like:
a A B C D E a A 0.444939 0.407554 0.460148 0.465239 0.462691 B 0.032746 0.485650 0.503892 0.351520 0.061569 C 0.777350 0.047677 0.250667 0.602878 0.570528 D 0.927783 0.653868 0.381103 0.959544 0.033253 E 0.191985 0.304597 0.195106 0.370921 0.631576
.loc and ensure I get the same thing
print "using .loc", df.loc[('a', 'A'), ('c', 'C')] print "using .at ", df.at[('a', 'A'), ('c', 'C')] using .loc 0.37374090276 using .at 0.37374090276
Test speed using
%%timeit df.loc[('a', 'A'), ('c', 'C')] 10000 loops, best of 3: 180 µs per loop
Test speed using
%%timeit df.at[('a', 'A'), ('c', 'C')] The slowest run took 6.11 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 8 µs per loop
This looks to be a huge speed increase. Even at the caching stage
6.11 * 8 is a lot faster than
What are the limitations of
.at? I’m motivated to use it. The documentation says it’s similar to
.loc but it doesn’t behave similarly. Example:
# small df sdf = gdf([lc[:2]], [uc[:2]], seed) print sdf.loc[:, :] A B a 0.444939 0.407554 b 0.460148 0.465239
print sdf.at[:, :] results in
TypeError: unhashable type
So obviously not the same even if the intent is to be similar.
That said, who can provide guidance on what can and cannot be done with the
df.get_value is deprecated as of version 0.21.0. Using
df.iat is the recommended method going forward.
df.at can only access a single value at a time.
df.loc can select multiple rows and/or columns.
Note that there is also
df.get_value, which may be even quicker at accessing single values:
In : %timeit df.loc[('a', 'A'), ('c', 'C')] 10000 loops, best of 3: 187 µs per loop In : %timeit df.at[('a', 'A'), ('c', 'C')] 100000 loops, best of 3: 8.33 µs per loop In : %timeit df.get_value(('a', 'A'), ('c', 'C')) 100000 loops, best of 3: 3.62 µs per loop
As you asked about the limitations of
.at, here is one thing I recently ran into (using pandas 0.22). Let’s use the example from the documentation:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C']) df2 = df.copy() A B C 4 0 2 3 5 0 4 1 6 10 20 30
If I now do
df.at[4, 'B'] = 100
the result looks as expected
A B C 4 0 100 3 5 0 4 1 6 10 20 30
However, when I try to do
df.at[4, 'C'] = 10.05
it seems that
.at tries to conserve the datatype (here:
A B C 4 0 100 10 5 0 4 1 6 10 20 30
That seems to be a difference to
df2.loc[4, 'C'] = 10.05
yields the desired
A B C 4 0 2 10.05 5 0 4 1.00 6 10 20 30.00
The risky thing in the example above is that it happens silently (the conversion from
int). When one tries the same with strings it will throw an error:
df.at[5, 'A'] = 'a_string'
ValueError: invalid literal for int() with base 10: ‘a_string’
It will work, however, if one uses a string on which
int() actually works as noted by @n1k31t4 in the comments, e.g.
df.at[5, 'A'] = '123' A B C 4 0 2 3 5 123 4 1 6 10 20 30
.at is an optimized data access method compared to
.loc of a data frame selects all the elements located by indexed_rows and labeled_columns as given in its argument. Insetad,
.at selects particular elemnt of a data frame positioned at the given indexed_row and labeled_column.
.at takes one row and one column as input argument, whereas
.loc may take multiple rows and columns. Oputput using
.at is a single element and using
.loc maybe a Series or a DataFrame.