New pandas users often ask how to iterate over the rows of a DataFrame. Often this question comes up right away, once they have loaded some data into a DataFrame and want to do something useful with it. The natural way for most programmers to think of what to do next is to build a loop. They may not understand the "correct" way to work with DataFrames yet, but even experienced pandas and NumPy developers will consider iterating over the rows of a DataFrame to solve a problem. Instead of trying to find the one right answer about iteration, it makes more sense to understand the issues involved and know when to choose the best solution.

In fact, one of the most popular pandas questions on Stack Overflow asks how to iterate over DataFrame rows. It also turns out that question has the most copied answer with a code block on the entire site. The Stack Overflow developers say thousands of people view the answer weekly and copy it to solve their problem. Obviously people want to iterate over DataFrame rows!

So plenty of people are iterating over DataFrame rows using the top solution. Other answers to the question (especially the second-highest-rated answer) do a fairly good job of giving other options, but the entire list of 26 (and counting!) answers is extremely confusing. Instead of asking how to iterate over DataFrame rows, it makes more sense to understand the available options, along with their advantages and disadvantages, and then choose the one that makes sense for you. In some cases, the top-voted answer for iteration might even be the best choice!

Iterating over a DataFrame is not automatically the wrong way to solve a problem, but in most cases what beginners try to do with iteration is better done another way. Still, no one should feel bad about writing a first solution that uses iteration instead of other (perhaps better) approaches. That's often the best way to learn: think of a first solution as the first draft of an essay that you can improve with some editing.
One common reason to iterate is simply to print the contents of a DataFrame. First off, let's all agree that this is not a good way to look at the content of a DataFrame. The standard rendering of a DataFrame, whether produced by print, shown with display in a Jupyter notebook, or returned as the output of a cell, will be far better than anything printed with custom formatting.

If the DataFrame is large, only some columns and rows may be visible by default. Use head and tail to get a sense of the data. If you want to look at only a subset of a DataFrame, instead of using a loop to display just those rows, use the powerful indexing capabilities of pandas. With a little practice, you can select any combination of rows or columns to show. Start there first.
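For instance, with a small made-up DataFrame (the column and frame names here are just for illustration), head, tail, and boolean indexing replace a display loop:

```python
import pandas as pd

# a small throwaway DataFrame, just for illustration
df_demo = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Dee", "Ed"],
                        "balance": [10, -5, 250, 999, 42]})

first_two = df_demo.head(2)                   # first rows, no loop needed
last_two = df_demo.tail(2)                    # last rows
negative = df_demo[df_demo["balance"] < 0]    # boolean indexing selects rows
big = df_demo.loc[df_demo["balance"] > 100, ["name"]]  # rows and columns at once
```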
A much more common use for iteration is to perform a calculation on every row of a DataFrame that includes some logic. Before looking at the options, let's build a sample DataFrame to use. I'll do this by making some fake data (using Faker). Note that the columns have different data types (we have some strings, an integer, and dates).

from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from faker import Faker

fake = Faker()
today = datetime.now()
next_month = today + timedelta(days=30)
df = pd.DataFrame([[fake.first_name(), fake.last_name(),
                    fake.date_this_decade(), fake.date_between_dates(today, next_month),
                    fake.city(), fake.state(), fake.zipcode(), fake.random_int(-100, 1000)]
                   for r in range(100)],
                  columns=['first_name', 'last_name', 'start_date',
                           'end_date', 'city', 'state', 'zipcode', 'balance'])
df['start_date'] = pd.to_datetime(df['start_date'])  # convert to datetimes
df['end_date'] = pd.to_datetime(df['end_date'])
df.dtypes
first_name object
last_name object
start_date datetime64[ns]
end_date datetime64[ns]
city object
state object
zipcode object
balance int64
dtype: object
df.head()
first_name last_name start_date end_date city state \
0 Katherine Moody 2020-02-04 2021-06-28 Longberg Maryland
1 Sarah Merritt 2021-03-02 2021-05-30 South Maryborough Tennessee
2 Karen Hensley 2020-02-29 2021-06-23 Brentside Missouri
3 David Ferguson 2020-02-02 2021-06-14 Judithport Virginia
4 Phillip Davis 2020-07-17 2021-06-04 Louisberg Minnesota
zipcode balance
0 20496 493
1 18495 680
2 63702 427
3 66787 587
4 98616 211
Our DataFrame contains customer data, and we have a scoring function that uses multiple customer attributes to give each customer a grade from 'A' to 'F'. Any customer with a negative balance is scored an 'F' and any balance above 500 is an 'A'; beyond that, the logic depends on whether a customer is a 'legacy' customer and which state they live in.

from dataclasses import dataclass
@dataclass
class Customer:
    first_name: str
    last_name: str
    start_date: datetime
    end_date: datetime
    city: str
    state: str
    zipcode: str
    balance: int

def score_customer(customer: Customer) -> str:
    """Give a customer a credit score.
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, -5))
    'F'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'D'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 150))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 250))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 350))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 350))
    'A'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 50))
    'C'
    """
    if customer.balance < 0:
        return 'F'
    if customer.balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if customer.start_date > datetime(2020, 1, 1):
        if customer.balance < 100:
            return 'D'
        elif customer.balance < 200:
            return 'C'
        elif customer.balance < 300:
            return 'B'
        else:
            if customer.state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if customer.balance < 100:
            return 'C'
        else:
            return 'A'

import doctest
doctest.testmod()
TestResults(failed=0, attempted=8)
The first option is the one from the top answer: DataFrame.iterrows. This is a generator that returns the index for a row along with the row as a Series. If you aren't familiar with what a generator is, you can think of it as a function you can iterate over. As a result, calling next on it will yield the first element.

next(df.iterrows())
(0,
first_name Katherine
last_name Moody
start_date 2020-02-04 00:00:00
end_date 2021-06-28 00:00:00
city Longberg
state Maryland
zipcode 20496
balance 493
Name: 0, dtype: object)
Since the row is returned as a Series, and its index values match the attributes of our Customer class, we might be able to just pass it into our scoring function.

score_customer(next(df.iterrows())[1])
'A'

That works, so we can score every customer and store the results in a new column:

df['score'] = [score_customer(c[1]) for c in df.iterrows()]
For many situations, iterrows may be a perfectly acceptable solution! I won't judge you; I use it plenty of times, and I'll summarize at the end how to decide among the possible solutions. The alternatives to iterrows covered here can be grouped into a few categories: vectorized solutions, list comprehensions over the raw column data, apply, and the other iteration methods.

The first alternative is a vectorized solution:

def vectorized_score(df):
    return np.select([df['balance'] < 0,
                      df['balance'] > 500,  # technically not needed, would fall through
                      ((df['start_date'] > datetime(2020, 1, 1)) &
                       (df['balance'] < 100)),
                      ((df['start_date'] > datetime(2020, 1, 1)) &
                       (df['balance'] >= 100) &
                       (df['balance'] < 200)),
                      ((df['start_date'] > datetime(2020, 1, 1)) &
                       (df['balance'] >= 200) &
                       (df['balance'] < 300)),
                      ((df['start_date'] > datetime(2020, 1, 1)) &
                       (df['balance'] >= 300) &
                       df['state'].isin(['Illinois', 'Indiana'])),
                      ((df['start_date'] <= datetime(2020, 1, 1)) &  # the legacy case
                       (df['balance'] < 100)),
                      ],  # conditions
                     ['F',
                      'A',
                      'D',
                      'C',
                      'B',
                      'B',
                      'C'],  # choices
                     'A')  # default score

assert (df['score'] == vectorized_score(df)).all()
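If np.select is unfamiliar, here is a minimal sketch (with made-up balances and cutoffs) showing that it evaluates the conditions in order and picks the choice for the first condition that matches, falling back to the default:

```python
import numpy as np

balances = np.array([-10, 50, 600])

grades = np.select([balances < 0, balances > 500],  # conditions, checked in order
                   ['F', 'A'],                      # choice for each condition
                   default='C')                     # used when nothing matches
```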
This version uses np.select (you can read more about it and other ways to update DataFrames in my article on using where and mask). I sort of like using np.select when you have multiple conditions like this, although it's not extremely readable. We could have also done this with more code, as vectorized updates for each step, and made it much more readable; it would probably be similar in terms of speed.

How fast is this on our DataFrame?

%timeit vectorized_score(df)
2.75 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit [score_customer(c[1]) for c in df.iterrows()]
13.5 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The next option is a list comprehension over the raw data in the DataFrame. Note that I did use a list comprehension already in our first solution, but it was along with iterrows. This time, instead of using iterrows, the data is pulled out of each column in the DataFrame directly and then iterated over; no Series is created in this case. If your function has multiple arguments, you can use zip to make tuples of the arguments, passing in the columns of your DataFrame to match the argument order. To do this, I'll need a modified scoring function, since I don't have already constructed Customer objects in my DataFrame, and creating them just to invoke the function would add another layer. I only use three attributes of the customer, so here's a simple rewrite.

def score_customer_attributes(balance: int, start_date: datetime, state: str) -> str:
    if balance < 0:
        return 'F'
    if balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if start_date > datetime(2020, 1, 1):
        if balance < 100:
            return 'D'
        elif balance < 200:
            return 'C'
        elif balance < 300:
            return 'B'
        else:
            if state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if balance < 100:
            return 'C'
        else:
            return 'A'
next(zip(df['balance'], df['start_date'], df['state']))
(493, Timestamp('2020-02-04 00:00:00'), 'Maryland')
Unpacking each tuple into the function's arguments scores every row in the DataFrame.

df['score3'] = [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
assert (df['score'] == df['score3']).all()
%timeit [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
171 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is much, much faster, mostly because no conversion of each row into a Series needs to take place.

If you'd rather not rewrite the function, you can also build a throwaway Customer object to pass in. This is a bit uglier, but still quite fast.

%timeit [score_customer(Customer(first_name='', last_name='', end_date=None, city=None, zipcode=None, balance=a[0], start_date=a[1], state=a[2])) for a in zip(df['balance'], df['start_date'], df['state'])]
254 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Another option is DataFrame.apply. Note that to apply the function to rows, you need to pass in the correct axis, since it defaults to applying the function to each column. The axis argument specifies which index you want in the object passed to your function; we want each object to be a customer row, with the columns as the index.

assert (df.apply(score_customer, axis=1) == df['score']).all()
%timeit df.apply(score_customer, axis=1)
3.57 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
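To make the axis argument concrete, here is a small sketch with a made-up frame: axis=0 (the default) passes each column to the function, while axis=1 passes each row.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

col_sums = df_demo.apply(sum, axis=0)  # function sees each column: a -> 3, b -> 30
row_sums = df_demo.apply(sum, axis=1)  # function sees each row: 11, 22
```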
This is slower than the list comprehension above, since apply has to construct a Series object for each row.

There are several iter functions available for DataFrames: iteritems, itertuples, and iterrows. DataFrames also support iteration directly, but these functions don't all iterate over the same things. Since understanding what all these methods do just from their names can be really confusing, let's review them all here.

First, iterating over the DataFrame itself with iter(df) (which calls the DataFrame.__iter__ method) iterates over the info axis, which for DataFrames is the column names, not the values.
next(iter(df)) # 'first_name'
'first_name'
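Since direct iteration yields the column names, a common idiom is list(df) to get them all, and each name can then be used to fetch the matching column (a tiny made-up frame here):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1], "b": [2]})

columns = list(df_demo)                               # iterating a DataFrame yields column names
first_values = [df_demo[c].iloc[0] for c in df_demo]  # use the names to get each column
```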
Second, iteritems iterates over the columns, returning a tuple of the column name and the column as a Series.

next(df.iteritems())
next(df.items())  # these two are equivalent
('first_name',
0 Katherine
1 Sarah
2 Karen
3 David
4 Phillip
...
95 Robert
96 Christopher
97 Kristen
98 Nicholas
99 Caroline
Name: first_name, Length: 100, dtype: object)
Third, items is the same as above; iteritems actually just invokes items. Fourth is iterrows:
next(df.iterrows())
(0,
first_name Katherine
last_name Moody
start_date 2020-02-04 00:00:00
end_date 2021-06-28 00:00:00
city Longberg
state Maryland
zipcode 20496
balance 493
score A
score3 A
Name: 0, dtype: object)
As we've already seen, iterrows iterates through the rows, but returns each one as a tuple of the index and the row, as a Series.

Finally, itertuples iterates over the rows, returning a namedtuple for each row. You can optionally change the name of the tuple and disable the index being returned.

next(df.itertuples())
Pandas(Index=0, first_name='Katherine', last_name='Moody', start_date=Timestamp('2020-02-04 00:00:00'), end_date=Timestamp('2021-06-28 00:00:00'), city='Longberg', state='Maryland', zipcode='20496', balance=493, score='A', score3='A')
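For example (using a single-row, made-up frame), the name parameter renames the tuple class and index=False drops the Index field:

```python
import pandas as pd

df_demo = pd.DataFrame({"balance": [493], "state": ["Maryland"]})

# drop the index and rename the tuple class from the default 'Pandas'
row = next(df_demo.itertuples(index=False, name="Customer"))
# row now has only the column fields, accessible with dot syntax
```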
Since we've already covered iterrows, we only need to look at itertuples. As you can see, the returned value, a namedtuple, can be used in our original function.

assert ([score_customer(t) for t in df.itertuples()] == df['score']).all()
%timeit [score_customer(t) for t in df.itertuples()]
858 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The construction of a namedtuple for each row is much faster than the construction of a Series.

There is one more important difference between iterrows and itertuples. A namedtuple can properly represent any type in a single row. In our case, we have strings, date types, and integers. A pandas Series, however, has to hold a single datatype for the entire Series. Because our datatypes were diverse enough, they were all represented as object types and ended up retaining their type, with no functionality issues for us. But this is not always the case! With more homogeneous numeric columns, the values returned by itertuples and iterrows can be slightly different between these two methods, so watch out.

dfmixed = pd.DataFrame({'integer_column': [1,2,3], 'float_column': [1.1, 2.2, 3.3]})
dfmixed.dtypes
integer_column int64
float_column float64
dtype: object
next(dfmixed.itertuples())
Pandas(Index=0, integer_column=1, float_column=1.1)
next(dfmixed.iterrows())
(0,
integer_column 1.0
float_column 1.1
Name: 0, dtype: float64)
Also note that with itertuples, if your DataFrame has columns that cannot be represented as Python variable names, you will not be able to access them using dot syntax. So if you have a column named 2b or My Column, you'll have to access those fields using positional names (i.e. the first column will be called _1). For iterrows, the row will be a Series, so you'll have to access the columns using ["2b"] or ["My Column"].
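A quick sketch of the difference, using a made-up frame with a column name that isn't a valid identifier:

```python
import pandas as pd

df_odd = pd.DataFrame({"My Column": [1, 2]})

tup = next(df_odd.itertuples())     # the field is renamed positionally to _1
from_tuple = tup._1

idx, row = next(df_odd.iterrows())  # the row is a Series, so use brackets
from_series = row["My Column"]
```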
You can also always iterate with an integer index and use the iloc indexer on the DataFrame to select any row. Of course, this is really no different from any other iteration, while also being non-idiomatic, so others reading your code will probably find it hard to read and understand. I built a naive version of this in the performance comparison code for the summary below, if you want to see it (the performance was horrible).
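For reference, the naive pattern looks roughly like this (a sketch with made-up data and simplified scoring, shown to illustrate why it's both awkward and slow, not to recommend it):

```python
import pandas as pd

df_demo = pd.DataFrame({"balance": [-5, 50, 600]})

scores = []
for i in range(len(df_demo)):
    row = df_demo.iloc[i]  # one positional lookup per row -- this is the slow part
    if row["balance"] < 0:
        scores.append("F")
    elif row["balance"] > 500:
        scores.append("A")
    else:
        scores.append("C")
```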
I hope you found this review of DataFrame iteration interesting. I know I learned a few useful things along the way.