22
loading...
This website collects cookies to deliver better user experience
DataFrame
or Series
that has a time based index. The values in the time series can be anything else that can be contained in the containers, they are just accessed using date or time values. A time series container can be manipulated in many ways in pandas, but for this article I will focus just on the basics of indexing. Knowing how indexing works first is important for data exploration and use of more advanced features.DatetimeIndex
is used to provide indexing for pandas Series
and DataFrame
s and works just like other Index
types, but provides special functionality for time series operations. We’ll cover the common functionality with other Index
types first, then talk about the basics of partial string indexing.import pandas as pd
import numpy as np
import datetime
# this is an easy way to create a DatetimeIndex
# both dates are inclusive
d_range = pd.date_range("2021-01-01", "2021-01-20")
# this creates another DatetimeIndex, 10000 minutes long
m_range = pd.date_range("2021-01-01", periods=10000, freq="T")
# daily data in a Series
daily = pd.Series(np.random.rand(len(d_range)), index=d_range)
# minute data in a DataFrame
minute = pd.DataFrame(np.random.rand(len(m_range), 1),
columns=["value"],
index=m_range)
# time boundaries not on the minute boundary, add some random jitter
mr_range = m_range + pd.Series([pd.Timedelta(microseconds=1_000_000.0 * s)
for s in np.random.rand(len(m_range))])
# minute data in a DataFrame, but at a higher resolution
minute2 = pd.DataFrame(np.random.rand(len(mr_range), 1),
columns=["value"],
index=mr_range)
daily.head()
2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
2021-01-04 0.107230
2021-01-05 0.201100
Freq: D, dtype: float64
minute.head()
value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
minute2.head()
value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
DatetimeIndex
has a resolution that indicates to what level the Index
is indexing the data. The three indices created above have distinct resolutions. This will have ramifications in how we index later on.print("daily:", daily.index.resolution)
print("minute:", minute.index.resolution)
print("randomized minute:", minute2.index.resolution)
daily: day
minute: minute
randomized minute: microsecond
Series
or DataFrame
with a DatetimeIndex
, let’s just look at some of the typical indexing functionality.DatetimeIndex
works just like other indices in pandas, but has extra functionality. (The extra functionality can be more useful and convenient, but just hold tight, those details are next). If you already understand basic indexing, you may want to skim until you get to partial string indexing. If you haven’t read my articles on indexing, you should start with the basics and go from there.datetime
-like objects for indexing, we need to match the resolution of the index.daily[pd.Timestamp("2021-01-01")]
0.29330017699861666
try:
minute[pd.Timestamp("2021-01-01 00:00:00")]
except KeyError as ke:
print(ke)
Timestamp('2021-01-01 00:00:00')
KeyError
is raised because in a DataFrame
, using a single argument to the []
operator will look for a column, not a row. We have a single column called value
in our DataFrame
, so the code above is looking for a column. Since there isn’t a column by that name, there is a KeyError
. We will use other methods for indexing rows in a DataFrame
.iloc
indexer is integer offset based, it’s pretty clear how it works, not much else to say here. It works the same for all resolutions.daily.iloc[0]
0.29330017699861666
minute.iloc[-1]
value 0.999354
Name: 2021-01-07 22:39:00, dtype: float64
minute2.iloc[4]
value 0.646703
Name: 2021-01-01 00:04:00.452614, dtype: float64
datetime
-like objects, you need to have exact matches for single indexing. It’s important to realize that when you make datetime
or pd.Timestamp
objects, all the fields you don’t specify explicitly will default to 0.jan1 = datetime.datetime(2021, 1, 1)
daily.loc[jan1]
0.29330017699861666
minute.loc[jan1] # the defaults for hour, minute, second make this work
value 0.124186
Name: 2021-01-01 00:00:00, dtype: float64
try:
# we don't have that exact time, due to the jitter
minute2.loc[jan1]
except KeyError as ke:
print("Missing in index: ", ke)
# but we do have a value on that day
# we could construct it manually to the microsecond if needed
jan1_ms = datetime.datetime(2021, 1, 1, 0, 0, 0, microsecond=minute2.index[0].microsecond)
minute2.loc[jan1_ms]
Missing in index: datetime.datetime(2021, 1, 1, 0, 0)
value 0.527961
Name: 2021-01-01 00:00:00.641049, dtype: float64
[]
) or the .iloc
indexer.daily[0:2] # first two, end is not inclusive
2021-01-01 0.293300
2021-01-02 0.921466
Freq: D, dtype: float64
minute[0:2] # same
value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
minute2[1:5:2] # every other
value
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:03:00.922019 0.509333
minute2.iloc[1:5:2] # works with the iloc indexer as well
value
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:03:00.922019 0.509333
datetime
-like objects also works. Note that the end item is inclusive, and the defaults for hours, minutes, seconds, and microseconds will set the cutoff for the randomized data on minute boundaries (in our case).daily[datetime.date(2021,1,1):datetime.date(2021, 1,3)] # end is inclusive
2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
Freq: D, dtype: float64
minute[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]
value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
minute2[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]
value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
[]
and .loc
, but not .iloc
, as expected. Remember, .iloc
is for integer offset indexing.minute2.loc[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]
value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
try:
# no! use integers with iloc
minute2.iloc[datetime.datetime(2021, 1, 1): datetime.datetime(2021, 1, 1, 0, 2, 0)]
except TypeError as te:
print(te)
cannot do positional indexing on DatetimeIndex with these indexers [2021-01-01 00:00:00] of type datetime
datetime
objects. I know we started with objects, but now you see that for interactive use and exploration, strings are very helpful. You can pass in a string that can be parsed as a full date, and it will work for indexing.daily["2021-01-04"]
0.10723013753233923
minute.loc["2021-01-01 00:03:00"]
value 0.834881
Name: 2021-01-01 00:03:00, dtype: float64
minute.loc["2021-01-01 00:03:00":"2021-01-01 00:05:00"] # end is inclusive
value
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
2021-01-01 00:05:00 0.291089
daily["2021"] # all items match (since they were all in 2021)
daily["2021-01"] # this one as well (and only in January for our data)
2021-01-01 0.293300
2021-01-02 0.921466
2021-01-03 0.040813
2021-01-04 0.107230
2021-01-05 0.201100
2021-01-06 0.534822
2021-01-07 0.070303
2021-01-08 0.413683
2021-01-09 0.316605
2021-01-10 0.438853
2021-01-11 0.258554
2021-01-12 0.473523
2021-01-13 0.497695
2021-01-14 0.250582
2021-01-15 0.861521
2021-01-16 0.589558
2021-01-17 0.574399
2021-01-18 0.951196
2021-01-19 0.967695
2021-01-20 0.082931
Freq: D, dtype: float64
DataFrame
as well.minute["2021-01-01"]
<ipython-input-67-96027d36d9fe>:1: FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like `frame[string]`, is deprecated and will be removed in a future version. Use `frame.loc[string]` instead.
minute["2021-01-01"]
value
2021-01-01 00:00:00 0.124186
2021-01-01 00:01:00 0.542545
2021-01-01 00:02:00 0.557347
2021-01-01 00:03:00 0.834881
2021-01-01 00:04:00 0.732195
... ...
2021-01-01 23:55:00 0.687931
2021-01-01 23:56:00 0.001978
2021-01-01 23:57:00 0.770587
2021-01-01 23:58:00 0.154300
2021-01-01 23:59:00 0.777973
[1440 rows x 1 columns]
[]
for DataFrame
string indexing (as we saw above, []
should be used for column access, not rows). Depending on whether the value is found in the index or not, you may get an error or a warning. Use .loc
instead so you can avoid the confusion.minute2.loc["2021-01-01"]
value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-01 23:55:00.642728 0.749619
2021-01-01 23:56:00.238864 0.053027
2021-01-01 23:57:00.168598 0.598910
2021-01-01 23:58:00.103543 0.107069
2021-01-01 23:59:00.687053 0.941584
[1440 rows x 1 columns]
minute2.loc["2021-01-01":"2021-01-02"]
value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-02 23:55:00.604411 0.987777
2021-01-02 23:56:00.134674 0.159338
2021-01-02 23:57:00.508329 0.973378
2021-01-02 23:58:00.573397 0.223098
2021-01-02 23:59:00.751779 0.685637
[2880 rows x 1 columns]
minute2.loc["2021-01-01":"2021-01-02 13:32:01"]
value
2021-01-01 00:00:00.641049 0.527961
2021-01-01 00:01:00.088244 0.142192
2021-01-01 00:02:00.976195 0.269042
2021-01-01 00:03:00.922019 0.509333
2021-01-01 00:04:00.452614 0.646703
... ...
2021-01-02 13:28:00.925951 0.969213
2021-01-02 13:29:00.037827 0.758476
2021-01-02 13:30:00.309543 0.473163
2021-01-02 13:31:00.363813 0.846199
2021-01-02 13:32:00.867343 0.007899
[2253 rows x 1 columns]
minute2
) and minute (minute
) resolution data examples. Note that every time you get a slice of the DataFrame
, the value returned is a DataFrame
. When it’s an exact match, it’s a Series
.minute2.loc["2021-01-01"] # slice - the entire day
minute2.loc["2021-01-01 00"] # slice - the first hour of the day
minute2.loc["2021-01-01 00:00"] # slice - the first minute of the day
minute2.loc["2021-01-01 00:00:00"] # slice - the first minute and second of the day
value
2021-01-01 00:00:00.641049 0.527961
print(str(minute2.index[0])) # note the string representation include the full microseconds
minute2.loc[str(minute2.index[0])] # slice - this seems incorrect to me, should return Series not DataFrame
minute2.loc[minute2.index[0]] # exact match
2021-01-01 00:00:00.641049
value 0.527961
Name: 2021-01-01 00:00:00.641049, dtype: float64
minute.loc["2021-01-01"] # slice - the entire day
minute.loc["2021-01-01 00"] # slice - the first hour of the day
minute.loc["2021-01-01 00:00"] # exact match
value 0.124186
Name: 2021-01-01 00:00:00, dtype: float64
Series
), but instead a slice match (because the return value is a DataFrame
). On the minute resolution DataFrame
it worked as I expected.asof
. Often, when you have data that is either randomized in time or may have missing values, getting the most recent value as of a certain time is preffered. You could do this yourself, but it looks little cleaner to use asof
.minute2.loc[:"2021-01-01 00:00:03"].iloc[-1]
# vs
minute2.asof("2021-01-01 00:00:03")
value 0.527961
Name: 2021-01-01 00:00:03, dtype: float64
truncate
which is sort of like slicing. You specify a value of before
or after
(or both) to indicate cutoffs for data. Unlike slicing which includes all values that partially match the date, truncate
assumes 0 for any unspecified values of the date.minute2.truncate(after="2021-01-01 00:00:03")
value
2021-01-01 00:00:00.641049 0.527961
Index
in pandas. Understanding time series slicing will allow you to quickly navigate time series data and quickly move on to more advanced time series analysis.