Time Series Data

4.3. Time Series Data

4.3.1. What is Time Series Data

Time series data is data that reflects either times or dates. In Pandas, this type of data is known as datetime. If you are working with time series data, as we shall see, there are significant reasons to ensure that Pandas understands that the data at hand is a date or a time. Doing so allows for easy manipulation and cleaning of inconsistently formatted data. Let us consider a simple example. Imagine we were given dates from one source as 01/02/2002 and from another as 01.02.2002. Both are valid date formats, but they are written differently. Imagine now you had a third dataset that wrote the date as 2 January 2002. Your task is to merge all these datasets together.

If you wanted to do that, you could write out some Python and script them into alignment, but Pandas offers the ability to do that automatically. In order to leverage that ability, however, you must tell Pandas that the data at hand is datetime data. Exactly how you do that, we will learn in this chapter.
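As a rough preview of what that looks like, here is a minimal sketch that parses the three formats mentioned above into a single datetime Series. The three Series are made up for illustration, and I am assuming that the numeric dates are month-first (MM/DD/YYYY); if your sources are day-first, the format strings would need to change.

```python
import pandas as pd

# Three hypothetical sources that each write 2 January 2002 differently.
# Assumption: the numeric formats are month-first (MM/DD/YYYY).
source_a = pd.Series(["01/02/2002"])
source_b = pd.Series(["01.02.2002"])
source_c = pd.Series(["2 January 2002"])

# An explicit format string per source removes any guessing about the layout.
dates_a = pd.to_datetime(source_a, format="%m/%d/%Y")
dates_b = pd.to_datetime(source_b, format="%m.%d.%Y")
dates_c = pd.to_datetime(source_c, format="%d %B %Y")

# Once parsed, all three hold the same datetime value and can be combined.
merged = pd.concat([dates_a, dates_b, dates_c], ignore_index=True)
print(merged)
```

Because every value is now a datetime rather than a string, the three sources can be compared, sorted, and merged without worrying about how each one was originally written.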

Time series data is important for many different aspects of industry and academia. In the financial sector, time series data allows one to understand the past performance of a stock. This is particularly useful in machine learning predictions, which need to understand the past in order to predict the future accurately. More importantly, they need to understand the past as a sequence of data. In the humanities, time series data is important for understanding historical context and, as we shall see, for plotting data temporally. Understanding how to work with time series data in Pandas is, therefore, absolutely essential.

4.3.2. About the Dataset

In this chapter, we will be working with an early version of a dataset I helped cultivate at the Bitter Aloe Project, a digital humanities project that explores apartheid violence in South Africa during the 20th century. This dataset comes from Vol. 7 of the Truth and Reconciliation Commission’s Final Report. I am not using our final, well-cleaned version of this dataset, but rather an earlier version, for one key reason: it contains problematic cells and structure. This is more reflective of real-world data, which will often come from multiple sources and need to be cleaned and structured. As such, it is good practice in this chapter to try and address some of the common problems that you will encounter with time series data.

import pandas as pd
df = pd.read_csv("../data/trc.csv")
df

	ObjectId	Last	First	Description	Place	Yr	Homeland	Province	Long	Lat	HRV	ORG
0	1	AARON	Thabo Simon	An ANCYL member who was shot and severely inju...	Bethulie	1991.0	NaN	Orange Free State	25.97552	-30.503290	shoot|injure	ANC|ANCYL|Police|SAP
1	2	ABBOTT	Montaigne	A member of the SADF who was severely injured ...	Messina	1987.0	NaN	Transvaal	30.039597	-22.351308	injure	SADF
2	3	ABRAHAM	Nzaliseko Christopher	A COSAS supporter who was kicked and beaten wi...	Mdantsane	1985.0	Ciskei	Cape of Good Hope	27.6708791	-32.958623	beat	COSAS|Police
3	4	ABRAHAMS	Achmat Fardiel	Was shot and blinded in one eye by members of ...	Athlone	1985.0	NaN	Cape of Good Hope	18.50214	-33.967220	shoot|blind	SAP
4	5	ABRAHAMS	Annalene Mildred	Was shot and injured by members of the SAP in ...	Robertson	1990.0	NaN	Cape of Good Hope	19.883611	-33.802220	shoot|injure	Police|SAP
...	...	...	...	...	...	...	...	...	...	...	...	...
20829	20888	XUZA	Mandla	Was severely injured when he was stoned by a f...	Carletonville	1991.0	NaN	Transvaal	27.397673	-26.360943	injure|stone	ANC
20830	20889	YAKA	Mbangomuni	An IFP supporter and acting induna who was sho...	Mvutshini	1993.0	KwaZulu	Natal	30.28172	-30.868900	shoot	NaN
20831	20890	YALI	Khayalethu	Was shot by members of the SAP in Lingelihle, ...	Cradock	1986.0	NaN	Cape of Good Hope	25.619176	-32.164221	shoot	SAP
20832	20891	YALO	Bikiwe	An IFP supporter whose house and possessions w...	Port Shepstone	1994.0	NaN	Natal	30.4297304	-30.752126	destroy	ANC
20833	20892	YALOLO-BOOYSEN	Geoffrey Yali	An ANC supporter and youth activist who was to...	George	1986.0	NaN	Cape of Good Hope	22.459722	-33.964440	torture|detain|torture	ANC|SAP

20834 rows × 12 columns

As we can see, we have a few different columns, most of which are relatively straightforward. In this notebook, however, I want to focus on Yr, a column that contains a single year referenced within the description. This corresponds to the year in which the violence described occurred. Notice, however, that we have a problem. Each year is displayed with a trailing .0 (1991.0 rather than 1991), which suggests that Pandas is treating the column as a float, that is, a floating-point number with a decimal place. To confirm our suspicion, let’s take a look at the data types by using the following command.

display(df.dtypes)

ObjectId         int64
Last            object
First           object
Description     object
Place           object
Yr             float64
Homeland        object
Province        object
Long            object
Lat            float64
HRV             object
ORG             object
dtype: object

Here, we can see all the different columns and their corresponding data types. Notice that Yr is float64. This confirms our suspicion. Why is this a problem? Well, if we were to try and plot the data by year (see the bar graph below), the labels in that graph would be floating-point numbers such as 1991.0. This does not look clean. We could manually adjust these years to have no decimal place, but that requires effort on a case-by-case basis. Instead, it is best practice to convert these floats either to integers or to datetime data. Both have their advantages, but if your end goal is larger data analysis on time series data (not just plotting the years), I would opt for the latter. In order to do either, however, we must clean the data to get it into the correct format.

df['Yr'].value_counts().sort_index().plot.bar(figsize=(20,5))

<AxesSubplot:>

[Figure: bar chart of the number of records per Yr value, with the years shown as floats]
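It may help to see why a column of whole years ends up as float64 in the first place. The sketch below uses two made-up Series (not the TRC data) to illustrate the usual cause: a single missing value is stored as NaN, which is itself a float, so the entire column is stored as floats.

```python
import numpy as np
import pandas as pd

# A toy column with every year present, and one with a missing value.
complete = pd.Series([1991, 1987, 1985])
with_gap = pd.Series([1991, np.nan, 1985])

# The complete column is stored as integers.
print(complete.dtype)

# NaN is a float, so the whole column is stored as float64.
print(with_gap.dtype)

# Counting the missing values is a quick way to confirm the cause.
print(with_gap.isna().sum())
```

Running `df['Yr'].isna().sum()` on the real DataFrame is a quick way to check whether the same thing is happening there.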

4.3.3. Cleaning the Data from Float to Int

Let’s first try and convert our float column into an integer column. If we execute the command below which would normally achieve this task, we get the following error.

df['Yr'] = df['Yr'].astype(int)

---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
<ipython-input-4-dc1db5c67903> in <module>
----> 1 df['Yr'] = df['Yr'].astype(int)

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
   5804         else:
   5805             # else, only a single dtype is given
-> 5806             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5807             return self._constructor(new_data).__finalize__(self, method="astype")
   5808 

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
    412 
    413     def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 414         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    415 
    416     def convert(

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    325                     applied = b.apply(f, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):
    329                 if not ignore_failures:

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
    590         values = self.values
    591 
--> 592         new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    593 
    594         new_values = maybe_coerce_values(new_values)

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\dtypes\cast.py in astype_array_safe(values, dtype, copy, errors)
   1307 
   1308     try:
-> 1309         new_values = astype_array(values, dtype, copy=copy)
   1310     except (ValueError, TypeError):
   1311         # e.g. astype_nansafe can fail on object-dtype of strings

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\dtypes\cast.py in astype_array(values, dtype, copy)
   1255 
   1256     else:
-> 1257         values = astype_nansafe(values, dtype, copy=copy)
   1258 
   1259     # in pandas we don't store numpy str dtypes, so convert to object

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1166 
   1167     elif np.issubdtype(arr.dtype, np.floating) and np.issubdtype(dtype, np.integer):
-> 1168         return astype_float_to_int_nansafe(arr, dtype, copy)
   1169 
   1170     elif is_object_dtype(arr):

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\dtypes\cast.py in astype_float_to_int_nansafe(values, dtype, copy)
   1211     """
   1212     if not np.isfinite(values).all():
-> 1213         raise IntCastingNaNError(
   1214             "Cannot convert non-finite values (NA or inf) to integer"
   1215         )

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

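At the very bottom, we see why the error was returned: “IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer”. This means that somewhere in our data, there are blank cells in the Yr column, and a missing value cannot be represented as an ordinary integer. We need to fill in these blank cells. To do that, we can use the fillna function that we met earlier in this textbook. Note that calling fillna on the whole DataFrame, as we do here, replaces the missing values in every column, not just Yr.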

df = df.fillna(0)

If we try and rerun our same command as above, you will notice we have no errors.

df['Yr'] = df['Yr'].astype(int)
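As an aside, filling the entire DataFrame with 0 is not the only route. The sketch below, on a small made-up DataFrame rather than the TRC data, shows two alternatives you might consider: filling only the Yr column, or using Pandas’ nullable integer dtype (spelled "Int64" with a capital I), which can hold whole numbers alongside missing values. Neither is the approach taken in the rest of this chapter, which continues with the 0-filled column.

```python
import numpy as np
import pandas as pd

# A toy DataFrame with gaps in more than one column.
toy = pd.DataFrame({
    "Yr": [1991.0, np.nan, 1985.0],
    "Homeland": [np.nan, "Ciskei", np.nan],
})

# Option 1: fill only the Yr column, leaving the NaN values elsewhere untouched.
filled = toy.copy()
filled["Yr"] = filled["Yr"].fillna(0).astype(int)
print(filled)

# Option 2: the nullable integer dtype keeps the gap as <NA> instead of 0.
nullable = toy["Yr"].astype("Int64")
print(nullable)
```

The first option keeps the placeholder 0 out of the text columns; the second avoids inventing a year 0 at all, at the cost of a less familiar dtype.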

Now, let’s see if it worked by displaying the data types again.

display(df.dtypes)

ObjectId         int64
Last            object
First           object
Description     object
Place           object
Yr               int32
Homeland        object
Province        object
Long            object
Lat            float64
HRV             object
ORG             object
dtype: object

Notice that Yr is now int32 (on some platforms the same command yields int64). Success! Now that we have the data in the correct format, let’s plot it out. We can plot the frequency of violence by year by using value_counts. This will go through the entire Yr column, count how many times each value occurs, and return the result as a Series of frequencies, with the most frequent value first.

df['Yr'].value_counts()

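1993    2835
1992    2648
1990    2556
1986    2056
1994    1867
1991    1793
1985    1665
1988    1015
1989     935
1987     744
1980     438
1983     352
1976     319
1984     301
1960     280
1977     128
1981     124
1982     123
1975     111
1963      88
0         84
1962      69
1978      60
1979      53
1964      37
1961      32
1965      19
1968      14
1969      14
1974      12
1966      11
1970      10
1971      10
1967       8
1972       6
1973       5
1959       3
1996       3
1998       3
1997       2
1958       1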
Name: Yr, dtype: int64

This looks great, but let’s try and plot it.

df['Yr'].value_counts().plot.bar(figsize=(20,5))

<AxesSubplot:>

[Figure: bar chart of the Yr value counts, ordered by frequency rather than by year]

What do you notice that is horribly wrong with our bar graph? If you noticed that it is not chronological, you’d be right. It would be quite odd to present our data in this format. When we are examining time series data, we (usually) need to visualize that data chronologically. We can fix this by adding sort_index(), which orders the counts by their index, here the year, rather than by frequency.

df['Yr'].value_counts().sort_index()

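0         84
1958       1
1959       3
1960     280
1961      32
1962      69
1963      88
1964      37
1965      19
1966      11
1967       8
1968      14
1969      14
1970      10
1971      10
1972       6
1973       5
1974      12
1975     111
1976     319
1977     128
1978      60
1979      53
1980     438
1981     124
1982     123
1983     352
1984     301
1985    1665
1986    2056
1987     744
1988    1015
1989     935
1990    2556
1991    1793
1992    2648
1993    2835
1994    1867
1996       3
1997       2
1998       3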
Name: Yr, dtype: int64

Notice that we have now preserved the value counts, but organized them in their correct order. We can now try plotting that data.

df['Yr'].value_counts().sort_index().plot.bar(figsize=(20,5))

<AxesSubplot:>

[Figure: bar chart of the Yr value counts in chronological order, with 0 as the first bar]
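To make the difference between the two orderings concrete, here is a small sketch on a made-up Series of years (the values are illustrative, not taken from the dataset). It also shows that value_counts returns a Pandas Series whose index holds the distinct values and whose values hold the counts.

```python
import pandas as pd

# A toy column: 1993 appears three times, 1960 twice, and the placeholder 0 once.
years = pd.Series([1993, 1960, 1993, 0, 1993, 1960])

# value_counts orders the result by frequency, most common value first.
by_frequency = years.value_counts()
print(type(by_frequency).__name__)
print(by_frequency)

# sort_index reorders the same counts by their index labels (the years).
by_year = years.value_counts().sort_index()
print(by_year)
```

The counts themselves never change; only the order of the rows does, which is why sort_index is safe to chain before plotting.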

We have a potential issue, however. That first row, 0, is throwing off our bar graph. What if I didn’t want to represent 0, or no date, in the graph? I can solve this problem a few different ways. Let’s first store the sorted counts in a new variable called val_year. Strictly speaking, this is a Pandas Series rather than a DataFrame.

val_year = df["Yr"].value_counts().sort_index()
val_year

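0         84
1958       1
1959       3
1960     280
1961      32
1962      69
1963      88
1964      37
1965      19
1966      11
1967       8
1968      14
1969      14
1970      10
1971      10
1972       6
1973       5
1974      12
1975     111
1976     319
1977     128
1978      60
1979      53
1980     438
1981     124
1982     123
1983     352
1984     301
1985    1665
1986    2056
1987     744
1988    1015
1989     935
1990    2556
1991    1793
1992    2648
1993    2835
1994    1867
1996       3
1997       2
1998       3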
Name: Yr, dtype: int64

With this new Series, I can use iloc[1:] to skip the first row (position 0, which holds the count for the year 0) and then graph the rest of the data. Notice that the 0 value is now gone.

val_year.iloc[1:].plot.bar(figsize=(20,5))

<AxesSubplot:>

[Figure: bar chart of the Yr value counts in chronological order, without the 0 bar]
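Slicing by position works here because the sorted index happens to put 0 first. A slightly more defensive sketch is to remove the placeholder by its label instead, so the result does not depend on where the 0 row sits. The small val_year below reuses the first three counts from the output above; the variable names without_placeholder and also_without are my own.

```python
import pandas as pd

# A cut-down version of val_year: the placeholder 0 plus the first two years.
val_year = pd.Series([84, 1, 3], index=[0, 1958, 1959], name="Yr")

# Drop the row labelled 0; errors="ignore" avoids a KeyError if it is absent.
without_placeholder = val_year.drop(0, errors="ignore")
print(without_placeholder)

# An equivalent approach: keep only the rows whose label is not 0.
also_without = val_year[val_year.index != 0]
print(also_without)
```

Either result can be passed to `.plot.bar(figsize=(20,5))` exactly as before.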

Although we have been able to now plot our time series data chronologically, Pandas has not seen this as a datetime type. Instead, it has viewed these years solely as integers. In order to work with the years as time series data formally, we need to convert the integers into datetime format.

4.3.4. Convert to Time Series DateTime in Pandas

Our goal here will be to create a new column that will store Yr as a datetime type. One might think that we could easily just convert everything to datetime. Normally the following command would work, but instead we get this error.

df['Dates'] = pd.to_datetime(df['Yr'], format='%Y')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
    508         try:
--> 509             values, tz = conversion.datetime_to_datetime64(arg)
    510             dta = DatetimeArray(values, dtype=tz_to_dtype(tz))

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\_libs\tslibs\conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'int'>

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-14-cc2c68a810bf> in <module>
----> 1 df['Dates'] = pd.to_datetime(df['Yr'], format='%Y')

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
    881                 result = result.tz_localize(tz)  # type: ignore[call-arg]
    882     elif isinstance(arg, ABCSeries):
--> 883         cache_array = _maybe_cache(arg, format, cache, convert_listlike)
    884         if not cache_array.empty:
    885             result = arg.map(cache_array)

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py in _maybe_cache(arg, format, cache, convert_listlike)
    193         unique_dates = unique(arg)
    194         if len(unique_dates) < len(arg):
--> 195             cache_dates = convert_listlike(unique_dates, format)
    196             cache_array = Series(cache_dates, index=unique_dates)
    197             # GH#39882 and GH#35888 in case of None and NaT we get duplicates

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    391 
    392     if format is not None:
--> 393         res = _to_datetime_with_format(
    394             arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
    395         )

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
    511             return DatetimeIndex._simple_new(dta, name=name)
    512         except (ValueError, TypeError):
--> 513             raise err
    514 
    515 

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
    498 
    499         # fallback
--> 500         res = _array_strptime_with_fallback(
    501             arg, name, tz, fmt, exact, errors, infer_datetime_format
    502         )

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py in _array_strptime_with_fallback(arg, name, tz, fmt, exact, errors, infer_datetime_format)
    434 
    435     try:
--> 436         result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors)
    437         if "%Z" in fmt or "%z" in fmt:
    438             return _return_parsed_timezone_results(result, timezones, tz, name)

c:\users\wma22\appdata\local\programs\python\python39\lib\site-packages\pandas\_libs\tslibs\strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()

ValueError: time data '0' does not match format '%Y' (match)

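Just as the NaN cells plagued us above, so too have the 0s that we filled them with: 0 is not a valid four-digit year, so it does not match %Y. Fortunately, we can fix this issue by passing the keyword argument errors="coerce", which tells Pandas to turn any value it cannot parse into NaT (Not a Time), the datetime equivalent of NaN, rather than raising an error.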

df['Dates'] = pd.to_datetime(df['Yr'], format='%Y', errors="coerce")

display(df.dtypes)

ObjectId                int64
Last                   object
First                  object
Description            object
Place                  object
Yr                      int32
Homeland               object
Province               object
Long                   object
Lat                   float64
HRV                    object
ORG                    object
Dates          datetime64[ns]
dtype: object

And like magic, we have not only created a new column, but notice that it is in datetime64[ns] format. We should also understand the other keyword argument passed here, format. The format argument takes a format string that tells Pandas how to interpret the data being passed to it. Because our integer referred to a single year, we use %Y, the code for a four-digit year. Let’s try and plot this data now to see how it looks.

df['Dates'].value_counts().sort_index().plot.bar(figsize=(20,5))

<AxesSubplot:>

[Figure: bar chart of the Dates value counts, with each bar labelled by a full date and time]
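Returning to errors="coerce" for a moment, the following sketch shows what it does to a placeholder 0, using a short made-up Series of years. It also illustrates a side effect worth knowing about: value_counts leaves missing values out by default, so the coerced rows do not show up in the counts.

```python
import pandas as pd

# A toy column of integer years, with 0 standing in for "no date".
years = pd.Series([0, 1991, 1985, 1991])

# 0 does not match the four-digit %Y pattern, so it is coerced to NaT.
dates = pd.to_datetime(years, format="%Y", errors="coerce")
print(dates)

# value_counts skips NaT by default, so only the real years are counted.
counts = dates.value_counts().sort_index()
print(counts)
```

This means the placeholder row no longer needs to be sliced off by hand once the column has been converted.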

While this data is now plotted as Pandas-structured time series data, it does not look good. Our dates are rendered in the long, full format that has both the date (in its entirety) and the time. Let’s fix this by first extracting the relevant data: in this case, the year and the counts.

new_df = df['Dates'].value_counts().sort_index()
new_df

1958-01-01       1
1959-01-01       3
1960-01-01     280
1961-01-01      32
1962-01-01      69
1963-01-01      88
1964-01-01      37
1965-01-01      19
1966-01-01      11
1967-01-01       8
1968-01-01      14
1969-01-01      14
1970-01-01      10
1971-01-01      10
1972-01-01       6
1973-01-01       5
1974-01-01      12
1975-01-01     111
1976-01-01     319
1977-01-01     128
1978-01-01      60
1979-01-01      53
1980-01-01     438
1981-01-01     124
1982-01-01     123
1983-01-01     352
1984-01-01     301
1985-01-01    1665
1986-01-01    2056
1987-01-01     744
1988-01-01    1015
1989-01-01     935
1990-01-01    2556
1991-01-01    1793
1992-01-01    2648
1993-01-01    2835
1994-01-01    1867
1996-01-01       3
1997-01-01       2
1998-01-01       3
Name: Dates, dtype: int64

Next, we need to convert that data into a new DataFrame.

new_df = pd.DataFrame(new_df)
new_df.head()

	Dates
1958-01-01	1
1959-01-01	3
1960-01-01	280
1961-01-01	32
1962-01-01	69

Now that we have that new DataFrame created, let’s fix our column name and change Dates to ViolentActs.

new_df = new_df.rename(columns={"Dates": "ViolentActs"})
new_df.head()

	ViolentActs
1958-01-01	1
1959-01-01	3
1960-01-01	280
1961-01-01	32
1962-01-01	69

With the new DataFrame, we can also fix the index so that it is strictly the year. Because Pandas knows that the index is a datetime type, we can use its year attribute to grab just the year.

new_df.index = new_df.index.year
new_df.head()

	ViolentActs
1958	1
1959	3
1960	280
1961	32
1962	69

Notice that our data is now just the year, the only piece of data in the time series data that matters to us. With that new DataFrame in the correct format, we can now plot it.

new_df.plot.bar(figsize=(20,5))

<AxesSubplot:>

[Figure: bar chart of ViolentActs per year, with each bar labelled by year only]
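The steps in this section can also be condensed into one chain. The sketch below is one possible way of doing so on a made-up datetime Series; the variable names are my own. It names the count column explicitly with rename before calling to_frame, rather than relying on the name that value_counts gives its result. That default name differs between Pandas versions (recent releases call it count), so naming it yourself should be the more portable choice.

```python
import pandas as pd

# A toy datetime column with one missing value (NaT).
dates = pd.to_datetime(pd.Series(["1991-01-01", "1985-01-01", "1991-01-01", None]))

violent_acts = (
    dates.dropna()          # remove NaT so the extracted years stay integers
    .dt.year                # pull the year out of each datetime
    .value_counts()         # count the records per year
    .sort_index()           # put the years in chronological order
    .rename("ViolentActs")  # name the Series, and therefore the column
    .to_frame()             # convert the Series into a one-column DataFrame
)
print(violent_acts)
```

The result has the same shape as new_df above: the years as the index and a single ViolentActs column, ready for `.plot.bar(figsize=(20,5))`.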

And thus we have successfully plotted our datetime data after properly formatting it in Pandas. While working with time series data in Pandas as a datetime can be a bit more complex in the beginning, it allows you to do more advanced things, as we saw above when we extracted the year with .year. As we will see in the next few chapters, there are other advantages as well.