Introduction to Data

2.2.1. What is Data?#

In Python there are seven key pieces of data and data structures with which we will be working:

  • strings (text)

  • integers (whole numbers)

  • floats (decimal numbers)

  • Booleans (True or False)

  • lists

  • tuples (lists that cannot be changed in memory)

  • dictionaries (key : value)

In the next two chapters, we will explore each of these. This chapter focuses on data (strings, integers, floats, and Booleans), while the next chapter focuses on data structures (lists, tuples, and dictionaries).

Data are pieces of information (the singular is datum), e.g. integers, floats, and strings. Data structures are objects that make data relational, e.g. lists, tuples, and dictionaries. Before you proceed to lesson three, you MUST have a basic understanding of the ways in which you create data in Python and the ways in which you make that data relational through data structures. Start to train your brain to recognize the Python syntax for the pieces of data and data structures discussed below.

2.2.2. Strings#

A string is a sequence of characters. A good way to think about a string is as a piece of text. We can create a string in Python by choosing a variable name and using = to set it to something within quotation marks, i.e. " " or ' '.

The opening of a quotation mark indicates to Python that a string has begun and the closing of the same style of a quotation mark indicates the close of a string. It is important to use the same style of quotation mark for a string, either a double or a single.

Let’s take a look at a few examples of strings.

Examples of Strings#

In our first example of a string, we will name our first string object with the variable name “first_string”. Notice that we are using an underscore to indicate a separation of two words in our variable name. This is common practice in Python. We cannot use a space in a variable name. Another naming convention could be “firstString”, where the first word is lowercase while the start of each subsequent word is capitalized. This naming convention, while acceptable, is more common in other programming languages, such as JavaScript.

first_string = "This is a string."

Now that we have created our first object, let’s try and print it off. We can do this using the print function.

print (first_string)
This is a string.

Let’s take a look at another example. This time, however, we will use single quotation marks.

second_string = 'This is a string too.'

In the cell below, try to print off the variable “second_string”.

print ()

Let’s now take a look at a bad example of a string. In the example below, we will intentionally try to create an error message by combining two different types of quotation marks.

bad_string = "This is a bad example of a string'
  Input In [6]
    bad_string = "This is a bad example of a string'
SyntaxError: EOL while scanning string literal

This error message tells us that we have triggered a “SyntaxError”. A SyntaxError is a Python error in which we have used improper syntax, or coding language. This means that when our computer tries to execute the above code, it does not know how to interpret the line. You will frequently encounter errors like this in your journey as a programmer.

It is always important to read the error message as it will inform you about the specifics of the problem. For example, I can see that my error was triggered because of something in the input cell 6. If you were executing this same command within a Python script, you would see the specific line that triggered the error. This allows you to begin to debug or identify the source of the error and fix it. If, for some reason, you cannot understand the cause of the error, it may help to Google the error message and search for answers on forums, such as StackOverflow.
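
Returning to the quotation-mark rule that triggered this error, here is a minimal sketch of how the two styles relate. The variable names and the example sentences are invented for illustration. It shows that both styles create equal strings, and that using one style on the outside lets the other style appear inside the text.

```python
# Both quotation styles create the same kind of object.
double_quoted = "This is a string."
single_quoted = 'This is a string.'
print(double_quoted == single_quoted)  # True

# Using one style on the outside lets the other appear inside the text.
with_apostrophe = "Let's learn Python."
with_quotes = 'She said "hello" to the class.'
print(with_apostrophe)
print(with_quotes)
```

This is one reason Python accepts both styles: you can pick whichever one does not clash with the characters inside your text.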

Sometimes, it will be necessary to have a very long string that spans multiple lines within a Python script. In these instances, you can use three of the same quotation mark style (single or double) consistently to create a string over multiple lines.

long_string = '''

This is a verrry long string.

'''

Try to print off the long_string variable in the cell below.

print ()

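We can also check what kind of object we have created by passing it to the built-in type function, i.e. type(long_string). In a notebook cell, the output is “str”, which tells us that the object is a string.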
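
As a small sketch of how to check what kind of object a variable holds, the built-in type function can be called on any string. The variable name long_text and its contents are invented for illustration. Note that when you wrap type in print, Python shows the longer form, <class 'str'>, rather than just str.

```python
# A hypothetical multi-line string, created with three double quotation marks.
long_text = """This string starts on one line
and continues on a second line."""

print(long_text)
print(type(long_text))         # <class 'str'>
print(type(long_text) == str)  # True
```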

2.2.3. Working with Strings as Data#

Often when working within Python, you will not simply be creating data; you will be manipulating and changing it. Because humanists frequently work with strings, I recommend spending a good deal of time practicing and memorizing the basic ways we work with strings in Python via the built-in methods.

In order to interact with strings as pieces of data, we use methods and functions. The chief methods for interacting with strings on a basic level come standard with Python. This means that you do not need to install third-party libraries. Later in this textbook we will do more advanced things with strings using additional tools, such as regular expressions (Regex) and third-party libraries, but for now, we will simply work with the basic methods.

Let’s learn to manipulate strings now through code, but first we need to create a string. Let’s call it sentence.

sentence = "I am going to learn how to program in Python!"

Upper Method#

It is not a very clever name, but it will work for our purposes. Now, let’s try to convert the entire string into all uppercase letters. We can do this with the method .upper(). Notice that .upper() comes directly after the string, joined to it by a dot, and that there are no arguments within the (). The dot that attaches it to an object is how you can easily identify a method (as opposed to a function). We will learn more about these distinctions in the chapters on functions and classes.

print (sentence.upper())
I AM GOING TO LEARN HOW TO PROGRAM IN PYTHON!

Notice that our string is now all uppercase. We can do the same thing with the .lower() method, but this method will make everything in the string lowercase.

print (sentence.lower())
i am going to learn how to program in python!

Capitalize Method#

On the surface, these methods may appear to only be useful in niche circumstances. While these methods are useful for making strings look the way you want them to look, they have far greater utility. Imagine that you wanted to search for a name, “William”, in a string. What if the data you are examining comes from emails, text messages, etc.? William may be capitalized or not. This means that you would have to run two searches for William across a string. If, however, you lowercase the string before you search, you can simply search for “william” and you will find all hits. This is one of the things that happens on the back-end of most search engines to ensure that your search is not strictly case-sensitive. In Python, however, it is important to do this step of data cleaning yourself before running searches over strings.
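
The search problem described above can be sketched as follows. The message text is invented for illustration, and the in operator and the .count() method are not introduced in this chapter; I use them here only to show the effect of lowercasing before searching.

```python
# Hypothetical text, invented for this illustration.
message = "Dear WILLIAM, did william or William reply?"

# Searching the raw text only finds the exact lowercase spelling.
print(message.count("william"))  # 1

# Lowercasing first makes every spelling match the same search term.
cleaned = message.lower()
print(cleaned.count("william"))  # 3
print("william" in cleaned)      # True
```

The original message is left untouched, so you can still display it with its original capitalization after the search.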

Let’s explore another method, .capitalize(). This method capitalizes the first character of a string and lowercases the rest.

first_name = "william"
print (first_name.capitalize())
William

I will use this in niche circumstances, particularly when I am performing data cleaning and need to ensure that all names or proper nouns in a dataset are cleaned and well-structured.

Replace Method#

Perhaps the most useful string method is .replace(). Notice in the cells below that replace takes two mandatory arguments, or things passed between the parentheses. Each is separated by a comma. The first argument is the substring or piece of the string that you want to replace and the second argument is what you want to replace it with. Why is this so useful? If you are using Python to analyze texts, those texts will, I promise, never be well-cleaned. They may have bad encoding, characters that will throw off searches, bad OCR, multiple line breaks, hyphenated words, and the list goes on. The .replace() method allows you to quickly and effectively clean textual data so that it can be standardized.

Unlike the above methods, .replace() needs two things within the parentheses. These are known as arguments. This method requires two, and they must be in order. The first is the thing that you want to replace and the second is the thing that you want to replace it with. These are separated by a comma and both must be strings. It should, therefore, look something like this:

.replace("the thing to replace", "the thing you want to replace it with")
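
As a rough illustration of the cleaning use case described above, here is a minimal sketch that removes a line-break hyphen and a double space. The messy text is invented to stand in for something like OCR output, and "\n" is the way a line break is written inside a Python string.

```python
# Hypothetical messy text: a double space and a word hyphenated across a line break.
messy = "The  manu-\nscript was copied in 1450."

# Each call returns a new string, so the calls can be chained one after another.
cleaned = messy.replace("-\n", "").replace("  ", " ")
print(cleaned)  # The manuscript was copied in 1450.
```

The order of the arguments matters: swapping them would search for the replacement text instead of the text you want to remove.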

In the example below, let’s try and replace the period at the end of “Mattingly.”

introduction = "My name is William Mattingly."
print (introduction.replace(".", ""))
My name is William Mattingly

Excellent! Now, let’s try to print introduction again and see what happens.

print (introduction)
My name is William Mattingly.

Uh oh! Something is not right. Nothing has changed! Indeed, this is because strings are immutable objects. Immutable objects are objects that cannot be changed in memory. As we will see in the next chapter with lists, the other type of object is one that can be changed in memory. These are known as mutable objects. In order to change a string, therefore, you must recreate it in memory or create a new string object from it. Let’s try and do that below.

new_introduction = introduction.replace(".", "")
print (new_introduction)
My name is William Mattingly

Split Method#

Strings have a lot of other useful methods that we will be learning about throughout this textbook, such as the split() method which returns a list of substrings, or smaller strings, that are split by the delimiter, which is the argument of the method. The delimiter tells Python how to split the string. By default, split() will split your string at the whitespace. Let’s try and split the following string.

book_name = "Harry Potter and the Chamber of Secrets"

As with Replace, Split is a method and, therefore, we use it with a . after the variable name.

print (book_name.split())
['Harry', 'Potter', 'and', 'the', 'Chamber', 'of', 'Secrets']

Our book name is now split into individual words thanks to the split method’s default delimiter, which is whitespace. Notice, however, that the delimiter vanishes from our list of substrings. In the next chapter, we will learn about data structures, such as lists. In that section, we will learn how to grab specific indices, or sections, of the list.

Split can also take an argument that identifies where to split a string, e.g. something other than whitespace. Let’s say I were interested in grabbing the subtitle of the book name. I could split the string at " and ".

print (book_name.split(" and "))
['Harry Potter', 'the Chamber of Secrets']

As we can see, we have successfully separated the title from the subtitle. We will be working with strings a lot more as we progress through this textbook. You should at this point be familiar with what strings are, how to create them, and how to interact with them through some basic methods, such as upper(), lower(), capitalize(), replace(), and split().
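
To round out the split examples, here is a minimal sketch using a comma as the delimiter, along with the difference between the default split and an explicit single space. The record and spaced strings are invented for illustration.

```python
# Hypothetical record, invented for this illustration.
record = "Mattingly,William,2022"
parts = record.split(",")
print(parts)  # ['Mattingly', 'William', '2022']

# The default split treats any run of whitespace as a single delimiter.
spaced = "Harry   Potter"
default_parts = spaced.split()
print(default_parts)  # ['Harry', 'Potter']

# An explicit single space splits at every space, leaving empty strings behind.
space_parts = spaced.split(" ")
print(space_parts)  # ['Harry', '', '', 'Potter']
```

Notice that every item in the first list is still a string, including '2022'.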

2.2.4. Numbers (Integers and Floats)#

Numbers are represented in programming languages in several ways. The two we will deal with are integers and floats.

An integer is a number that does not contain a decimal place, i.e. 1 or 2 or 3. This can be a number of any size, such as 100,001,200 (written in Python without the commas, as 100001200). A float, on the other hand, is a number with a decimal place. So, while 1 is an integer, 1.0 is a float. Floats can also be very large or very small, but they necessarily have a decimal place, i.e. 200.0020002938. In Python, you do not need any special characters to create an integer or float object. You simply need an equal sign. In the example below, we have two objects which are created with a single equal sign. These objects are named int1 and float1, with the former being an object that corresponds to the integer 1 and the latter being an object that corresponds to the float 1.1.

Examples of Numbers#

int1 = 1
print (int1)
float1 = 1.1
print (float1)

2.2.5. Working with Numbers as Data#

Now that you understand how strings work, let’s begin exploring another type of data: numbers. Numbers in Python exist in two chief forms:

  • integers

  • floats

As noted above, integers are numbers without a decimal point, whereas floats are numbers with a decimal point. This is an important distinction that you MUST remember, especially when working with data imported from and exported to Excel.

As digital humanists, you might be thinking to yourself, “I just work with text, why should I care so much about numbers?” The answer? Numbers allow us to perform quantitative analysis. What if you want to know the number of times a specific author wrote to a colleague, or the places to which he wrote most frequently, as was the case with the Republic of Letters project at Stanford? In order to perform that kind of analysis, you MUST have a command of how numbers work in Python, how to perform basic mathematical operations on those numbers, and how to interact with them. Further, numbers are essential to understand for performing more advanced functions in Python, such as loops, explored in Chapter 03.01.

The way in which you create a number object in Python is to create an object name, use the equal sign, and type the number. If your number has a decimal, Python will automatically consider it a float. If it does not, it will automatically consider it an integer.

num1 = 1
num2 = 2
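
To see this automatic choice for yourself, here is a minimal sketch using the built-in type function. The name decimal_number is invented for illustration and does not appear elsewhere in this chapter.

```python
num1 = 1
decimal_number = 1.0  # hypothetical name, used only in this sketch

print(type(num1))              # <class 'int'>
print(type(decimal_number))    # <class 'float'>

# The two objects hold the same value but are different types of data.
print(num1 == decimal_number)  # True
```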

Throughout your DH project, you will very likely need to manipulate numbers through mathematical operations. Here is a list of the common operations:

  1. Addition +

  2. Subtraction -

  3. Multiplication *

  4. Exponentiation **

  5. Division /

  6. Modulo % #This will return the remainder, e.g. 7%2 will yield 1.

  7. Floor division // #This will return the whole number of times one number fits into another, e.g. 7//2 will yield 3.

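print (num1+num2)   # addition: 3
print (num1-num2)   # subtraction: -1
print (num1*num2)   # multiplication: 2
print (num1/num2)   # division: 0.5
print (100*8)       # multiplication with literal numbers: 800
print (num1/num2)   # division again, to compare with the next line: 0.5
print (num1//num2)  # floor division: 0
print (5/2)         # division: 2.5
print (5//2)        # floor division: 2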
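
The operators that are listed above but not shown in this cell, modulo and exponentiation, can be sketched the same way. The variable names here are invented for illustration.

```python
remainder = 7 % 2      # modulo: what is left over after dividing 7 by 2
floor_result = 7 // 2  # floor division: how many whole times 2 fits into 7
true_result = 7 / 2    # division: always returns a float
power = 2 ** 3         # exponentiation: 2 * 2 * 2

print(remainder)     # 1
print(floor_result)  # 3
print(true_result)   # 3.5
print(power)         # 8

# The floor and the remainder fit back together into the original number.
print(floor_result * 2 + remainder)  # 7
```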

In addition to this, you will very often need comparison operators (equal to, less than, etc.), especially in loops. Here is a list of those:

  1. Equal to (==)

  2. Greater than (>)

  3. Less than (<)

  4. Less than or equal to (<=)

  5. Greater than or equal to (>=)

  6. Not equal to (!=)

We will address these in greater detail in a later chapter as they are particularly useful when you work with loops.
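
Until then, here is a minimal sketch of what these operators return. Each comparison produces True or False, which is the Boolean type of data covered in the next section. The name is_smaller is invented for illustration.

```python
num1 = 1
num2 = 2

print(num1 == num2)  # False
print(num1 != num2)  # True
print(num1 < num2)   # True
print(num1 >= num2)  # False

# The result of a comparison can be stored like any other piece of data.
is_smaller = num1 < num2
print(is_smaller)        # True
print(type(is_smaller))  # <class 'bool'>
```

Be careful to use two equal signs for a comparison. A single equal sign creates or reassigns a variable.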

2.2.6. Booleans#

The term boolean comes from Boolean algebra, which is a type of mathematics that works in binary logic. Binary is the basis for all computers, save for the more nascent quantum computers. Binary is 0 or 1; off or on; true or false. A boolean object in programming languages is either True or False. True is 1, while False is 0. In Python, we write these values with a capitalized T or F, as True or False. Let’s make one such object now.

Examples of Booleans#

bool1 = True
print (bool1)
bool2 = True
bool3 = False
print (bool3)
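
The claim that True is 1 and False is 0 can be checked directly. This is a minimal sketch that reuses the bool1 and bool3 names from the cell above and the built-in int function, which is not otherwise covered in this chapter.

```python
bool1 = True
bool3 = False

print(bool1 == 1)     # True
print(bool3 == 0)     # True
print(int(bool1))     # 1
print(int(bool3))     # 0
print(bool1 + bool1)  # 2, because each True counts as 1
print(type(bool1))    # <class 'bool'>
```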

2.2.7. Conclusion#

This chapter has introduced you to some of the essential types of data: strings, integers, floats, and booleans. It has also introduced you to some of the key methods and operations that you can perform on strings and numbers. Before moving on to the next chapter, I recommend spending some time testing out these methods on your own data. Try to manipulate an input text to locate and retrieve specific information.

If you haven’t yet installed Python, that’s okay! There are free online tools for running Python code, including the one I have available at https://pythonhumanities.com/lesson-02-storing-data-in-python/. Once you feel comfortable with types of data and how to create and interact with them in Python, feel free to continue into the next chapter as we explore data structures.

2.2.8. Challenge Questions#

What would be the output of the command below?

type('What type of data is this?')

What about this?

print("Hello, world!')
  Input In [19]
    print("Hello, world!')
SyntaxError: EOL while scanning string literal

And what about this?

person = "John"
person = person.lower()
person2 = person
person = person.upper()

Here is why the output looks the way it does. If you did not get this right, let’s talk about why. Remember, programming is a sequential set of operations.

person = "John"

establishes that the variable is a string and that it is “John”


person = person.lower()

this will lowercase the string so at this stage, it would be “john”


person2 = person

this creates person2 as a variable and points it to the same string that person holds at this stage, “john”


person = person.upper()

this creates a new uppercase string, “JOHN”, and points the person variable to it. The variable person2, however, remains the same and is still “john”
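
If you would like to confirm the walkthrough above, here is a minimal sketch that repeats the same four lines and then prints both variables. The two print calls are my addition; they are not part of the original challenge.

```python
person = "John"
person = person.lower()  # person is now "john"
person2 = person         # person2 refers to the same string, "john"
person = person.upper()  # person now refers to a new string, "JOHN"

print(person)   # JOHN
print(person2)  # john
```

Because strings are immutable, .upper() could not alter the string that person2 refers to. It could only build a new one.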

2.2.9. Quiz#