Introduction to Data

2.1. Introduction to Data#

2.1.1. What is Data?#

In Python there are seven key pieces of data and data structures with which we will be working:

strings (text)
integers (whole numbers)
floats (decimal numbers)
Booleans (True or False)
lists
tuples (lists that cannot be changed in memory)
dictionaries (key : value)

In this chapter, we will explore each of these. This section focuses on data (strings, integers, floats, and Booleans), while the next section focuses on data structures (lists, tuples, and dictionaries).

Data are pieces of information (the singular is datum), e.g., integers, floats, and strings. Data structures are objects that make data relational, e.g., lists, tuples, and dictionaries. Start to train your brain to recognize the Python syntax for the pieces of data and data structures discussed below.

For your convenience, here is a cheatsheet for helping you refer back to these types of data.

Cheatsheet for Data Types and Data Structures in Python

Name	Type	Category	Example	Trick	Mutable
string	`str`	text	`"William"`	quotes	no
integer	`int`	number	`1`	no decimal	no
float	`float`	number	`1.1`	with decimal	no
Boolean	`bool`	boolean	`True`	True or False	no
list	`list`	sequence	`[1, 1.1, "one"]`	[]	yes
tuple	`tuple`	sequence	`(1, 1.1, "one")`	()	no
set	`set`	collection (unordered)	`{1, 1.1, "one"}`	{} - unique	yes
dictionary	`dict`	mapping	`{"name": "tom"}`	{key:value}	yes
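
If you are ever unsure which type of data you are looking at, Python's built-in type() function will tell you. The short sketch below runs it on the examples from the cheatsheet; the variable names are hypothetical and chosen only for this illustration.

```python
# type() reports the type of whatever is placed between the parentheses.
# The variable names below are hypothetical, chosen only for this sketch.
a_string = "William"
a_list = [1, 1.1, "one"]
a_tuple = (1, 1.1, "one")
a_set = {1, 1.1, "one"}
a_dictionary = {"name": "tom"}

print(type(a_string))       # <class 'str'>
print(type(1))              # <class 'int'>
print(type(1.1))            # <class 'float'>
print(type(True))           # <class 'bool'>
print(type(a_list))         # <class 'list'>
print(type(a_tuple))        # <class 'tuple'>
print(type(a_set))          # <class 'set'>
print(type(a_dictionary))   # <class 'dict'>
```

The name printed between the quotation marks matches the Type column of the cheatsheet.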

2.1.2. Strings#

A string is a sequence of characters. A good way to think about a string is as text. We can create a string in Python by choosing a unique variable name and using = to set it to something within quotation marks, i.e. " " or ' '.

The opening of a quotation mark indicates to Python that a string has begun and the closing of the same style of a quotation mark indicates the close of a string. It is important to use the same style of quotation mark for a string, either a double or a single.

Let’s take a look at a few examples of strings.

2.1.2.1. Examples of Strings#

In our first example of a string, we will name our first string object with the variable name first_string. Notice that we are using an underscore (_) to indicate a separation of two words in our variable name. This is common practice in Python. We cannot use a space in a variable name. Another naming convention could be firstString, where the first word is lowercase while the start of each subsequent word is capitalized. This naming convention, while acceptable, is more common in other programming languages, such as JavaScript.

first_string = "This is a string."

Now that we have created our first object, let’s try and print it off. We can do this using the print function.

print(first_string)

This is a string.
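
If you would like to check whether a name is allowed before you use it, one option is the string method .isidentifier(), which we have not otherwise covered in this chapter. This is only a small sketch, and the three variable names that hold the results are hypothetical.

```python
# .isidentifier() answers True if the text could be used as a variable name.
snake_case_ok = "first_string".isidentifier()
camel_case_ok = "firstString".isidentifier()
with_space_ok = "first string".isidentifier()

print(snake_case_ok)   # True
print(camel_case_ok)   # True
print(with_space_ok)   # False, because of the space
```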

Let’s take a look at another example. This time, however, we will use single quotation marks (').

second_string = 'This is a string too.'
print(second_string)

This is a string too.
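
One practical reason to have two styles of quotation mark is that a string can contain one style as long as it is opened and closed with the other. The sketch below illustrates this with two hypothetical variable names and example sentences of my own.

```python
# Double quotes on the outside let us use an apostrophe inside the string.
apostrophe_string = "Let's learn Python."
print(apostrophe_string)

# Single quotes on the outside let us use double quotes inside the string.
quoted_string = 'She said, "Hello."'
print(quoted_string)
```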

In the Trinket application below, create a string and print it off.

from IPython.display import IFrame
IFrame('https://trinket.io/embed/python3/3fe4c8f3f4', 700, 500)

Let’s now take a look at a bad example of a string. In the example below, we will intentionally try to create an error message by combining two different types of quotation marks.

bad_string = "This is a bad example of a string'

  Input In [6]
    bad_string = "This is a bad example of a string'
                                                    ^
SyntaxError: EOL while scanning string literal

This error message tells us that we have triggered a “SyntaxError”. A SyntaxError is a Python error in which we have used improper syntax, or coding language. This means that when our computer tries to execute the above code, it does not know how to interpret the line. You will frequently encounter errors like this in your journey as a programmer.

It is always important to read the error message as it will inform you about the specifics of the problem. For example, I can see that my error was triggered because of something in the input cell 6. If you were executing this same command within a Python script, you would see the specific line that triggered the error. This allows you to begin to debug or identify the source of the error and fix it. If, for some reason, you cannot understand the cause of the error, it may help to Google the error message and search for answers on forums, such as StackOverflow.

Sometimes, it will be necessary to have a very long string that spans multiple lines within a Python script. In these instances, you can use three of the same quotation mark style (single or double) consistently to create a string over multiple lines.

long_string = '''

This is a verrry long string.

'''

print(long_string)

This is a verrry long string.
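
It can help to see what a multi-line string actually stores. In the sketch below I use a slightly shorter version of the string above, together with the built-in repr() function and the .strip() method, neither of which is covered elsewhere in this chapter; stripped_string is a hypothetical name.

```python
long_string = '''
This is a verrry long string.
'''

# repr() shows the string as Python stores it; \n marks each line break.
print(repr(long_string))   # '\nThis is a verrry long string.\n'

# .strip() returns a new string without the whitespace at either end.
stripped_string = long_string.strip()
print(repr(stripped_string))   # 'This is a verrry long string.'
```

The line breaks you type between the triple quotation marks become part of the string itself.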

2.1.3. Working with Strings as Data#

Often when working within Python, you will not be simply creating data, you will be manipulating it and changing it. Because humanists frequently work with strings, I recommend spending a good deal of time practicing and memorizing the basic ways we work with strings in Python via the built-in methods.

In order to interact with strings as pieces of data, we use methods and functions. The chief methods for interacting with strings on a basic level come standard with Python. This means that you do not need to install third-party libraries. Later in this textbook we will do more advanced things with strings using other tools, such as regular expressions (Regex) and third-party libraries, but for now, we will simply work with the basic methods.

Let’s learn to manipulate strings now through code, but first we need to create a string. Let’s call it sentence.

sentence = "I am going to learn how to program in Python!"

2.1.3.1. Upper Method#

It is not a very clever name, but it will work for our purposes. Now, let’s try to convert the entire string into all uppercase letters. We can do this with the method .upper(). Notice that .upper() comes after the string, joined to it by a period, and that there are no arguments within the (). The period attaching the name to an object is how you can easily identify a method (as opposed to a function, such as print(), which stands on its own). We will learn more about these distinctions in the chapters on functions and classes.

print(sentence.upper())

I AM GOING TO LEARN HOW TO PROGRAM IN PYTHON!

2.1.3.2. Lower Method#

Notice that our string is now all uppercase. The .lower() method works the same way, but it makes everything in the string lowercase.

print(sentence.lower())

i am going to learn how to program in python!

2.1.3.3. Capitalize Method#

On the surface, these methods may appear to only be useful in niche circumstances. While these methods are useful for making strings look the way you want them to look, they have far greater utility. Imagine if you wanted to search for a name, “William”, in a string. What if the data you are examining is from emails, text messages, etc.? William may be capitalized or not. This means that you would have to run two searches for William across a string. If, however, you lowercase the string before you search, you can simply search for “william” and you will find all hits. This is one of the things that happens on the back-end of most search engines to ensure that your search is not strictly case-sensitive. In Python, however, it is important to do this step of data cleaning before running searches over strings.
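
Here is a small sketch of that idea. The message is an invented example, and the .count() method, which counts how many times a substring appears, is not covered elsewhere in this chapter.

```python
# An invented message in which the name appears with two different spellings.
message = "I emailed william on Monday, and William replied on Tuesday."

# A case-sensitive search only finds the capitalized spelling.
exact_hits = message.count("William")
print(exact_hits)     # 1

# Lowercasing first lets a single search find both spellings.
lowered_hits = message.lower().count("william")
print(lowered_hits)   # 2
```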

Let’s explore another method, .capitalize(). This method capitalizes the first character of a string and makes the rest of the string lowercase.

first_name = "william"

print(first_name.capitalize())

William

I will use this in niche circumstances, particularly when I am performing data cleaning and need to ensure that all names or proper nouns in a dataset are cleaned and well-structured.
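
One thing to watch for when cleaning names is that .capitalize() only capitalizes the first character of the whole string. The sketch below compares it with .title(), a related built-in method not discussed elsewhere in this chapter, using a hypothetical variable called full_name.

```python
full_name = "william mattingly"

# .capitalize() capitalizes only the first character of the string.
capitalized_name = full_name.capitalize()
print(capitalized_name)   # William mattingly

# .title() capitalizes the first letter of every word.
title_name = full_name.title()
print(title_name)         # William Mattingly

# .capitalize() also lowercases everything after the first character.
print("wILLIAM".capitalize())   # William
```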

2.1.3.4. Replace Method#

Perhaps the most useful string method is .replace(). Notice in the cells below that replace takes two mandatory arguments, or things passed between the parentheses. Each is separated by a comma. The first argument is the substring or piece of the string that you want to replace and the second argument is what you want to replace it with. Why is this so useful? If you are using Python to analyze texts, those texts will, I promise, never be well-cleaned. They may have bad encoding, characters that will throw off searches, bad OCR, multiple line breaks, hyphenated characters, the list goes on. The .replace() method allows you to quickly and effectively clean textual data so that it can be standardized.

Unlike the above methods, .replace() needs both of its arguments, and they must be in order. The two arguments are separated by a comma and both must be strings. It should, therefore, look something like this:

.replace("the thing to replace", "the thing you want to replace it with")

In the example below, let’s try and replace the period at the end of “Mattingly.”

introduction = "My name is William Mattingly."

print(introduction.replace(".", ""))

My name is William Mattingly

Excellent! Now, let’s try to print introduction again and see what happens.

print(introduction)

My name is William Mattingly.

Uh oh! Something is not right. Nothing has changed! Indeed, this is because strings are immutable objects. Immutable objects are objects that cannot be changed in memory. As we will see in the next chapter with lists, the other type of object is one that can be changed in memory. These are known as mutable objects. In order to change a string, therefore, you must recreate it in memory or create a new string object from it. Let’s try and do that below.

new_introduction = introduction.replace(".", "")

print(new_introduction)

My name is William Mattingly
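
Because .replace() returns a new string, you can also attach a second .replace() directly to the first. The sketch below cleans an invented snippet of text with a made-up OCR error ("Tbe" for "The") and a doubled line break; raw_text and cleaned_text are hypothetical names.

```python
# An invented snippet with an OCR error and a doubled line break.
raw_text = "Tbe letter was sent to Rome.\n\nTbe reply came late."

# Each .replace() returns a new string, so the calls can be chained.
cleaned_text = raw_text.replace("Tbe", "The").replace("\n\n", "\n")

print(cleaned_text)
# The letter was sent to Rome.
# The reply came late.

# The original string is unchanged because strings are immutable.
print("Tbe" in raw_text)   # True
```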

2.1.3.5. Split Method#

Strings have a lot of other useful methods that we will be learning about throughout this textbook. One of these is the split() method, which returns a list of substrings, or smaller strings. The string is split at the delimiter, which is the argument of the method. The delimiter tells Python how to split the string. By default, split() will split your string at the whitespace. Let’s try and split the following string.

book_name = "Harry Potter and the Chamber of Secrets"

As with Replace, Split is a method and, therefore, we use it with a . after the variable name.

print(book_name.split())

['Harry', 'Potter', 'and', 'the', 'Chamber', 'of', 'Secrets']

Our book name is now split into individual words thanks to the split method’s default delimiter, which is whitespace. Notice, however, that the delimiter vanishes from our list of substrings. In the next chapter, we will learn about data structures, such as lists. There, we will learn how to grab specific indices, or positions, of the list.

Split can also take an argument that identifies where to split a string, e.g. something other than whitespace. Let’s say I were interested in grabbing the subtitle of the book name. I could split the string at " and ".

print(book_name.split(" and "))

['Harry Potter', 'the Chamber of Secrets']

As we can see, we have successfully separated the title from the subtitle. We will be working with strings a lot more as we progress through this textbook. You should at this point be familiar with what strings are, how to create them, and how to interact with them through some basic methods, such as upper(), lower(), capitalize(), replace(), and split().
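
A common use of a custom delimiter is splitting a line of comma-separated data. The sketch below is only an illustration; the record string and the names record and fields are hypothetical.

```python
# A single line of comma-separated data: title, author, and year.
record = "Harry Potter and the Chamber of Secrets,J. K. Rowling,1998"

# Splitting at the comma keeps each field whole, including its spaces.
fields = record.split(",")
print(fields)
# ['Harry Potter and the Chamber of Secrets', 'J. K. Rowling', '1998']
```

Notice that the year comes back as the string '1998' and not as a number, because split() always returns a list of strings.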

2.1.4. Numbers (Integers and Floats)#

Numbers are represented in programming languages in several ways. The two we will deal with are integers and floats.

An integer is a number that does not contain a decimal place, i.e. 1 or 2 or 3. This can be a number of any size, such as 100,001,200 (written in Python without the commas, as 100001200). A float, on the other hand, is a number with a decimal place. So, while 1 is an integer, 1.0 is a float. Floats can also be very large or very small, but they necessarily have a decimal place, i.e. 200.0020002938, and they are stored with limited precision. In Python, you do not need any special characters to create an integer or float object. You simply need an equal sign. In the example below, we have two objects which are created with a single equal sign. These objects are named int1 and float1, with the former being an object that corresponds to the integer 1 and the latter being an object that corresponds to the float 1.1.

2.1.4.1. Examples of Numbers#

int1 = 1

print(int1)

float1 = 1.1

print(float1)

1.1
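
To see the difference between the two kinds of number more concretely, here is a short sketch. The variable names are hypothetical, and the ** operator (raising a number to a power) is used only to build a very large integer.

```python
whole = 1
decimal = 1.0

print(type(whole))        # <class 'int'>
print(type(decimal))      # <class 'float'>
print(whole == decimal)   # True: the same value, but different types

# Integers in Python can grow as large as needed.
big_integer = 2 ** 100
print(big_integer)        # 1267650600228229401496703205376

# Floats are stored with limited precision, so small rounding errors appear.
float_sum = 0.1 + 0.2
print(float_sum)          # 0.30000000000000004
```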

2.1.5. Working with Numbers as Data#

Now that you understand how strings work, let’s begin exploring another type of data: numbers. Numbers in Python exist in two chief forms:

integers
floats

As noted above, integers are numbers without a decimal point, whereas floats are numbers with a decimal point. This is an important distinction that you should remember, especially when working with data imported from and exported to Excel.

As digital humanists, you might be thinking to yourself, “I just work with text, why should I care so much about numbers?” The answer? Numbers allow us to perform quantitative analysis. What if you wished to know the number of times a specific author wrote to a colleague, or the places to which he wrote most frequently, as was the case with the Republic of Letters project at Stanford?

To perform that kind of analysis, you need a command of how numbers work in Python, how to perform basic mathematical functions on those numbers, and how to interact with them. Further, numbers are essential to understand for performing more advanced functions in Python, such as Loops, explored in Chapter 3.

Throughout your DH project, you will very likely need to manipulate numbers through mathematical operations. Here is a cheatsheet of the common operations:

Cheatsheet for Mathematical Operations

operation	code	description	example	result
addition	`+`	adds numbers together	`1+1`	2
subtraction	`-`	subtracts one number from another	`1-1`	0
multiplication	`*`	multiplies two numbers	`1*2`	2
exponentiation	`**`	raises one number to the power of another	`2**2`	4
division	`/`	divides one number by another (the result is always a float)	`2/2`	1.0
modulo	`%`	remainder left over after division	`7%2`	1
floor division	`//`	number of whole times one number fits into another	`7//2`	3

The way in which you create a number object in Python is to create an object name, use the equal sign, and type the number. If your number has a decimal, Python will automatically consider it a float. If it does not, it will automatically consider it an integer. Let’s create an integer and a float.

an_integer = 1
print(an_integer)

a_float = 1.1
print(a_float)

1.1

In the Trinket application below, try to create a float and an integer and print each off.

IFrame('https://trinket.io/embed/python3/3fe4c8f3f4', 700, 500)

Now that you understand how to create a number in memory, let’s try to add two numbers together. In the example below, we will add 1 and 1 together.

print(1+1)

In the Trinket application above, try to perform the other mathematical operations.
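
If you would like something to compare your results against, here is a minimal sketch of the other operations using the numbers 7 and 2. The variable names are hypothetical.

```python
difference = 7 - 2       # subtraction
product = 7 * 2          # multiplication
power = 7 ** 2           # exponentiation: 7 raised to the power of 2
quotient = 7 / 2         # division always returns a float
whole_times = 7 // 2     # floor division: 2 fits into 7 three whole times
remainder = 7 % 2        # modulo: what is left over after floor division

print(difference)    # 5
print(product)       # 14
print(power)         # 49
print(quotient)      # 3.5
print(whole_times)   # 3
print(remainder)     # 1
print(2 / 2)         # 1.0, a float even though it divides evenly
```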

2.1.6. Booleans#

The term boolean comes from Boolean algebra, which is a type of mathematics that works in binary logic. Binary is the basis for all computers, save for the more nascent quantum computers. Binary is 0 or 1; off or on; true or false. A boolean object in programming languages is either True or False. True is 1, while False is 0. In Python we can express these concepts with capitalized T or F in True or False. Let’s make one such object now.

2.1.6.1. Examples of Booleans#

bool1 = True

print(bool1)

True
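
You will rarely type True or False by hand. More often, Python produces a Boolean for you when you ask it a question, such as a comparison. The sketch below shows a few such cases, with hypothetical variable names; the in keyword, which checks whether one string appears inside another, is not covered elsewhere in this chapter.

```python
# True and False behave like the numbers 1 and 0.
print(True == 1)     # True
print(False == 0)    # True
print(True + True)   # 2

# A comparison produces a Boolean.
is_greater = 2 > 1
print(is_greater)         # True
print(type(is_greater))   # <class 'bool'>

# So does checking whether one string appears inside another.
has_name = "William" in "My name is William Mattingly."
print(has_name)           # True
```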

2.1.7. Conclusion#

This chapter has introduced you to some of the essential types of data: strings, integers, floats, and Booleans. It has also introduced you to some of the key methods and operations that you can perform on strings and numbers. Before moving on to the next chapter, I recommend spending some time testing out these methods on your own data. Try to manipulate an input text to locate and retrieve specific information.

2.1.8. Challenge Questions#

What would be the output of the command below?

type('What type of data is this?')

What about this?

print("Hello, world!')

And what about this?

person = "John"
person = person.lower()
person2 = person
person = person.upper()
print(person2)

Here is why the output of the last example looks the way it does.

If you did not get this right, let’s talk about why. Remember, programming is a sequential set of operations.

Number	Line	Meaning
1	person = "John"	creates the variable person and assigns it the string "John"
2	person = person.lower()	creates a lowercased copy of the string and reassigns person to it, so at this stage person is "john"
3	person2 = person	creates person2 as a variable and points it to the same string that person currently holds, "john"
4	person = person.upper()	creates an uppercased copy, "JOHN", and reassigns person to it. The variable person2, however, still holds "john"
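
You can confirm this sequence for yourself by printing both variables at each stage. The sketch below repeats the four lines from the question and only adds print statements and comments.

```python
person = "John"
person = person.lower()
print(person)    # john

person2 = person
print(person2)   # john

# .upper() builds a new string; only person is pointed at it.
person = person.upper()
print(person)    # JOHN
print(person2)   # john
```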