Introduction to Data¶
2.2.1. Some Quick Notes about Terminology and Commands we will Use¶
2.2.1.1. The Print Function¶
The very first thing that every programmer learns is how to print something off using that programming language. In most tutorials, you will see the phrase “Hello, World!” used. In this textbook, let’s try something a bit different. Let’s say I wanted to print off “Hello, William”. We can do this in Python by using the print function. The print function lets us print off some piece of data. In our case, that piece of data will be a piece of text, known as a string (see below). Our text is “Hello, William”. The thing that we want to print off must be located between an opening parenthesis and a closing parenthesis. Let’s try to execute the print command in the cell below.
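A minimal sketch of that cell follows. The second call is my own addition, included to show that print accepts more than one piece of data.

```python
# The text to print goes between the opening and closing parentheses.
print("Hello, William")

# print can take several pieces of data, separated by commas.
# By default it joins them with a single space.
print("Hello,", "William")
```

Both calls display Hello, William beneath the cell.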
As we can see, the output of the code that we typed in the Jupyter cell appears beneath it. This has worked wonderfully, but what if our Python notebook needs to be more dynamic, meaning it needs to adjust based on some user input? Perhaps we need to use a name other than William based on some user-defined input. To do this, we would need to store a piece of data in memory. We can do this in Python by creating a variable that will point to an object. Before we get into how this is done, let’s first get to know objects and variables.
When we import or create data within Python, we are essentially creating an object in memory with a variable. These two words, object and variable, mean slightly different things but are often used interchangeably. We will not get into the complexities of their differences and why they exist in this textbook, but for now, view an object as something that is created by a Python script and stored in your computer’s memory so that it can be used later in a program.
Think of your computer’s memory rather like your own brain. Imagine if you needed to remember the word for “hello” in German. You may use your memory rather like a flashcard, where “hello” in English equates to “hallo” in German. In Python, we create objects in a similar way. The object would be the dictionary entry of “hello: hallo”.
To refer to this dictionary entry in memory, we need a name for it. We do this with a variable. The variable is simply the name of the item in our script that will point to that object in memory. Variables make it so that we can have an easy-to-remember name to reference, call, and change that object in memory.
Variables can be created by typing a unique word, followed by an = sign, followed by the specific data. As we will learn throughout this chapter, there are many types of data that are created differently. Let’s create our first object before we begin. This will be a string, or a piece of text. (We will learn about these in more detail below.) In my case, I want to create the object author. I want author to be associated with my name in memory. In the cell, or block of code, below, let’s do this.
author = "William Mattingly"
Excellent! We have created our first object. Now, it is time to use that object. Below, we will learn about ways we can manipulate strings, but for now, let’s simply see if that object exists in memory. We can do this with the print function.
The print function will become your best friend in Python. It is, perhaps, the function I use most commonly. The reason for this is that the print function allows you to easily debug your code, that is, to identify problems and fix them. It allows us to print off objects that are stored in memory.
To use the print function, we type the word print followed by an opening parenthesis. After the opening parenthesis, we place the object or piece of data that we want to print. After that, we close the function with the closing parenthesis. Let’s try to print off our new object author to make sure it is in memory.
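Such a cell could be as short as the sketch below. Its first line repeats the earlier assignment only so that the example runs on its own.

```python
# Recreate the object from the earlier cell so this example is self-contained.
author = "William Mattingly"

# Print the object that the variable author points to.
print(author)
```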
Notice that when I execute the cell above, I see an output that relates to the object we created above. What would happen if I tried to print off that object, but I used a capital letter, rather than a lowercase one at the beginning, so Author, rather than author?
2.2.1.2. Case Sensitivity¶
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In , in <cell line: 1>()
----> 1 print (Author)

NameError: name 'Author' is not defined
The scary looking block of text above indicates that we have produced an error in Python. This mistake teaches us two things. First, Python is case sensitive. This means that the name of any object (or the contents of any string) must be matched not only in its letters, but also in the case of those letters. Second, this mistake teaches us that we can only call objects that have been created and stored in memory.
Now that we have the variable pointing to a specific piece of data, we can make our print function above a bit more dynamic. Let’s try and print off the same statement as before, but with the new full author name. I don’t expect you to understand the specifics of the code below, rather simply understand that we will frequently need to store variables in memory so that we can use them later in our programs.
Hello, William Mattingly!
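A greeting like this can be built in several ways. The sketch below assumes an f-string, which is one common option; the variable name greeting is my own, chosen for illustration.

```python
author = "William Mattingly"

# An f-string (note the f before the opening quotation mark) inserts the
# value of author wherever {author} appears.
greeting = f"Hello, {author}!"
print(greeting)
```

If author pointed to a different name, the same two lines would print a different greeting, which is what makes the code dynamic.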
2.2.1.3. Reserved Words¶
When working with Python, there are a number of words known as reserved words (also called keywords). These are words that cannot be used as variable names. As of Python version 3.6, there were a total of 33 reserved words; Python 3.7 added async and await, which brings the total to the 35 listed below. It can sometimes be difficult to remember all of these reserved words, so Python has a nice built-in function, help. If we execute help("keywords"), we will see the entire list.
Here is a list of the Python keywords.  Enter any keyword to get more help.

False               class               from                or
None                continue            global              pass
True                def                 if                  raise
and                 del                 import              return
as                  elif                in                  try
assert              else                is                  while
async               except              lambda              with
await               finally             nonlocal            yield
break               for                 not
These are words that you cannot use as a variable name.
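If you would rather check a single word than read the whole table, the standard library's keyword module offers a way to do so. This is a small sketch; the variable name reserved_words is my own.

```python
import keyword

# keyword.kwlist holds the reserved words as a Python list.
reserved_words = keyword.kwlist
print(len(reserved_words))

# keyword.iskeyword tells us whether a word is reserved.
print(keyword.iskeyword("class"))   # True
print(keyword.iskeyword("author"))  # False
```

The length that is printed depends on the version of Python you are running.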
2.2.1.4. Built-in Types¶
In addition to reserved words, there are also built-in types in Python. These are words that you can use (as we will see) to convert one type of data into another. There are 94 of these in total in the version of Python used here; the exact number varies between versions. Unlike reserved words, you can use a built-in type as a variable name. It is, however, strongly discouraged to do so because the new variable will hide the built-in of the same name for the rest of your script.
import builtins
[getattr(builtins, d) for d in dir(builtins) if isinstance(getattr(builtins, d), type)]
[ArithmeticError, AssertionError, AttributeError, BaseException, BlockingIOError, BrokenPipeError, BufferError, BytesWarning, ChildProcessError, ConnectionAbortedError, ConnectionError, ConnectionRefusedError, ConnectionResetError, DeprecationWarning, EOFError, OSError, Exception, FileExistsError, FileNotFoundError, FloatingPointError, FutureWarning, GeneratorExit, OSError, ImportError, ImportWarning, IndentationError, IndexError, InterruptedError, IsADirectoryError, KeyError, KeyboardInterrupt, LookupError, MemoryError, ModuleNotFoundError, NameError, NotADirectoryError, NotImplementedError, OSError, OverflowError, PendingDeprecationWarning, PermissionError, ProcessLookupError, RecursionError, ReferenceError, ResourceWarning, RuntimeError, RuntimeWarning, StopAsyncIteration, StopIteration, SyntaxError, SyntaxWarning, SystemError, SystemExit, TabError, TimeoutError, TypeError, UnboundLocalError, UnicodeDecodeError, UnicodeEncodeError, UnicodeError, UnicodeTranslateError, UnicodeWarning, UserWarning, ValueError, Warning, OSError, ZeroDivisionError, _frozen_importlib.BuiltinImporter, bool, bytearray, bytes, classmethod, complex, dict, enumerate, filter, float, frozenset, int, list, map, memoryview, object, property, range, reversed, set, slice, staticmethod, str, super, tuple, type, zip]
Of this long list, I recommend paying particular attention to the ones that you are more likely to write naturally: bool, dict, float, int, list, map, object, property, range, reversed, set, slice, str, super, tuple, type, and zip. You are more likely to use these as variable names by accident than, say, ZeroDivisionError; and this shorter list is a lot easier to memorize.
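To see why this matters, here is a small sketch of what happens when list is used as a variable name by accident. The try and except lines are covered later in the textbook; here they only catch the error so that the example can keep running.

```python
import builtins

# Allowed, because list is a built-in type and not a reserved word.
list = "William"

try:
    list("abc")  # list now points to a string, so it can no longer be called
except TypeError as error:
    message = str(error)
print(message)

# Point the name back at the original built-in so later code still works.
list = builtins.list
print(list("abc"))
```

The first print shows 'str' object is not callable, an error message that gives little hint that a variable name was the cause.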
2.2.2. What is Data?¶
In Python there are seven key pieces of data and data structures with which we will be working:
strings (text)
integers (whole numbers)
floats (decimal numbers)
Booleans (True or False)
lists (ordered collections that can be changed in memory)
tuples (lists that cannot be changed in memory)
dictionaries (key : value)
In this chapter and the next, we will explore each of these. This chapter focuses on data: strings, integers, floats, and Booleans. The next chapter will focus on data structures: lists, tuples, and dictionaries.
Data are pieces of information (the singular is datum), e.g. integers, floats, and strings. Data structures are objects that make data relational, e.g. lists, tuples, and dictionaries. Before you proceed to lesson three, you MUST have a basic understanding of the ways in which you create data in Python and the ways in which you make that data relational through data structures. Start to train your brain to recognize the Python syntax for these pieces of data and data structures discussed below.
2.2.3. Strings¶
A string is a sequence of characters. A good way to think about a string is as a text. We can create a string in Python by designating a unique variable name and setting it = to something within quotation marks, i.e. " " or ' '.
The opening of a quotation mark indicates to Python that a string has begun and the closing of the same style of a quotation mark indicates the close of a string. It is important to use the same style of quotation mark for a string, either a double or a single. In the examples below, we have two string objects: a_string and b_string. The former corresponds to the string “Hello” and the latter corresponds to the string “Bye”.
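The two objects just described could be created as in this sketch. Which quotation style goes with which string is my assumption; either style works for both.

```python
# Double quotation marks open and close the first string.
a_string = "Hello"

# Single quotation marks work as well, as long as the opening and
# closing marks match.
b_string = 'Bye'

print(a_string)
print(b_string)
```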
Let’s take a look at a few examples of strings.
2.2.3.1. Examples of Strings¶
str1 = "This is a string."
This is a string.
str2 = 'This is a string too.'
This is a string too.
str3 = "This is a "bad" example of a string"
Input In 
    str3 = "This is a "bad" example of a string"
                       ^
SyntaxError: invalid syntax
str4 = ''' This is a verrry long string. '''
This is a verrry long string.
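The bad example fails because Python reads the second double quotation mark as the end of the string. Here is a short sketch of two ways around that, along with a triple-quoted string that spans more than one line. The variable names are my own.

```python
# Single quotation marks on the outside let us use double ones inside.
str3_fixed = 'This is a "good" example of a string'

# A backslash before each inner quotation mark also works.
str3_escaped = "This is a \"good\" example of a string"

# Triple quotation marks allow a string to run across several lines.
str4_multiline = """This is a
verrry long string."""

print(str3_fixed)
print(str3_escaped == str3_fixed)
print(str4_multiline)
```

The second print shows True, because both approaches produce exactly the same string.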
2.2.3.2. Type Function¶
It is frequently necessary to check to see what type of data a variable is. To identify this, you can use the built-in function type in Python. To use type, we use the command below with the data we want to analyze placed between the two parentheses.
The output is “str” which tells us that the object is a string.
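As a small sketch, here is type applied to one of the strings from above. When the result is passed to print, Python displays it in a slightly longer form than a Jupyter cell does on its own.

```python
str1 = "This is a string."

# type returns the class of the object that the variable points to.
print(type(str1))  # <class 'str'>

# The result can be compared directly against a built-in type.
print(type(str1) == str)  # True
```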
2.2.4. Numbers (Integers and Floats)¶
Numbers are represented in programming languages in several ways. The two we will deal with are integers and floats.
An integer is a number that does not contain a decimal point, e.g. 1 or 2 or 3. This can be a number of any size, such as 100,001,200 (typed in Python without the commas). A float, on the other hand, is a number with a decimal point. So, while 1 is an integer, 1.0 is a float. Floats can also be large or small, but they necessarily have a decimal point, e.g. 200.0020002938. In Python, you do not need any special characters to create an integer or float object. You simply need an equal sign. In the example below, we have two objects which are created with a single equal sign. These objects are named int1 and float1, with the former being an object that corresponds to the integer 1 and the latter being an object that corresponds to the float 1.1.
2.2.4.1. Examples of Numbers¶
int1 = 1
float1 = 1.1
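We can confirm how Python classifies these two objects with the type function. The sketch below also shows a large number written with underscores, a detail I have added for illustration.

```python
int1 = 1
float1 = 1.1

print(type(int1))    # <class 'int'>
print(type(float1))  # <class 'float'>

# 1 and 1.0 are equal in value, but they are different types of data.
print(1 == 1.0)              # True
print(type(1) == type(1.0))  # False

# Underscores can make large numbers easier to read; commas cannot be used.
large_number = 100_001_200
print(large_number)
```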
2.2.5. Booleans¶
The term boolean comes from Boolean algebra, which is a type of mathematics that works in binary logic. Binary is the basis for all computers, save for the more nascent quantum computers. Binary is 0 or 1; off or on; true or false. A boolean object in programming languages is either True or False. True is 1, while False is 0. In Python, we write these values as True and False, each with a capital first letter. Let’s make one such object now.
2.2.5.1. Examples of Booleans¶
bool1 = True
bool2 = True
bool3 = False
bool4 = False
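The claim that True is 1 and False is 0 can be checked directly. This sketch also shows that a comparison produces a Boolean; the name is_bigger is my own.

```python
bool1 = True
bool3 = False

# Booleans behave like the integers 1 and 0.
print(bool1 == 1)                # True
print(bool3 == 0)                # True
print(int(bool1), int(bool3))    # 1 0

# Comparisons produce Booleans.
is_bigger = 2 > 1
print(is_bigger)
print(type(is_bigger))
```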
2.2.6. Working with Strings as Data¶
Often when working within Python, you will not be simply creating data, you will be manipulating it and changing it. Because humanists frequently work with strings, I recommend spending a good deal of time practicing and memorizing the basic ways we work with strings in Python via the built-in methods.
In order to interact with strings as pieces of data, we use methods and functions. The chief functions for interacting with strings on a basic level come standard with Python. This means that you do not need to install third-party libraries. Later in this textbook we will do more advanced things with strings using regular expressions (Regex) and third-party libraries, but for now, we will simply work with the basic methods.
Let’s learn to manipulate strings now through code, but first we need to create a string. Let’s call it str6.
str6 = "This is a new string."
It is not a very clever name, but it will work for our purposes. Now, let’s try to convert the entire string into all uppercase letters. We can do this with the method .upper(). Notice that .upper() comes after the string, joined to it by a period, and that there are no arguments within the parentheses. The period after the object is a way you can easily identify a method (as opposed to a function). We will learn more about these distinctions in the chapters on functions and classes.
THIS IS A NEW STRING.
Notice that our string is now all uppercase. We can do the same thing with the .lower() method, but this method will make everything in the string lowercase.
this is a new string.
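Here are both methods side by side in a short sketch. The last line prints the original object to show that neither method has altered it.

```python
str6 = "This is a new string."

# Each method returns a new string.
print(str6.upper())
print(str6.lower())

# str6 itself is left untouched.
print(str6)
```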
On the surface, these methods may appear to only be useful in niche circumstances. While these methods are useful for making strings look the way you want them to look, they have far greater utility. Imagine if you wanted to search for a name, “William”, in a string. What if the data you are examining is from emails, text messages, etc.? William may be capitalized or not. This means that you would have to run two searches for William across a string. If, however, you lowercase the string before you search, you can simply search for “william” and you will find all hits. This is one of the things that happens on the back-end of most search engines to ensure that your search is not strictly case-sensitive. In Python, however, it is important to do this step of data cleaning before running searches over strings.
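The search described above can be sketched as follows. The text in email_text is invented for illustration, and the in keyword and the .count() method are used here ahead of their proper introduction.

```python
# A hypothetical message in which the name is never written as "William".
email_text = "hi william, are you free on friday? WILLIAM, please reply."

# A case-sensitive search misses both spellings.
print("William" in email_text)  # False

# Lowercasing first lets a single search find the name.
print("william" in email_text.lower())  # True

# count reports how many times the lowercased name occurs.
print(email_text.lower().count("william"))  # 2
```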
Let’s explore another method, .capitalize(). This method capitalizes the first character of a string and makes every other character lowercase.
str7 = "william"
I will use this in niche circumstances, particularly when I am performing data cleaning and need to ensure that all names or proper nouns in a dataset are cleaned and well-structured.
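A short sketch shows the difference this makes with a full name. The .title() method and the name full_name are my own additions for comparison.

```python
str7 = "william"
print(str7.capitalize())  # William

# capitalize only uppercases the first character and lowercases the rest.
full_name = "william MATTINGLY"
print(full_name.capitalize())  # William mattingly

# title capitalizes the first letter of every word instead.
print(full_name.title())  # William Mattingly
```

For names made of several words, .title() is often closer to what you want, although it can stumble on names that contain apostrophes or internal capitals.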
Perhaps the most useful string method is .replace(). Notice in the cells below, replace takes two mandatory arguments, or things passed between the parentheses. Each is separated by a comma. The first argument is the substring or piece of the string that you want to replace and the second argument is what you want to replace it with. Why is this so useful? If you are using Python to analyze texts, those texts will, I promise, never be well-cleaned. They may have bad encoding, characters that will throw off searches, bad OCR, multiple line breaks, hyphenated characters, the list goes on. The .replace() method allows you to quickly and effectively clean textual data so that it can be standardized.
Unlike the above methods, for .replace(), we need to put two things within the parentheses. These are known as arguments. This method requires two and they must be in order. The first is the thing that you want to replace and the second is the thing that you want to replace it with. These will be separated by a comma and both must be strings. It should, therefore, look something like this:
.replace("the thing to replace", "the thing you want to replace it with")
In the example below, let’s try and replace the period at the end of “Mattingly.”
str8 = "My name is William Mattingly."
print (str8.replace(".", ""))
My name is William Mattingly
Excellent! Now, let’s try and reprint off str8 and see what happens.
My name is William Mattingly.
Uh oh! Something is not right. Nothing has changed! Indeed, this is because strings are immutable objects. Immutable objects are objects that cannot be changed in memory. As we will see in the next chapter with lists, the other type of object is one that can be changed in memory. These are known as mutable objects. In order to change a string, therefore, you must recreate it in memory or create a new string object from it. Let’s try and do that below.
str9 = str8.replace(".", "")
My name is William Mattingly
Excellent! Now we have a new string that has been cleaned, but let’s say I am only interested in grabbing what comes after the phrase “My name is”. I have a couple different methods to do this. I could use the .replace() method.
str10 = str9.replace("My name is", "")
Everything looks good when we print it off below, but notice, there is a leading whitespace before the W. This is uncleaned data. We often want to remove leading or trailing whitespace so that all data is consistent in our dataset. We can do this by using the .strip() method.
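The stripping step might look like the sketch below. I use repr here, which is my own choice, because it prints the quotation marks and so makes the leading space visible.

```python
str9 = "My name is William Mattingly"
str10 = str9.replace("My name is", "")

# repr shows the quotation marks, which makes the leading space visible.
print(repr(str10))  # ' William Mattingly'

# strip returns a new string without leading or trailing whitespace.
str10 = str10.strip()
print(repr(str10))  # 'William Mattingly'
```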
Now, our data is cleaned.
My name is William Mattingly
Strings have a lot of other useful methods that we will be learning about throughout this textbook, such as the split() method which returns a list of substrings that are split by the delimiter, which is the argument of the method. By default, split() will split your string at the whitespace.
['My', 'name', 'is', 'William', 'Mattingly']
As we learn about lists in the next chapter, you will be able to use split to grab specific items from that list. Notice that when we split on the phrase “My name is ” below, the resulting list holds an empty string followed by the same name that our method of replace and strip produced above. It is important to remember that in programming, there is rarely one right answer. Usually a problem can be solved many ways, but some are more efficient or easier to parse when another programmer tries to read your code.
print (str9.split("My name is "))
#Pythonic => the standard way to do something in Python
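Both uses of split can be compared in one sketch. The square brackets in the last step pick a single item out of a list, something we cover properly in the next chapter; the names words, parts, and name are my own.

```python
str9 = "My name is William Mattingly"

# With no argument, split breaks the string at whitespace.
words = str9.split()
print(words)

# With a delimiter, the string is broken wherever the delimiter occurs.
# Nothing comes before "My name is ", so the first item is an empty string.
parts = str9.split("My name is ")
print(parts)  # ['', 'William Mattingly']

# Square brackets pick one item out of the list; counting starts at 0.
name = parts[1]
print(name)  # William Mattingly
```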
2.2.7. Working with Numbers as Data¶
Now that you understand how strings work, let’s begin exploring another type of data: numbers. Numbers in Python exist in two chief forms: integers and floats.
As noted above, integers are numbers without a decimal point, whereas floats are numbers with a decimal point. This is an important distinction that you MUST remember, especially when working with data imported from and exported to Excel.
As digital humanists, you might be thinking to yourself, “I just work with text, why should I care so much about numbers?” The answer? Numbers allow us to perform quantitative analysis. What if you want to know how many times a specific author wrote to a colleague or to which places he wrote most frequently, as was the case with the Republic of Letters project at Stanford? In order to perform that kind of analysis, you MUST have a command of how numbers work in Python, how to perform basic mathematical functions on those numbers, and how to interact with them. Further, numbers are essential to understand for performing more advanced functions in Python, such as Loops, explored in Chapter 03.01.
The way in which you create a number object in Python is to create an object name, use the equal sign, and type the number. If your number has a decimal point, Python will automatically consider it a float. If it does not, it will automatically consider it an integer.
num1 = 1
num2 = 2
Throughout your DH project, you will very likely need to manipulate numbers through mathematical operations. Here is a list of the common operations:
Addition +
Subtraction -
Multiplication *
Division /
Exponential Multiplication **
Modulo % #This will return the remainder, e.g. 7%2 will yield 1.
Floor // #This will return the number of whole times one number goes into another, e.g. 7//2 will yield 3.
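To see each operation at work, here is a short sketch using the numbers 7 and 2, which I have chosen for illustration.

```python
num1 = 7
num2 = 2

print(num1 + num2)   # addition: 9
print(num1 - num2)   # subtraction: 5
print(num1 * num2)   # multiplication: 14
print(num1 / num2)   # division always returns a float: 3.5
print(num1 ** num2)  # exponential multiplication: 49
print(num1 % num2)   # modulo, the remainder: 1
print(num1 // num2)  # floor division, the whole-number part: 3
```

Notice that the order of the two numbers matters for most of these: 2 % 7 is 2 and 2 // 7 is 0.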
In addition to this, you will very often need comparison operators (equal to, less than, etc.), especially in loops. Here is a list of those:
Equal to (==)
Greater than (>)
Less than (<)
Less than or equal to (<=)
Greater than or equal to (>=)
Not equal to (!=)
We will address these in greater detail in a later chapter as they are particularly useful when you work with loops.
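In the meantime, here is a brief sketch of the operators listed above. The two counts are hypothetical values invented for illustration. Each comparison produces a Boolean.

```python
# Hypothetical counts of letters, for illustration only.
letters_sent = 12
letters_received = 9

print(letters_sent == letters_received)  # equal to: False
print(letters_sent > letters_received)   # greater than: True
print(letters_sent < letters_received)   # less than: False
print(letters_sent <= 12)                # less than or equal to: True
print(letters_received >= 10)            # greater than or equal to: False
print(letters_sent != letters_received)  # not equal to: True
```

Remember that a single = creates a variable, while a double == compares two pieces of data.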
2.2.8. Conclusion¶
This chapter has introduced you to some of the essential types of data: strings, integers, floats, and booleans. It has also introduced you to some of the key methods and operations that you can perform on strings and numbers. Before moving on to the next chapter, I recommend spending some time testing out these methods on your own data. Try to manipulate an input text to locate and retrieve specific information.
If you haven’t yet installed Python, that’s okay! There are free online Python interpreters, including the one I have available at https://pythonhumanities.com/lesson-02-storing-data-in-python/. Once you feel comfortable with types of data and how to create and interact with them in Python, feel free to continue into the next chapter as we explore data structures.
2.2.9. Challenge Questions¶
What would be the output of the command below?
type('What type of data is this?')
What about this?
Input In 
    print("Hello, world!')
                          ^
SyntaxError: EOL while scanning string literal
And what about this?
person = "John"
person = person.lower()
person2 = person
person = person.upper()
print(person2)
Here is why the output looks the way it does.
If you did not get this right, let’s talk about why. Remember, a program is a sequential set of operations.
person = "John"
this establishes that the variable person points to a string, "John"
person = person.lower()
this lowercases the string, so at this stage person would be "john"
person2 = person
this creates person2 as a variable and points it to the same string that person points to at this moment, "john"
person = person.upper()
this points person to a new uppercase string, "JOHN". The variable person2, however, still points to "john", so print(person2) outputs john