Introduction to Python

1.1. Introduction to Python

1.1.1. Why Should Humanists Learn to Code?

Before we begin, let’s start with a simple question: why should humanists learn to code? To answer it, let’s consider the ideal digital humanities project.

In an ideal digital humanities project, a humanist (or team of humanists) will leverage their domain knowledge, or area of expertise. As a domain expert, this project leader, better known as the principal investigator (PI), will convey their idea to their team, which will consist of others in their field and usually a technical lead. This technical lead will (depending on funding) either handle all technical aspects of the project (web development, coding, data management, etc.) or lead a team that handles them. This is the ideal scenario: each person on the team does what they do best.

Such a project requires two things that are in short supply for most researchers: funding and time. The self-reliant humanist who can leverage their domain knowledge while also functioning in a technical capacity on a project drastically reduces the need for both in several ways. To understand how, we must first understand how digital humanities projects receive funding.

1.1.1.1. The Timeline for Starting a Large Digital History Project

Large digital projects often take years to begin. These projects often start with a researcher (or researchers) wishing to explore a question or create a digital platform for an aspect of their research. These scholars will then reach out to others in their field to gauge interest. This process can take several months to a year, especially if the researchers wait to present their idea at the next conference in their field.

If there is interest in such a project, the project PI will find a technical expert who can tell them whether the project is possible and provide an estimate of its cost. If the PI does not have a technical expert and cannot find one associated with their university, this can be challenging and time-consuming. As we will see, the relationship between PI and technical lead is a unique bond that requires good communication from both parties. Selecting the right technical lead is important because their pay may be one of the larger budget items. They are also the person who makes the PI’s vision a digital reality.

Next, the PI will create a list of funding opportunities for the project and then begin applying to them. Each application can take several months to complete. Once complete, the PI will then usually submit the application to the granting association through their university. If there are multiple PIs on the project at different universities, this process can be more complicated as all institutions usually need to communicate with each other.

The university will usually review all materials. They will be especially interested in the budget. This process can take a few weeks while the university ensures all numbers are accurate. After all, most universities will be taking a cut of the grant money (sometimes up to 30%), so they want to make sure they see the numbers first.

After this, the university will submit the grant. Depending on the granting agency, the wait time can be anywhere from a few months to nearly a year. During this time, the digital project may not move forward much, if at all. Rather, it may sit in a holding period while the PI waits to hear back.

When we look at this timeline, we see that even in the ideal scenario it can realistically take 2-3 years to get from the initial idea to the start of the project. All of this comes with a bit of a gamble as well. The PI believes in the idea, has gauged the community to know that there is interest in it, and has found a technical lead who can do the job, but a lot can happen in 2-3 years. In that time, new developments in the domain expert’s field may have altered the project trajectory (or, in the worst-case scenario, made it redundant). Further, new digital approaches or methods may mean that an entirely new approach is necessary. This is especially common today if the project involves machine learning, a field in which new advances are happening every week. In addition to this, the technical lead may have taken a job elsewhere. (This does happen, as they are not contracted until the grant is received.) Finally, there is no guarantee that funding agencies will value the idea. This means that the time spent designing the project conceptually and preparing to get started could be entirely in vain.

By learning to code, you can often sidestep much of this process by testing your idea within days or weeks. This means that you will be able to test the validity of the idea and potentially even create a minimum viable product (MVP) that can be used to assist in obtaining funding. An MVP is a working subset of a project that demonstrates its validity. For a digital humanities project, this may mean a simple front-end application or a subset of the data prepared in a specific way. MVPs are quite common today in places like Silicon Valley, where they are used to present an idea to investors. This helps give a real sense of how the project will work. While not all mechanics will be available in an MVP, it is a good proof of concept. It also demonstrates that the project is possible, that the team can do the work they are promising, and that the team already has a good handle on the potential issues that may surface. These are all things that granting agencies like to see. Throughout this textbook, you will develop the skills necessary to create an MVP that you can present to a granting agency. This is especially true of the final chapter, where we learn how to build cloud-based Python applications with Streamlit.

1.1.1.2. The Self-Reliant Digital Humanist

Ultimately, the ability to code turns the humanist into an entirely self-reliant researcher. By learning to code, you can do far more than test out ideas and create MVPs. You can significantly reduce the cost of a digital humanities project. As a domain expert, you know your field well. As a coder, you will know what is technically possible. If you end up gaining expertise in data science or machine learning (something that is quite realistic today, as both fields are more accessible than ever), you can leverage some of the most powerful methods in technical fields within a digital humanities context. You can also eliminate many issues that may surface during a digital humanities project.

One problem is communication. The project PI needs the ability to communicate their ideas to the technical lead. This communication often occurs in a hybrid language that changes from team to team. Over time, the PI begins to use technical terms and the technical lead begins to use some domain-specific terms. This relationship and shared vocabulary take time to build. By having a strong command of the technical aspects of a project, the PI can either eliminate the need for a technical lead entirely or communicate their ideas to the technical side of the project in technical language.

Another issue that vanishes is the lag time from idea to execution. Once a funded project begins, the PI will guide their vision and convey each step to the technical lead whose job it is to make that vision a reality. There is a lag that exists here. Often, meetings on digital projects occur on a regular basis, sometimes weekly or bi-weekly. During that time, issues may surface that are placed on hold until the next meeting. This can put a digital project behind schedule. If the humanist is self-reliant, however, they can develop their idea independently of a technical lead. This means that when an issue surfaces, the PI can immediately rectify it and communicate the solution to the technical lead.

Third, technical leads on projects usually absorb a lot of funding. Hiring a programmer can be costly, but if your project requires data science and machine learning, experts in these areas can cost hundreds of dollars per hour. If the PI has the technical expertise, they can perform the technical aspects of a project until enough money is received to hire a technical lead.

1.1.1.3. Other Benefits

Learning to code gives the humanist tools beyond simply the ability to code. It fundamentally alters how the humanist views their field and the questions they can explore. Things that seemed impossible before will suddenly appear quite simple to solve.

One of the greatest advantages of learning to code is automation, or the process by which we write rules for a computer to perform a repetitive task. Many of the tasks we perform as humanists are repetitive, and we, as humans, are prone to making mistakes. Automating these tasks can radically reduce the time we spend on them, from hours, weeks, or even years to seconds, minutes, or hours. Imagine having to grab data from an archive website, where the information you need is spread across 2,000 different pages. How long would it take you to go to each of those pages and copy and paste the text into a Word document? In Python, that task can be coded in minutes and left to run for an hour. This allows you, the researcher, to go off and do some other, more essential task; or, perhaps, enjoy a nice tea break in a hammock so that when you return to analyze the documents, you will be well-rested.
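To make the archive scenario a little more concrete, here is a minimal sketch of what such a script might look like. It uses only Python’s standard library. The web address, the `page` parameter, and the function names `fetch_page` and `save_pages` are all made up for illustration; a real archive will have its own URL pattern and its own rules about automated access.

```python
import time
from pathlib import Path
from urllib.request import urlopen

# Hypothetical address; a real archive will use its own URL pattern.
BASE_URL = "https://archive.example.org/records?page={page}"


def fetch_page(url):
    """Download one page and return its contents as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def save_pages(first, last, out_dir, fetch=fetch_page, delay=1.0):
    """Fetch pages first..last (inclusive) and save each one to its own file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for page in range(first, last + 1):
        text = fetch(BASE_URL.format(page=page))
        target = out_path / f"page_{page:04d}.html"
        target.write_text(text, encoding="utf-8")
        saved.append(target)
        time.sleep(delay)  # pause between requests to be polite to the server
    return saved


# Example call (commented out because the address above is not real):
# save_pages(1, 2000, "archive_pages")
```

The loop is the important part: the same few lines run for every page number, whether there are 20 pages or 2,000.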

Coding does not just allow us to automate tasks like this. It also allows us to systematically clean data. Imagine you wanted to search across PDF scans of medieval Latin. This problem presents several issues that make such a search impossible or unreliable. First, medieval Latin did not have any firm spelling convention. This means that some scans may have variant spellings of words. Second, Latin is a highly inflected language, meaning that a word’s ending, rather than its position in the sentence, signals its grammatical role. When combined with variant spellings, this means that each word can sometimes be represented in hundreds of different ways. In addition to this, your texts are scans. Are those scans even searchable? If they are, was the OCR, or Optical Character Recognition, accurate? If it was done prior to 2015, it likely is not. If it was done after, the scans may still be in a bad enough state that the OCR is not accurate. In addition to these problems, any OCR system will retain line breaks, meaning that if the key word you want to search for is hyphenated because it is broken across two lines, you need to account for that in your search. Next, we need to take into account editorial notations, which are often marked with brackets, parentheses, and carets (angle brackets). While this example is certainly a complex one, it is perfectly common. And while I am using Latin to demonstrate a greater issue with inflected languages, these same issues, especially those around OCR, surface with English texts as well. Coding allows us to address each and every one of these issues, some more easily than others.
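A few of these cleaning steps can be sketched in a handful of lines. The example below is only an illustration under simple assumptions: the sample sentence is invented, the spelling rules cover just a few common variants, and the function names are hypothetical. It rejoins words that were hyphenated across a line break, removes editorial marks while keeping the text inside them, and normalizes some variant spellings.

```python
import re


def rejoin_hyphenated(text):
    """Join words split across lines, e.g. 'eccle-' + 'sia' becomes 'ecclesia'."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def strip_editorial_marks(text):
    """Remove brackets, parentheses, and angle brackets but keep their contents."""
    return re.sub(r"[\[\]()<>]", "", text)


def normalize_spelling(text):
    """Collapse a few common spelling variants (illustrative, not exhaustive)."""
    text = text.lower()
    text = text.replace("ae", "e").replace("oe", "e")
    text = text.replace("j", "i").replace("v", "u")
    return text


def clean(text):
    """Apply every cleaning step, then collapse line breaks and extra spaces."""
    text = rejoin_hyphenated(text)
    text = strip_editorial_marks(text)
    text = normalize_spelling(text)
    return " ".join(text.split())


sample = "In eccle-\nsia sanctae Mari[ae] ius\nhabuit"
print(clean(sample))  # in ecclesia sancte marie ius habuit
```

A rule this simple will also join words that are legitimately hyphenated at the end of a line, so real projects usually refine each step against their own texts.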

The issue of searching is further complicated when it needs to be done for many documents simultaneously. Imagine if these issues surfaced in 5,000 different PDFs that you needed to analyze. Could you realistically run all these searches across 5,000 documents? If you could, would your results be good? To answer the former, yes, if you are willing to spend months doing it; to answer the latter, likely not. Programming allows you to develop programs that perform all these tasks across all 5,000 documents. When you run a search, you will not simply hit Ctrl+F. You will code your own search method so that a simple search can return complex results that account for variance in the text.
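As a rough sketch of what “coding your own search method” can mean, the example below searches a small collection of texts for a word stem followed by any ending. The three short documents stand in for thousands of files, and the names `build_pattern` and `search_documents` are hypothetical.

```python
import re


def build_pattern(stem):
    """Match the stem followed by any ending, ignoring case."""
    return re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)


def search_documents(documents, stem):
    """Return {document name: [matched word forms]} for documents with hits."""
    pattern = build_pattern(stem)
    results = {}
    for name, text in documents.items():
        hits = pattern.findall(text)
        if hits:
            results[name] = hits
    return results


# A tiny stand-in for a large collection of texts.
documents = {
    "charter_001.txt": "Rex dedit terram ecclesie. Regis sigillum appositum est.",
    "charter_002.txt": "Abbas et monachi terram tenent.",
    "charter_003.txt": "Coram rege et regina factum est.",
}

print(search_documents(documents, "reg"))
```

This finds “Regis”, “rege”, and “regina”, which a plain search for a single word form would not return together. It also shows the limits of a simple rule: “Rex” is missed because it does not share the stem, and “regina” is included even if you only wanted forms of “rex”. Refining the method for your own sources is where domain knowledge and coding meet.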

Python makes all of this possible.

1.1.2. What is Python?

Python is a programming language. Programming languages are ways that we, as humans, can write commands that will then be executed by a computer. There are many different programming languages available for humanists to choose from:

C
Python
JavaScript
R
HTML (this is debatable)

Not all programming languages are created equal. Some are best used for developing software, such as C; others are best suited for web development, such as HTML and JavaScript. Still others are great at statistical analysis, such as R (and Python). Where does Python excel? Python excels in many areas. One thing that sets it apart from other programming languages for new coders is that it is easy to read and easy to write.

It is relatively easy to write compared to other programming languages because the syntax, or way in which you write tasks in code, is straightforward. It is easy to read because Python uses forced indentation. This means that blocks of code that can be difficult to read in other programming languages are easily spotted in Python.
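A short example may help show what forced indentation looks like in practice. The word list is invented for illustration; the point is only that the indentation itself tells both Python and the reader which lines belong to which block.

```python
words = ["arma", "virumque", "cano"]
labels = []

for word in words:
    # Everything indented under the for statement runs once per word.
    if len(word) > 4:
        # This deeper level belongs to the if statement.
        labels.append(word + " is long")
    else:
        labels.append(word + " is short")

# Back at the left margin: this line runs once, after the loop has finished.
print(labels)
```

If the indentation were removed or misaligned, Python would refuse to run the code, which is why Python programs tend to look visually consistent.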

Since its creation in the early 1990s, Python has soared in popularity, which has, in turn, resulted in a large community of programmers and a large number of available libraries. We will learn about libraries later in this textbook. For now, think of libraries as large quantities of code that have already been written for you, so that you can write one line of code to perform a complex task that might otherwise take hundreds or thousands of lines to write.
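As a small taste of this idea, the sketch below counts how often each word appears in a sentence. The counting is done by `Counter`, which comes from `collections`, a library included with Python itself. The sentence is invented for illustration.

```python
from collections import Counter

text = "the cat saw the dog and the dog saw the cat"

# One line counts every word; Counter does the bookkeeping for us.
counts = Counter(text.split())

print(counts["the"])          # 4
print(counts.most_common(1))  # [('the', 4)]
```

Writing the same counting logic by hand is possible, but the library version is shorter and already tested by many other users.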

Today, Python is one of the most widely used programming languages and is considered the essential language for text analysis, machine learning, data analysis (alongside R), web scraping, and much more.

Python (as of this writing) is currently in version 3.11.0. Let’s break each of these numbers down. The 3 is the major version of the language. Python 2 still exists (it shipped with macOS until version 12.3), but it reached its end of life in January 2020 and is no longer supported. You should not invest time in learning Python 2, as it is only used to support legacy software and no longer receives security fixes because of its deprecated status. This is important to note because certain things are coded differently between the two versions. If you look for help on a coding forum, such as Stack Overflow, it is important to know about these distinctions, since older answers may be written for Python 2.

The 11.0 in Python 3.11.0 tells us specifically which release of Python 3 we are using: the 11 is the minor version, which adds new features, and the 0 is the micro (bug-fix) version. Not all libraries are compatible with every version; some require a specific version of Python, so understanding this now as a concept will save confusion later. Each new version brings new features, so it is worth staying reasonably up to date, but it is not always essential.
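Once Python is installed, you can ask it which version it is. The sketch below uses `sys.version_info` from the standard library, which stores the parts of the version number separately. The 3.8 threshold in the check is an arbitrary example, not a requirement of this textbook.

```python
import sys

# version_info holds the parts of the version number separately.
major, minor, micro = sys.version_info[:3]
print(f"Running Python {major}.{minor}.{micro}")

# A script can check the version before relying on newer features.
# The (3, 8) threshold here is only an illustrative choice.
if (major, minor) < (3, 8):
    print("This Python is older than 3.8; some libraries may not install.")
```

On a machine running Python 3.11.0, the first line printed would read "Running Python 3.11.0".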

1.1.3. Why Python?

For all the above reasons, I encourage humanists to learn Python as their first programming language. It is one of the easiest languages to learn, straightforward to write, and can solve most of the programmatic problems a humanist may face. The large number of libraries and tutorials means that few tasks will prove impossible.

If you are sold on Python, then continue reading this textbook as we learn how to install it and use it as humanists.