Character Cleaning

Data Inaccuracy

Cleaning the data remained my primary task right up to the end of the project. During the analysis, it quickly became apparent that analyzing the data accurately would be challenging. Since the metadata came from AO3, where every user enters their tags by hand, there were numerous inconsistencies and variant spellings that could not easily be resolved automatically. For instance, a single dataset mentioned over 800 different characters, many of them recorded several times in different forms:

  • Tom Riddle
  • Tom Marvolo Riddle
  • Voldemort
  • Lord Voldemort
  • Tom Riddle | Voldemort
  • etc.

With these inconsistencies in the data, I could not analyze my data accurately: I did not want the same character to appear five times in my network, as that would have distorted the results.
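
One way to collapse such variants is a hand-curated alias map; here is a minimal sketch in Python, using the Voldemort spellings above (the canonical form is my choice for illustration):

    # Hand-curated alias map: every variant points to one canonical name.
    ALIASES = {
        "Tom Riddle": "Tom Riddle | Voldemort",
        "Tom Marvolo Riddle": "Tom Riddle | Voldemort",
        "Voldemort": "Tom Riddle | Voldemort",
        "Lord Voldemort": "Tom Riddle | Voldemort",
    }

    def canonical(name: str) -> str:
        # Fall back to the stripped input for names without an alias entry.
        return ALIASES.get(name.strip(), name.strip())

    print(canonical("Lord Voldemort"))  # -> "Tom Riddle | Voldemort"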

Initially, I attempted to address these issues manually in Excel by searching for the characters and aliases I was familiar with. However, I soon realized this approach would not scale.

I then subscribed to ChatGPT Premium, hoping AI could help solve the problem. Unfortunately, even for the AI, the task was too large to clean the data both thoroughly and quickly. Therefore, I began going through the data line by line.

Characters were not the only dimension with this issue; relationships and fandoms had similar problems. Since I was already going through each row individually, I decided to also record the fandom and gender of every character, which would later help me categorize each relationship by type and identify the fandoms of the characters involved.

With the help of ChatGPT and manual Google searches for characters, it took me two days to clean a single dataset. Those two days ended up largely wasted, as I had to repeat the process at least twice more.

In addition, I tried to find existing datasets online with character, fandom, and gender information for each fandom I wanted to cover, to avoid filling in this information manually via Google or AI. Unfortunately, this approach did not work well either: I could not find suitable datasets for every character, and the character names often differed yet again. So I returned to manual searching with AI and Google.

My first mistake was starting with the wrong dataset. Not knowing any better, I began with the dataset I was most familiar with (Harry Potter). I later realized I should have started with the Overall Top 1000 dataset, which would have saved me from duplicating a lot of work across the other three datasets.

The second mistake was focusing on the character list instead of on the characters as they actually appear in the relationships, with their exact spellings. Those spellings were what mattered for implementing my idea.

Steps Followed

1. Extract all characters from the relationships in the Overall dataset.

For this, I first used a script that reads out each relationship individually along with its frequency. From there, I cleaned the names in Excel and split each relationship on the delimiters / (romantic pairings) and & (platonic ones).
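
Roughly, the extraction step looks like this in Python (I did the actual split in Excel; the file and column names below are placeholders):

    import csv
    import re
    from collections import Counter

    # Placeholder file names; the real files were AO3 stats exports.
    INPUT = "overall_relationships.csv"   # columns: relationship, frequency
    OUTPUT = "characters_raw.csv"

    char_counts = Counter()

    with open(INPUT, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # AO3 uses "/" for romantic and "&" for platonic relationships.
            names = re.split(r"\s*[/&]\s*", row["relationship"])
            for name in names:
                char_counts[name.strip()] += int(row["frequency"])

    with open(OUTPUT, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["character", "frequency"])
        writer.writerows(char_counts.most_common())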

2. Clean these characters and add information about their gender and fandom.

Here, I processed the dataset further with the help of AI and calculated the frequency of each character, so that every spelling appeared only once. After obtaining a more “manageable” dataset, I added three columns per character: the first an ID, the second the gender, and the third the fandom. I gathered this information through AI and Google searches.
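
The mechanical part of this step might look like the following in pandas (the alias merging and the gender/fandom research stayed manual; file names are again placeholders):

    import pandas as pd

    # Output of the previous step; placeholder file name.
    chars = pd.read_csv("characters_raw.csv")

    # Collapse duplicates that only differ in surrounding whitespace,
    # then sum their frequencies so each spelling appears once.
    chars["character"] = chars["character"].str.strip()
    chars = chars.groupby("character", as_index=False)["frequency"].sum()
    chars = chars.sort_values("frequency", ascending=False).reset_index(drop=True)

    # Three new columns: a stable ID plus the manually researched attributes.
    chars.insert(0, "id", chars.index + 1)
    chars["gender"] = ""   # filled in by hand / with AI
    chars["fandom"] = ""   # filled in by hand / with AI

    chars.to_csv("characters_cleaned.csv", index=False)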

3. Replace the relationship dataset with the cleaned names.

Here, I combined the cleaned character dataset and the relationship dataset with VLOOKUP in Excel, looking up each character name and inserting its ID so I could later identify exactly where each character appears.
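
For reference, a pandas merge does the same job as the VLOOKUP; this sketch assumes the relationship file has already been split into two name columns, and all file names are placeholders:

    import pandas as pd

    chars = pd.read_csv("characters_cleaned.csv")   # id, character, gender, fandom
    rels = pd.read_csv("relationships_split.csv")   # char_1, char_2, frequency

    # A left merge plays the role of VLOOKUP: look up each raw name
    # and pull in the matching character ID.
    lookup = chars[["character", "id"]]
    rels = rels.merge(lookup.rename(columns={"character": "char_1", "id": "id_1"}),
                      on="char_1", how="left")
    rels = rels.merge(lookup.rename(columns={"character": "char_2", "id": "id_2"}),
                      on="char_2", how="left")

    # Rows without an ID expose spellings that still need cleaning.
    print(rels[rels["id_1"].isna() | rels["id_2"].isna()])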

4. Combine information about characters, ships, gender, and fandom.

In this step, I wrote a Python script that combined the two datasets into a JSON file for further processing.
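
The script boiled down to building a node-link structure; the sketch below is a reconstruction under assumed file names and schema, not the original code:

    import json
    import pandas as pd

    chars = pd.read_csv("characters_cleaned.csv")      # id, character, gender, fandom
    rels = pd.read_csv("relationships_with_ids.csv")   # id_1, id_2, frequency

    # Node-link layout, the usual input format for network visualizations.
    data = {
        "nodes": [
            {"id": int(r.id), "name": r.character,
             "gender": r.gender, "fandom": r.fandom}
            for r in chars.itertuples()
        ],
        "links": [
            {"source": int(r.id_1), "target": int(r.id_2),
             "weight": int(r.frequency)}
            for r in rels.itertuples()
        ],
    }

    with open("network.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)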

5. Cross-check all entries with the other datasets to avoid redundant work.

Since I had four datasets, I first handled the Overall Relationships to avoid repeating my initial mistake, then moved on to the next dataset, again using VLOOKUP to insert the characters and their information.
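
A quick Python check of which names in a new dataset are still missing from the cleaned master list would look like this (file names are placeholders):

    import pandas as pd

    master = pd.read_csv("characters_cleaned.csv")     # the cleaned Overall list
    new = pd.read_csv("next_fandom_characters.csv")    # the dataset being checked

    # Names missing from the master list still need manual research;
    # everything else can reuse the existing ID/gender/fandom values.
    known = set(master["character"].str.strip())
    missing = new[~new["character"].str.strip().isin(known)]
    print(f"{len(missing)} characters still need cleaning:")
    print(missing["character"].tolist())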

6. Repeat steps 1-4 for the remaining datasets.

More Problems and Solutions

Even though these steps sound simple, this task was one of the most tedious I have ever undertaken. Excel struggled even with basic tasks like =VLOOKUP whenever a name contained a space where there shouldn't have been one.
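
Normalizing every name up front would have avoided a whole class of these mismatches; a small helper like this (my reconstruction, not part of the original pipeline) handles stray and non-breaking spaces:

    import unicodedata

    def normalize_name(name: str) -> str:
        # Unify Unicode forms (e.g., non-breaking spaces, composed accents).
        name = unicodedata.normalize("NFKC", name)
        # Collapse runs of whitespace and trim the ends.
        return " ".join(name.split())

    print(normalize_name("Tom  Riddle "))      # -> "Tom Riddle"
    print(normalize_name("Tom\u00a0Riddle"))   # -> "Tom Riddle"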

On my third attempt, I finally managed to get ChatGPT to fill in 20 characters at a time. I then had to merge these batches with the individual CSV files and still clean them manually. Many of the mistakes I made during this process only surfaced later, in the visualization phase, which made it even harder to trace each one back to the original error and fix it.
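
Merging the batches themselves was mechanical; under an assumed folder layout, it comes down to a concat:

    from pathlib import Path
    import pandas as pd

    # Assumed layout: one small CSV per ChatGPT batch in a "batches" folder.
    frames = [pd.read_csv(p) for p in sorted(Path("batches").glob("*.csv"))]

    merged = pd.concat(frames, ignore_index=True)
    # Duplicate spellings across batches still needed a manual pass.
    merged = merged.drop_duplicates(subset="character")
    merged.to_csv("characters_merged.csv", index=False)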

This process is where I invested the most time, even though that is not necessarily visible in the final project. I am glad I made so many mistakes in this step, because without them I wouldn't have learned so much about the data and about how to approach and solve such a problem. The only problem I never learned to solve was naming and organizing my files: because I had to do everything at least three times and converted and downloaded hundreds of datasets, I completely lost track of them all.

To this day, I'm not sure whether I missed anything in those 24,000 lines.

If you have found any errors, please contact me via the contact form.