Categorizing Tags

April 22, 2024

Additional Pain

Since I wanted to analyze not only the ships but also the tags in my datasets, which provide the most insight into the interests and needs of AO3 users, I set myself the ambitious task of categorizing the additional tags.

Because additional tags are free-form, are not sorted into specific genres, and can be entirely individual to a work, I couldn’t answer my original question if I accepted the tags as they were in my dataset. I therefore thought that categorizing the tags would provide better insights and allow for more meaningful analysis.

But how do I categorize tags to draw meaningful conclusions from them? Categories that are too general, like genre, emotions, or tropes, are not very informative. How many categories are acceptable? Many tags are very specific. What can I group together, and what should I separate?

So many questions and no definitive answers.

I considered several categorizations, starting over and changing them repeatedly. It took a long time to find something that fit.

I tried everything from using Tarot cards as categories to extremely generic and superficial groupings, but I never got far.

I consulted AI and discussed what made sense over an extended period. I also tried using NLP tools to automate the task entirely, but nothing worked or felt right.

Then I decided to break down the tags and analyze the frequency of words within the tags, hoping this would provide more insights.
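To illustrate the idea of counting words inside tags, here is a minimal Python sketch. It is not the exact workflow used for this project; the function name `word_frequencies` and the sample tags are made up for the example, and it assumes that splitting on anything other than letters, digits, and apostrophes is good enough.

```python
import re
from collections import Counter


def word_frequencies(tags):
    """Count how often each lowercase word appears across all tags."""
    counts = Counter()
    for tag in tags:
        # Assumption: a "word" is a run of letters, digits, or apostrophes.
        words = re.findall(r"[a-z0-9']+", tag.lower())
        counts.update(words)
    return counts


# Hypothetical sample tags, not taken from the real dataset.
sample_tags = [
    "Hurt/Comfort",
    "Emotional Hurt/Comfort",
    "Fluff and Angst",
    "Angst with a Happy Ending",
]

freq = word_frequencies(sample_tags)
print(freq.most_common(5))
```

Filler words such as "and" or "with" would show up in the counts too, so a real run would probably need a small stop-word list before the numbers become useful.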

If I thought manually entering 800 characters was a lot, I had another thing coming: one dataset alone had 7,000 additional tags.

So, I had to take a step back and think about what made sense. I decided to reduce my dataset, part of which I also had to fill out manually, and chose to include only tags that appeared at least twice.
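The "at least twice" rule is simple to express in code. The following is a sketch under assumptions: the helper `keep_repeated_tags` is hypothetical, and it treats tags that differ only in capitalization or surrounding spaces as the same tag, which may or may not match how the original spreadsheet counted them.

```python
from collections import Counter


def keep_repeated_tags(tags, min_count=2):
    """Return a {tag: count} dict of tags seen at least min_count times."""
    # Assumption: normalize case and whitespace before counting.
    counts = Counter(tag.strip().lower() for tag in tags)
    return {tag: n for tag, n in counts.items() if n >= min_count}


# Hypothetical sample, not taken from the real dataset.
raw_tags = ["Fluff", "fluff ", "Slow Burn", "Angst", "Angst", "Angst"]
repeated = keep_repeated_tags(raw_tags)
print(repeated)
```

Raising `min_count` shrinks the dataset further, at the cost of losing rarer but possibly interesting tags.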

Many tags were fandom- or character-specific, making categorization even harder. How do I categorize something like “BAMF! Peter Parker”? So, I decided to remove all fandom-specific tags, whether they mentioned a character or were unique to that fandom, from the categorization process.

Unfortunately, this also had to be done mostly manually. I did my best to search for and filter out every character name and term in the dataset that could belong to a fandom. Searching by term proved effective because it let me handle many rows at once. With a more streamlined dataset containing only additional tags without fandom-specific content, I applied the same principle and searched for words that were meaningful and occurred frequently enough to be useful for categorization.
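The search-and-filter step could look roughly like this in Python. It is only a sketch: the list `fandom_terms` stands in for a hand-built list of character names and fandom words, and the function names are invented for the example. It uses plain case-insensitive substring matching, which is an assumption and can over-match (a short term may also appear inside an unrelated word).

```python
def is_fandom_specific(tag, fandom_terms):
    """True if the tag contains any known fandom term (case-insensitive)."""
    lowered = tag.lower()
    return any(term.lower() in lowered for term in fandom_terms)


def split_tags(tags, fandom_terms):
    """Split tags into (generic, fandom_specific) lists, keeping order."""
    generic, specific = [], []
    for tag in tags:
        if is_fandom_specific(tag, fandom_terms):
            specific.append(tag)
        else:
            generic.append(tag)
    return generic, specific


# Hypothetical term list and tags for illustration.
fandom_terms = ["Peter Parker", "Tony Stark", "Avengers"]
tags = ["BAMF! Peter Parker", "Hurt/Comfort", "Avengers Tower", "Fluff"]

generic, specific = split_tags(tags, fandom_terms)
print(generic)
print(specific)
```

Keeping the removed tags in a separate list rather than deleting them makes it easy to check by eye whether a term filtered out too much.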

Just like with the characters, I started with the overall tags and compared them with the remaining datasets, so I had less categorization work with each subsequent dataset.
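Reusing earlier work across datasets amounts to a lookup: tags that already have a category get it applied, and only the rest still need attention. The sketch below assumes the existing categorization is available as a dictionary; the function name `apply_known_categories` and the category labels are invented for the example and are not the categories used in this project.

```python
def apply_known_categories(tags, known):
    """Apply existing categories; return (categorized, remaining)."""
    categorized = {}
    remaining = []
    for tag in tags:
        # Assumption: match on the lowercase, trimmed form of the tag.
        key = tag.strip().lower()
        if key in known:
            categorized[tag] = known[key]
        else:
            remaining.append(tag)
    return categorized, remaining


# Hypothetical categories from a previously processed dataset.
known = {"hurt/comfort": "emotional dynamics", "fluff": "tone"}
next_dataset = ["Fluff", "Slow Burn", "Hurt/Comfort"]

categorized, remaining = apply_known_categories(next_dataset, known)
print(categorized)
print(remaining)
```

After the leftover tags are categorized by hand, they can be added to `known`, so each new dataset leaves a shorter `remaining` list.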

With a more manageable dataset, I could again enlist the help of ChatGPT to categorize the remaining tags. One could debate for hours how meaningful these categorized data are, but I believe that even if it can’t be 100% accurate, it still provides a good overall picture of the fandom’s mood.

Through this process, I discovered a lot about myself, such as the fact that I can stare at an Excel sheet for 20 hours straight and still function. However, this process also brought me closer to understanding NLP, and I now have a basic grasp of how it works. Even though many things did not go smoothly in my case, I am glad I had this time and experience. It has prepared me better for any future encounters with such data.