In developing BigBigBook.com, I’m learning the hard way that there’s a reason every Karaoke book in every Karaoke dive bar you’ve ever been to is riddled with errors and is an utter mess… it’s because maintaining these books is REALLY hard. Way harder than I ever anticipated it being when I first set out to do this project. There are a number of challenges to overcome in order to make this database the best database of Karaoke songs on the planet, but it will get there!
Sheer Volume of Data – Each Venue typically has anywhere from 60,000 to 120,000 unique songs, but a lot of the popular songs are in their lists anywhere from 2-12 times, giving me sometimes 300,000 or more records to sift through for a single venue alone. Sometimes those songs are filed in ways that differ slightly in spelling, sometimes abbreviated, “fat finger” misspelled, or with extra tags or notes attached to the titles or artists. By the time I had 10 or so venues recorded, I had close to a million records in the database, with varying conventions for how the original creators had routinely filed or misfiled things. This is too much data for one human to sift through, so a lot of time has been spent building computer algorithms to do most of the work as automatically as possible… but still, the output of the computer algorithms needs to be examined and tested, and systems have to be put in place to redirect and steer the algorithms when things don’t fall in line perfectly. This is turning into a machine-learning/AI project, which I’ll mention later, but every ML/AI project needs a stellar dataset full of classifications, a.k.a. “correct answers”, from which to learn… so there’s lots of human work that goes into training the computer models.
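To make the duplicate problem concrete, here is a minimal Python sketch (not the actual BigBigBook.com code) of one common first step: reducing every artist and title to a crude matching key so that records differing only in case, accents, punctuation, or bracketed notes land in the same bucket. The function names and the sample records are made up for illustration, and true “fat finger” misspellings would still need fuzzy matching on top of this.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Reduce a title or artist string to a crude matching key."""
    # Split accented letters into base letter + accent, then drop the accents.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop bracketed tags such as "(Duet Version)" or "[Explicit]".
    text = re.sub(r"[\(\[].*?[\)\]]", " ", text)
    text = text.lower().replace("&", " and ")
    # Keep only letters, digits and spaces, then collapse runs of spaces.
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return " ".join(text.split())


def group_variants(records):
    """Bucket (artist, title) pairs that share the same normalized key."""
    groups = defaultdict(list)
    for artist, title in records:
        groups[(normalize(artist), normalize(title))].append((artist, title))
    return dict(groups)


# Hypothetical sample records, not taken from any real venue book.
sample = [
    ("Britney Spears", "Toxic"),
    ("BRITNEY SPEARS", "Toxic (Duet Version)"),
    ("Britney  Spears", "Toxic!"),
]
groups = group_variants(sample)
```

All three sample records end up under the single key ("britney spears", "toxic"), which is the kind of convergence a human reviewer can then confirm or override.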
Even the Artists themselves can’t decide their own names – Is it Puff Daddy? P. Diddy? Or whatever he goes by these days? Is it Kesha? Or Ke$ha? Is it “Smashing Pumpkins” or “The Smashing Pumpkins”?
Should I skip over insignificant words like “The” at the beginning of names? Seems like the obvious answer should be “yes”, right? Okay, but what about the Spanish equivalents, like “Los” or “Del”… or the French and German variants?
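One hedged way to picture the “skip the leading word” rule is a sort key that drops a known article before comparing names. This is only a sketch under assumptions: the article list below is a small illustrative sample rather than a complete or linguistically vetted one, and `keep_whole` is a hypothetical escape hatch for names where the leading word is really part of the name.

```python
# Illustrative sample only; a real list would need per-language review.
LEADING_ARTICLES = frozenset(
    {"the", "a", "an", "los", "las", "el", "la", "le", "les", "der", "die", "das"}
)


def sort_key(name, articles=LEADING_ARTICLES, keep_whole=frozenset()):
    """Return a lowercase key that ignores one leading article, if present."""
    words = name.split()
    # Only strip when something is left over and the name is not protected.
    if name not in keep_whole and len(words) > 1 and words[0].lower() in articles:
        words = words[1:]
    return " ".join(words).casefold()


artists = ["The Smashing Pumpkins", "Los Lobos", "Prince"]
ordered = sorted(artists, key=sort_key)
# ordered == ["Los Lobos", "Prince", "The Smashing Pumpkins"]
```

The displayed name stays untouched; only the ordering changes, so “The Smashing Pumpkins” still reads naturally while sorting under “S”.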
It was a mistake to tackle sorting by surname – I always thought it was annoying that every music service filed “Christina Aguilera” under “C” rather than “A”… so I thought to myself as I was setting sail on this project that I would be the first person in the history of the internet to actually file the artists properly… oh boy am I regretting this!! Why? For starters, it can be really, really difficult to determine what a surname actually is, and no computer algorithm can just simply figure it out. We think of the obvious ones, like “Michael Jackson” should be “Jackson, Michael”… sure, that sounds easy. But what about the aforementioned “Puff Daddy”? Well… certainly “Puff Daddy” is a nickname, a stage name, and there’s no way a reasonable person would want it filed “Daddy, Puff”. Determining when a name is a surname or a nickname can feel like a completely arbitrary task at times, and there’s no official database out there that I can rely upon to make that determination. Furthermore, non-English names traditionally have their own rules for sorting (I think… but I’m too busy to figure it out right now).
Is it a band name, or an artist name? One of my favorite bands from Minneapolis was named “Walt Mink”… and Walt Mink was always filed under “W”. They thought it would be funny to name their band after their favorite college professor by the same name, but since it was a band name, the record stores were always instructed to file them under “W” and not “M”. Go figure. This gets even more complicated when dealing with artists such as “Tom Petty and the Heartbreakers”. My seed databases might have songs filed just under “Petty, Tom”, or “Tom Petty”… and even more complicated is that Tom Petty actually released solo albums not crediting “The Heartbreakers” in the title (even though it was still basically the same band). Most people don’t even realize that “Free Fallin'” is a Tom Petty song and not a Heartbreakers song… which means that my seed databases also misfiled a lot of Tom Petty songs (technically). So what do I do then? Is Tom Petty a different artist than “Tom Petty and the Heartbreakers”? Or is it “Petty, Tom and the Heartbreakers”? Or “Petty Tom and Heartbreakers, The”? All of these filings are present in my source databases. Should I set “Heartbreakers, The” as their own artist record and file everything as essentially “Petty, Tom” featuring “Heartbreakers, The”? For the record, this is what I actually ended up deciding to do, as my database aims for convergence, not divergence, of records. I did this not because of Tom Petty, but because of a different artist, Prince. Prince has bands, “The Revolution” and “The New Power Generation”, backing him up, but most people just think of “Prince” as “Prince”… even if he released a bunch of albums as “the artist formerly known as Prince” *facepalm*. In BigBigBook.com you should be able to find all Prince songs grouped together, along with credits for his various bands.
What happens when an artist uses their front-man as part of the band name? *double facepalm* For example, “The J. Geils Band” and “Dave Matthews Band”? You’d think it would be convenient to have all the Dave Matthews songs grouped together with his solo and featured stuff, so maybe you’d be inclined to file him as “Matthews, Dave Band”… but now that just looks weird. Artists who have their “bands” are also featured in collaborations with other artists or as solo artists, so filing “Dave Matthews Band” under “D” would separate it alphabetically from anything featuring Dave Matthews filed under “M”. Then there are the even weirder ones… like “Andre Kostelanetz and His Orchestra”… file under A? K? God, I dunno. Is it “Folds, Ben” or “Ben Folds Five” or “The Ben Folds Five” or “Folds, Ben Five”?
Tom Petty and Stevie Nicks did some stuff together, and nobody seems to agree if Tom Petty or Stevie Nicks should be filed as the primary artist for “Stop Draggin’ my Heart Around”… or is it “Stop Dragging My Heart Around”?
What to do about collaborations in general? One of the latest challenges I’ve been working through is how collaborations are handled. The collaborations get really out of hand sometimes: “Akon featuring Pink, Mya, Eminem etc. etc. etc.” The seed databases sometimes omit some or all of the secondary featured artists, and sometimes cannot agree on who the primary artist is! Is “Uptown Funk” a Mark Ronson or Bruno Mars song? Some collaborations involve 4 or 5 artists, and the submissions I’m working through are inconsistent as to how each individual artist should be listed and whether they are separated by commas (,), ampersands (&), or just spaces. The individual artist names are further complicated by the surname sorting of each individual artist. This is probably the biggest mess I have to deal with, and the last major hurdle to getting this database squared away. I have literally spent weeks trying to come up with algorithms to pick these artists apart, so if you’re not noticing any fancy new bubbly graphics… the reason is that I’ve been lost in the details of the song lists themselves… working on them every day.
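As a rough illustration of picking a credit string apart (a sketch under assumptions, not the algorithm actually used on the site), the Python below first splits off the featured artists at “featuring”/“feat.”/“ft.”, then splits each side on commas, ampersands, and the word “and”. Because “and” is also part of many band names, a hypothetical `known` set of protected names is checked before any splitting. Space-only separators and surname-first entries like “Petty, Tom” are deliberately not handled here.

```python
import re

# "featuring", "feat." or "ft." surrounded by whitespace marks the featured part.
FEATURE_SPLIT = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+", re.IGNORECASE)
# Commas, ampersands and the standalone word "and" separate individual names.
LIST_SPLIT = re.compile(r"\s*(?:,|&|\band\b)\s*", re.IGNORECASE)


def split_names(chunk, known):
    """Split one side of a credit into names, unless it is a protected name."""
    chunk = chunk.strip()
    if chunk.casefold() in known:
        return [chunk]
    return [part for part in LIST_SPLIT.split(chunk) if part]


def parse_credit(credit, known=frozenset()):
    """Return (primary_artists, featured_artists) for a raw credit string."""
    known = {name.casefold() for name in known}
    parts = FEATURE_SPLIT.split(credit.strip(), maxsplit=1)
    primary = split_names(parts[0], known)
    featured = split_names(parts[1], known) if len(parts) > 1 else []
    return primary, featured


# parse_credit("Akon featuring Pink, Mya & Eminem")
#   -> (["Akon"], ["Pink", "Mya", "Eminem"])
# parse_credit("Tom Petty and the Heartbreakers",
#              known={"Tom Petty and the Heartbreakers"})
#   -> (["Tom Petty and the Heartbreakers"], [])
```

Without the protected-name set, “Tom Petty and the Heartbreakers” would be split into “Tom Petty” and “the Heartbreakers”, which is exactly the kind of case where a curated list has to steer the algorithm.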
Enough of the complaints, what’s the good news?
Well, the good news is that I have already created the largest and most consistent seed database in the entire music industry, one that features not only Karaoke songs but general releases. It is seeded by the Karaoke books from 20 different venues plus 5 or 6 open-source datasets pulled from the internet community. Not a single one of those data sources individually was anywhere close to complete, let alone accurate, so I curated and corrected the data with a list of 85,008 corrections and counting. Even if not all the records are currently perfect, all the records in the database are assigned various metrics of “quality”… which will feed an AI-powered algorithm, currently in development. As the database continues to improve, it should be powerful enough to look at any unstructured data and tell you whether a particular song is represented or not. I should be able to point it at a jumble of words and ask it if this list contains “Toxic” by “Britney Spears”, for example. As the database improves, and as I grow BigBigBook.com into a site that covers hundreds or even thousands of Karaoke venues, these new algorithms will be used to compare, contrast, and celebrate karaoke in communities all over the world!
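The “does this jumble contain Toxic by Britney Spears?” question can be approximated, very roughly, with nothing more than the Python standard library. The sketch below is an assumption-laden stand-in for the AI-powered algorithm mentioned above, not a description of it: it breaks each line into words and asks whether every word of the artist and title has a close match on that line, using `difflib.get_close_matches` to tolerate small misspellings. The function names and the 0.8 cutoff are arbitrary choices for illustration.

```python
import re
from difflib import get_close_matches


def tokens(text):
    """Lowercase a string and return its runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def line_mentions(line, artist, title, cutoff=0.8):
    """True if every artist/title word has a close match among the line's words."""
    have = tokens(line)
    wanted = tokens(artist) + tokens(title)
    return bool(wanted) and all(
        get_close_matches(word, have, n=1, cutoff=cutoff) for word in wanted
    )


def contains_song(lines, artist, title, cutoff=0.8):
    """True if any line of unstructured text appears to mention the song."""
    return any(line_mentions(line, artist, title, cutoff) for line in lines)


# Hypothetical jumble; note the misspelled first name and surname-first order.
jumble = ["Spears, Britny - Toxic (karaoke version)", "Prince - Kiss"]
found = contains_song(jumble, "Britney Spears", "Toxic")
```

Word order is ignored on purpose, so “Spears, Britny” still counts, but short or very common words can produce false positives, which is one reason per-record quality metrics and a well-labeled dataset still matter.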
I second the amount of work you’ve put into this. Can machine learning help in identifying some artists’ naming patterns?
Machine learning could definitely assist in identifying patterns and making data sorting more efficient!
Machine learning, especially supervised learning approaches, could indeed be useful in discerning artists’ naming conventions. It would require a training dataset though.
Yes, spot on about needing a training dataset. This could potentially resolve a good chunk of white noise in artist name variants.
Absolutely! Machine learning can use classification to identify patterns in artist names. Would it be efficient to use a subset of data to ‘train’ the algorithm?
Absolutely, Hannah! Training the algorithm with a well-curated subset would help in enhancing its predicting efficiency.
I agree with your query. Developing a subset of data to ‘train’ the algorithm sounds like an effective strategy. It’ll improve precision.
That’s a top-notch approach, Emily. Thinking along those lines, wouldn’t additional metadata improve precision too?
I absolutely concur. Extra metadata could provide important context, enhancing the algorithm’s functionality. It’s a fascinating concept.
Totally, dude. Context is king, especially with those tricky group names!
Right? The nuance of band vs. individual artist names is a real brain teaser.
Ya got that right! Lol, context could avoid the “Petty, Tom Band” debacle.
“Petty, Tom Band” – the utter pinnacle of music sorting logic. Lol.
I know, right? Poor Tom probably never saw it coming. Lol.
Spot on. That metadata could clarify a ton for those outlier cases. Algorithms need it.
True, context is key. Handles outliers like solo vs band releases!
Yep, context separates “Petty, Tom” from “Tom Petty and the Heartbreakers”. Important stuff.
You’re bang on, mate. Having context is like the secret sauce for database accuracy.
Secret sauce! Love that analogy, Liam!
Exactly! A well-labeled training dataset would make a world of difference.
Forget training, the real struggle is ambiguity.
True, ambiguity adds complexity. Solutions?
Humans struggle too.