As some of you know, I’ve been doing semantic analyses of wine reviews we receive online. Mostly, I’ve used this data to make silly computer-generated wine reviews. But today I’m going to use the data to talk a bit about word clouds and word frequency.
Robert Parker’s most used words
Robert Parker is one of the most influential wine critics on earth, and he popularized a 100-point rating scale that dominates the US wine market. An American named Tom Wark did some data gathering about Robert Parker’s perfect-scoring wines. Basically, he looked at the 224 wines that had received a perfect score of 100 from Robert Parker.
Wark published the list of words that appear the most in tasting notes for 100 point wines. This should give us some insight into what sort of characteristics appear in wines that Parker thought of as perfect.
Words like “Elegan” or “Intens” are cut off that way because Wark grouped intense, intensely, intensity, and other nearly identical words into one word group labeled simply “Intens”. Fair enough!
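For the curious, here is a rough sketch of how that kind of grouping could work. I don't know how Wark actually did it, so the stem list and the sample words below are my own assumptions, purely for illustration.

```python
from collections import Counter

# Hypothetical stems: illustrative prefixes, not Wark's actual list.
STEMS = ["intens", "elegan", "rich"]

def group_by_stem(words, stems=STEMS):
    """Count each word under the first stem it starts with, else under itself."""
    counts = Counter()
    for word in words:
        lower = word.lower()
        label = next((s for s in stems if lower.startswith(s)), lower)
        counts[label] += 1
    return counts

# Made-up sample words, not taken from real tasting notes.
notes = ["Intense", "intensely", "intensity", "elegant",
         "elegance", "rich", "richness", "spicy"]
grouped = group_by_stem(notes)
print(grouped.most_common())
```

A real stemmer would be smarter than matching prefixes, but this is enough to see why “Intens” ends up as one big entry.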
What we get is that Parker uses the word rich a ton when he tastes a wine that merits 100 points out of 100. Intensity, concentration and spiciness also come up a lot. Minerality, massiveness, balance, complexity and length are also in there.
I think this is a really fun idea. Because I’m a data nerd.
Customer comments – Tastes Like Wine
So Parker often describes “perfect” wines as rich, intense and concentrated. What words do my customers use most?
Yes, rather hilariously, the most used words are Taste Like Wine. Not together mind you.
So I did an analysis of customer comments regarding Trah Lah Lah 2008 on Naked Wines, an online wine retailer that represents and promotes us in the UK. The word cloud above is a graphical representation of the words used most frequently in reviews, and the most common words appear in a larger font size. I generated the word cloud using Wordle, although I did move some of the words around in a graphics program later on to emphasize the tastes like wine joke. But the size of the words is accurate! I just moved them to the top of the cloud. Wordle also automatically removes definite articles, personal pronouns, possessive adjectives and certain other words that are more about syntax than meaning.
Now, there is a huge difference between what Naked Wines customers say about Trah Lah Lah 2008 and what Robert Parker says about wines he rates as 100 points, mainly because very few of the comments wine drinkers left on Naked are in “tasting note” form. Instead of striving for journalistic, objective tasting notes about richness or spice, people tend to write about their whole wine experience. It seems pretty normal that the most used words include “taste”, “like” and “wine”. 😀 Personal pronouns and possessive adjectives (I, me, our, its) appear much more frequently.
Here is a list of the words that got used most (I think I might have taken out all the definite articles and certain words that only serve syntax) and the number of times that word appeared.
- I 94
- wine 52
- not 32
- really 23
- bottle 22
- we 21
- again 20
- you 20
- good 19
- very 18
- like 18
- some 18
- my 18
- taste 18
- buy 15
- french 15
- red 14
- more 14
- me 13
- just 13
- if 13
- well 13
- quite 12
- one 12
- first 12
- bit 10
- better 10
- too 10
- all 10
- wines 10
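If you want to build a list like this yourself, here is a minimal sketch. It is not the exact process I used: the stop list and the two sample comments are made up for the example, and it lowercases everything, so “I” shows up as “i”.

```python
import re
from collections import Counter

# Assumed stop list: a guess at the kind of syntax-only words that get dropped.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "it", "is", "this"}

def word_counts(comments, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all comments."""
    counts = Counter()
    for comment in comments:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts

# Made-up comments, not real customer reviews.
comments = [
    "I really like this wine.",
    "Not bad, I will buy the wine again.",
]
counts = word_counts(comments)
for word, n in counts.most_common(5):
    print(f"- {word} {n}")
```

The font size in a word cloud is then just a function of these counts.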
Is there a meaningful difference between Parker 100 tasting notes and Naked Wines customer comments?
So there is a huge difference in which words appear the most. But is this a meaningful difference? Well, for the most part, this is not a good comparison. But it is a very fun comparison and it inspires certain ideas.
For one thing, why are tasting notes built the way they are? Why do wine critics try to objectively describe flavors and odors in wines?
When they do try to refer to the overall experience of the wine, why does their vocabulary focus on richness, depth, complexity and so on? Wine drinkers don’t think this way (at least not according to this small sample from Naked Wines customer reviews of Trah Lah Lah 2008).
Again, this isn’t really a fair comparison because tasting notes aren’t the same as customer comments. Tasting notes are specifically built to describe the experience of a wine. Customer comments can be anything. They can be about an overall experience, they can be about a specific pairing the person tried, they can be simpler statements (e.g. I liked it, I didn’t like it), they can be congratulatory or simply grateful (e.g. Thanks!, Good job, guys!). This means that customer reviews won’t limit themselves to a particular vocabulary like tasting note jargon.
Now, even if we limit the analysis of customer comments to only the descriptive words (like rich, intense, etc.) we get a list that’s pretty far from Parker’s. The most common are Really, Very, Good. 😀 Of course the statistics can be a bit misleading since Not is even more common than those! The first descriptive words that appear on the list which might be described as more precise are “French” and “Red”. 😀
Also, I’m only using the 100 point scores from Parker but I’m using all comments for my Trah Lah Lah 2008 on Naked Wines. One might argue that the reason Trah Lah Lah comments don’t have the word rich is because the wine is not 100 points. So I will admit right here and now that this is bad science. This is not a perfect comparison. However, it still illustrates my notion that wine critics use a vocabulary that is actually somewhat foreign to the average wine drinker.
You can also argue that wine drinkers lack the refinement or courage to say things like “intense and deep” while it’s very easy to say “tastes like good wine”. But I think that’s my point. Regular wine drinkers don’t necessarily understand or relate to tasting notes like “unctuous”. Maybe wine communication should use vocabulary more familiar to wine drinkers. How would most drinkers react if the back of a bottle said “This is a French red wine and it tastes good and could use some food”?
Apology and shaking my fist at Stephen Colbert
I was going to post these word clouds later with a lot more analysis of Parker’s reviews. I would also like to do word clouds of Parker’s editorial content (instead of straight up tasting notes) and even do some for other critics and journalists. But Stephen Colbert recently beat me to the punch and I hate it when Stephen Colbert steals my ideas!!! 😀
I promise to talk about all of this in more depth and with more rigor if I get chosen to present at SXSW in Austin next year. The talk I suggested is about data analysis, reinterpretation, visual representation, infographics, and all sorts of other stuff that might help people in non-verbal jobs like wine communicate with the rest of the world online.
Do you ever get the feeling that wine critics are making up words or inventing fruit you’ve never heard of to describe wine? This morning, I took a sidestep in my computer-generated wine reviews project. Instead of generating whole reviews, I am now generating new words to describe wines. Here is a list of words that the computer generated to describe O’Vineyards Trah Lah Lah 2008. The hope is that they all sound vaguely real.
List of Computer Generated Wine Terms
To spice things up, today I’m highlighting computer-generated words rather than whole reviews. This means the n-gram analysis focuses on letter pairs and letter triplets instead of word pairs and word triplets. If you have no idea what I’m talking about, refer to the simplified explanation in my first post about computer-generated reviews. Basically, the computer looks at which letters commonly appear together and makes up words by picking each new letter according to how often it follows the previous letters in the real reviews.
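Here is a minimal sketch of the letter-level idea, assuming a simple lookup table of “these two letters were followed by that letter”. This is my own toy version, not Gibberizer's code, and the function names and sample words are hypothetical.

```python
import random
from collections import defaultdict

def build_model(words, order=2):
    """Map each run of `order` letters to the letters seen right after it."""
    model = defaultdict(list)
    for word in words:
        padded = "^" * order + word.lower() + "$"  # ^ marks start, $ marks end
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def make_word(model, order=2, rng=random, max_len=20):
    """Walk the table from the start marker until an end marker comes up."""
    state, letters = "^" * order, []
    while len(letters) < max_len:
        nxt = rng.choice(model[state])  # common followers are picked more often
        if nxt == "$":
            break
        letters.append(nxt)
        state = state[1:] + nxt
    return "".join(letters)

# Made-up sample vocabulary, not the real review text.
sample = ["different", "french", "tremendous", "extremely", "cherry", "territory"]
model = build_model(sample)
rng = random.Random(2008)
print([make_word(model, rng=rng) for _ in range(5)])
```

A longer run of letters (a higher `order`) stays closer to real words, while a shorter one drifts into gibberish faster.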
The list starts with words that strictly follow the analysis (high similarity to actual letter pairs in real reviews of Trah Lah Lah 2008) and it slowly descends into the bowels of vaguely human-sounding language (low similarity to actual letter pairs). All capitalization and punctuation was generated by the algorithm.
Perhaps of special interest, the computer generated the word “commend” even though that never appeared in the reviews. It also got a couple of real French words like “vraiment” and “cours”.
I definitely want to add some of these automatically generated words to my wine vocabulary. I wonder how long it will be before somebody calls me out for using made-up words like vinegativity, mell and bood.
This wine is quite differench. Extremendously bracked attack. Midpalate is dominated by gravinter with some notes of refunky vinegativity. Mell with a measive finish that reminds me of cracket cherritory.
I’m still tweaking the parameters for my computer-generated wine reviews.
Some computer-generated reviews:
“Delicious, deep flavours.”
While this is in no way funny, it’s sort of spectacular. Nobody actually used this exact phrase in the wine reviews. But somebody said “Delicious, deep and dusty. It should cost more.” And somebody else said “Rich deep flavours and a long finish.” And the computer sussed out that it could say “Delicious, deep flavours.” It even got the punctuation and capitalization correct. It’s fun to focus on the zaniest reviews the computer generates. But some of these boring ones are actually much more impressive.
“really, really solid quaffing red. It tastes True again. Nice wines. Thanks again. Good effort”
I like this one for all the reasons mentioned above. The simple parts are remarkably accurate. And the note that a wine tastes True again is amazing. You could actually get away with saying that in a review. Although I think if I had a greater respect for line breaks, there would have been a big gap between tastes and True. I’ll look into that.
“The 2008 Trah Lah Lah Lah Lah Lah Lah. No, sorry.”
Lest you think the computer only generates positive reviews of my wine… 😀 Aside from being a hilariously curt negative review, this also demonstrates one of the most amazing things about recursive analysis. My wine is called Trah Lah Lah. So the computer has about a 50/50 chance of saying the whole brand name any time it decides to say Trah. Trah is always followed by Lah. And Lah is followed by Lah about half the time. And by a period or another word about half the time. So you see a lot of Trah Lah and a lot of Trah Lah Lah in the generated reviews. But occasionally, you get lucky and the computer just strings together a ton of Lahs. If I were using trigrams, this probably wouldn’t happen as often. But for now, here we are. And actually, in this particular negative review, it sounds like they’re making fun of the name of the wine, so it’s perfect!
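To put rough numbers on that, here is a small sketch of the word-pair chain described above. It assumes exactly the odds I just gave (Trah is always followed by Lah, and Lah is followed by another Lah half the time); the real odds in the data are only approximately that, and the function names are made up.

```python
import random

def say_name(rng, p_repeat=0.5):
    """Start with 'Trah Lah', then keep adding 'Lah' while the coin says so."""
    words = ["Trah", "Lah"]  # Trah is always followed by Lah
    while rng.random() < p_repeat:
        words.append("Lah")  # Lah follows Lah with probability p_repeat
    return " ".join(words)

def chance_of_at_least(n_lahs, p_repeat=0.5):
    """Chance that the name contains at least n_lahs Lahs (n_lahs >= 1)."""
    return p_repeat ** (n_lahs - 1)

rng = random.Random(2008)
print([say_name(rng) for _ in range(5)])
print(chance_of_at_least(6))
```

Under those assumed odds, a run of six or more Lahs like the one in the review comes up about once in 32 times the computer says Trah, which is rare but not that rare.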
“Gorgeous fruity New World Wines, with their ‘old fashioned’ flavour”
Program I used
I’m using Gibberizer for now. I might write something on my own later, but for now it’s all thanks to this beauty: http://code.google.com/p/gibberizer/
The settings are
- Read input as: Lines
- BatchSize: 1
- Similarity: 7
- Persistence: 5
- Disallow input echo
- Disallow duplicates
What changed since the last post?
If you read the last post on this subject, you’ll probably notice that these reviews are much more sensical. So what’s different?
First and foremost, I changed the data input. Instead of feeding the last 100 comments I received on Naked Wines, I submitted only tasting notes for the 2008 Trah Lah Lah. That’s 113 reviews. They tend to be a little shorter than comments, so the data file is about the same length, but all the language is about drinking wine. This means that the computer generates fewer comments about technical aspects of the website like the MarketPlace and the vineshare program we’re running.
Don’t get me wrong. I’d love to generate comments of that nature too. But I just need way more data for that to work. Tasting notes are easier because even the real ones sound a bit like gibberish… and people often get so drunk while tasting the wine that the reviews tend to be a bit slurred by the end.
I should also mention that lots of the reviews are still total gibberish. For example:
“A bit tannins as well. As a Rhode Islander to breathing” for a good 🙂 will buy again.”
Work in progress!