I'm not a huge fan of Hacker News[1]. My impression continues to be that it ends up promoting stories that align with the Silicon Valley narrative (meritocracy, technology will fix everything, regulation is the cancer killing agile startups) and discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves.
But as a good data-driven person[2], wouldn't it be nice to have numbers rather than just handwaving? In the absence of a good public dataset, I scraped Hacker Slide to get just over two months of data in the form of hourly snapshots of stories, their age, their score and their position. I then applied a trivial test:
- If the story is younger than any other story
- and the story has a higher score than that other story
- and the story has a worse ranking than that other story
- and at least one of these two stories is on the front page
then the story is probably being penalised.
(note: "penalised" can have several meanings. It may be due to explicit flagging, or it may be due to an automated system deciding that the story is controversial or appears to be supported by a voting ring. There may be other reasons. I haven't attempted to separate them, because for my purposes it doesn't matter. The algorithm is discussed here.)
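The test above can be sketched in a few lines of Python. This is an illustration rather than the script actually used for the post: the `Story` fields, the `is_penalised` name and the front page size of 30 are all assumptions, and a snapshot is taken to be the list of stories seen in one hourly scrape.

```python
from dataclasses import dataclass

FRONT_PAGE_SIZE = 30  # assumption: ranks 1..30 count as the front page


@dataclass(frozen=True)
class Story:
    title: str
    age_hours: float  # time since submission
    score: int        # points at the time of the snapshot
    rank: int         # position in the listing, 1 = top


def is_penalised(story, snapshot, front_page_size=FRONT_PAGE_SIZE):
    """Return True if some other story in the snapshot is older and
    lower-scored than `story` yet ranked above it, with at least one
    of the pair on the front page."""
    for other in snapshot:
        if other is story:
            continue
        if (story.age_hours < other.age_hours
                and story.score > other.score
                and story.rank > other.rank
                and min(story.rank, other.rank) <= front_page_size):
            return True
    return False
```

Comparing every story against every other story in a snapshot is quadratic, which should be fine for hourly snapshots of a few hundred stories.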
Now, ideally I'd classify my dataset based on manual analysis and classification of stories, but I'm lazy (see [2]) and so just tried some keyword analysis:
| Keyword | Penalised | Unpenalised |
| --- | --- | --- |
| Women | 13 | 4 |
| Harass | 2 | 0 |
| Female | 5 | 1 |
| Intel | 2 | 3 |
| x86 | 3 | 4 |
| ARM | 3 | 4 |
| Airplane | 1 | 2 |
| Startup | 46 | 26 |
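Counts like these could be produced along the following lines. This is only a sketch: the post doesn't say how titles were matched, so the case-insensitive substring match is an assumption, and `keyword_counts` and its arguments are hypothetical names.

```python
def keyword_counts(keywords, penalised_titles, unpenalised_titles):
    """Return {keyword: (penalised_count, unpenalised_count)}."""

    def count(keyword, titles):
        # Assumption: a title matches if it contains the keyword anywhere,
        # ignoring case.
        needle = keyword.lower()
        return sum(1 for title in titles if needle in title.lower())

    return {
        keyword: (count(keyword, penalised_titles),
                  count(keyword, unpenalised_titles))
        for keyword in keywords
    }
```

Substring matching lets "Harass" catch "harassment", but it also lets "ARM" match "farm", so short keywords probably want word-boundary matching instead.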
A few things to note:
- Lots of stories are penalised. Of the front page stories in my dataset, I count 3240 stories that have some kind of penalty applied, against 2848 that don't. The default seems to be that some kind of detection will kick in.
- Stories containing keywords that suggest they refer to issues around social justice appear more likely to be penalised than stories that refer to technical matters
- There are other topics that are also disproportionately likely to be penalised. That's interesting, but not really relevant - I'm not necessarily arguing that social issues are penalised out of an active desire to make them go away, merely that the existing ranking system tends to result in it happening anyway.
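To compare the keyword rows against the overall rate, the counts can be turned into fractions. A minimal sketch using only the numbers quoted above; the variable names are mine, and with samples this small the fractions are indicative at best.

```python
# Front page stories in the dataset, as quoted in the post.
TOTAL_PENALISED = 3240
TOTAL_UNPENALISED = 2848

# (penalised, unpenalised) counts per keyword, copied from the table.
KEYWORD_COUNTS = {
    "Women": (13, 4),
    "Harass": (2, 0),
    "Female": (5, 1),
    "Intel": (2, 3),
    "x86": (3, 4),
    "ARM": (3, 4),
    "Airplane": (1, 2),
    "Startup": (46, 26),
}


def penalised_fraction(penalised, unpenalised):
    """Share of stories that were penalised."""
    return penalised / (penalised + unpenalised)


baseline = penalised_fraction(TOTAL_PENALISED, TOTAL_UNPENALISED)
fractions = {
    keyword: penalised_fraction(p, u)
    for keyword, (p, u) in KEYWORD_COUNTS.items()
}

print(f"{'baseline':10s} {baseline:.2f}")
for keyword, fraction in fractions.items():
    p, u = KEYWORD_COUNTS[keyword]
    marker = "above" if fraction > baseline else "below"
    print(f"{keyword:10s} {fraction:.2f} ({marker} baseline, n={p + u})")
```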
This clearly isn't an especially rigorous analysis, and in future I hope to do a better job. But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues. An interesting next step would be to attempt to infer whether the reasons for the penalties are similar between different categories of penalised stories[3], but I'm not sure how practical that is with the publicly available data.
(Raw data is here, penalised stories are here, unpenalised stories are here)
[1] Moving to San Francisco has resulted in it making more sense, but really that just makes me even more depressed.
[2] Ha ha like fuck my PhD's in biology
[3] Perhaps stories about startups tend to get penalised because of voting ring detection from people trying to promote their startup, while stories about social issues tend to get penalised because of controversy detection?
Maybe better data
Date: 2014-10-30 03:23 pm (UTC)
You're doing god's work. You may find https://github.com/sytelus/HackerNewsData a better source of data.
Re: Maybe better data
Date: 2014-10-30 03:26 pm (UTC)
no subject
Date: 2014-10-30 03:24 pm (UTC)
no subject
Date: 2014-10-30 03:28 pm (UTC)
no subject
Date: 2014-10-30 04:00 pm (UTC)
You said: "But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues." However, the conclusions you're drawing seem to suggest that your innate prejudice is that such stories are *disproportionately* penalized compared to other types of off-topic stories. I'd suggest the alternate hypothesis that there's a rough set of on-topic topics that tend to make the cut, and most other things get penalized.
Based on analysis of the data you posted, I don't see anything obvious to support your conclusion that social justice issues are drastically *more* likely to get flagged than other off-topic stories. I'd certainly agree that they get flagged right along with piles of other off-topic stories, and apparently a very large fraction of *on-topic* stories.
In other words: HN's algorithm tends to not show its readers stories they don't want to see, and tends to show its readers stories they do want to see. You could reasonably argue about how that creates an echo chamber; see also DuckDuckGo's various discussions of "filter bubbles".
In short, many hackers are not necessarily interested in hearing about "discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves" in every source of information that they consume. Personally, I don't watch the TV news for vaguely similar reasons. I simply don't have the spoons to deal with it 100% of the time.
no subject
Date: 2014-10-30 04:06 pm (UTC)
no subject
Date: 2014-10-30 04:26 pm (UTC)
What I'm trying to ask, with my previous comment, is: are you opposed in general to user-curated news sites, or more generally to any algorithm designed to show people the subset of information that they want to see?
Yes, people could often do with significantly more exposure to things that make them uncomfortable, and to social justice topics in particular. We need many more people working to fix such issues, and awareness of such issues helps in fixing them. However, people don't tend to *want* to be made uncomfortable, and I don't think it's reasonable to expect that every news site, social media site, or other source of information should do so.
no subject
Date: 2014-10-30 04:32 pm (UTC)
no subject
Date: 2014-10-30 05:01 pm (UTC)
Short of completely manual curation by a small group of people (selected for a range of view points), which can lead to its own forms of bias in addition to scaling poorly, it's not at all obvious how you could provide curation and filtration *without* that property.
"content that challenges the existing narrative" includes the types of social justice stories you're pushing for, but it also includes things like anti-FOSS screeds, pro-software-patent stories, or the latest on TempleOS from its...interesting...author. I'd expect all three of those to be both upvoted and flagged on HN too, even though all three of them are technology-related. (And I don't want to read any of those three.)
For that matter, one common failing of news sites (common on sites that intentionally try to show everything that people find interesting without any equivalent of "flagging"; also common in TV news): automatically giving all viewpoints equal time. To use social justice as an example: whenever people post hateful or anti-social-justice content on HN (which does happen, both as stories and especially as comments), it tends to get flagged off the site incredibly fast, and often the poster ends up banned. I've certainly had rather positive results flagging such content myself.
It isn't the job of *every* news site to show a selection of stories that intentionally includes every viewpoint. Some news sites are specifically designed to show a subset of stories. News sites that do a worse job of showing people what they want to read get replaced by sites that do a better job of showing people what they want to read. The set of people who actively want to be shown stories that challenge them visit other news sites for that content, and I would *hope* that many people get their news from multiple sites. I certainly do.
Out of curiosity, what news sites would you suggest that provide a primarily technology focus but include more of the kind of content you want to read? (I don't mean sites whose specific purpose is to include such content, but rather, sites that include high-quality examples of such content alongside various other high-quality content.) I'm always interested in finding better news sites.
no subject
Date: 2014-10-30 05:18 pm (UTC)
no subject
Date: 2014-10-30 04:06 pm (UTC)
Preprocessing:
wget http://www.codon.org.uk/~mjg59/hn_data/{flagged,unflagged}
sed 's/ (https\?:.*)$//g;s/ (item?id.*)$//g' flagged > flagged-nourl
sed 's/ (https\?:.*)$//g;s/ (item?id.*)$//g' unflagged > unflagged-nourl
grep -o '[A-Za-z]*' unflagged-nourl | tr A-Z a-z | sort | uniq -c | sort > unflagged-words
grep -o '[A-Za-z]*' flagged-nourl | tr A-Z a-z | sort | uniq -c | sort > flagged-words
# Note: Should probably filter out the top N English words
Python:
def parse(path):
    # Each line of `uniq -c` output is "<count> <word>"
    counts = {}
    with open(path) as f:
        for line in f:
            c, w = line.split()
            counts[w] = int(c)
    return counts

unflagged = parse("unflagged-words")
flagged = parse("flagged-words")
# Ratio of flagged to unflagged (Python 3 true division, so 11/2 gives 5.5):
ratios = {w: flagged[w] / unflagged[w] for w in set(flagged) & set(unflagged)}
# Sorted by ratio:
l = sorted((r, w) for (w, r) in ratios.items())
# Top 50 words with highest flagged/unflagged ratio:
tesla 15.0
workers 12.0
understanding 11.0
netflix 9.0
name 9.0
calls 9.0
natural 8.0
months 8.0
hour 8.0
top 7.0
sell 7.0
scaling 7.0
runs 7.0
products 7.0
market 7.0
links 7.0
ipad 7.0
chat 7.0
adds 7.0
vulnerability 6.0
streaming 6.0
stock 6.0
scale 6.0
rethinkdb 6.0
nyc 6.0
low 6.0
leak 6.0
her 6.0
emails 6.0
comcast 6.0
clojurescript 6.0
cash 6.0
list 5.5
vr 5.0
very 5.0
valuation 5.0
turned 5.0
spy 5.0
shut 5.0
september 5.0
runtime 5.0
recognition 5.0
quality 5.0
prison 5.0
path 5.0
opens 5.0
needs 5.0
nd 5.0
n 5.0
middle 5.0
# Top 50 words with lowest flagged/unflagged ratio:
reality 0.125
paper 0.142857142857
reverse 0.142857142857
simulator 0.142857142857
spreadsheet 0.142857142857
architecture 0.166666666667
crash 0.166666666667
extreme 0.166666666667
jpmorgan 0.166666666667
lock 0.166666666667
mozilla 0.166666666667
simply 0.166666666667
mit 0.181818181818
advertising 0.2
bring 0.2
doctors 0.2
earn 0.2
example 0.2
execution 0.2
forever 0.2
hit 0.2
ii 0.2
improve 0.2
manual 0.2
others 0.2
parallel 0.2
pointer 0.2
printed 0.2
scottish 0.2
taking 0.2
take 0.214285714286
aws 0.222222222222
ever 0.222222222222
face 0.230769230769
emacs 0.235294117647
airport 0.25
algorithm 0.25
anonymous 0.25
asked 0.25
boy 0.25
bugs 0.25
campaign 0.25
comparing 0.25
crowd 0.25
died 0.25
dsl 0.25
economic 0.25
effort 0.25
encrypted 0.25
error 0.25
# Top 50 words that only appear in flagged stories, by number of flagged stories:
generator 10
dns 9
bill 9
aims 8
table 6
storm 6
stealth 6
moto 6
mini 6
latest 6
customer 6
whisper 5
rd 5
pricing 5
plugin 5
physical 5
ipo 5
introduces 5
epic 5
ello 5
currency 5
cto 5
winners 4
watson 4
versioning 4
unlimited 4
trust 4
trouble 4
takedown 4
split 4
sized 4
side 4
session 4
seize 4
scotland 4
rule 4
returns 4
rethinking 4
restart 4
researcher 4
reports 4
policies 4
operation 4
officially 4
mark 4
managing 4
leaks 4
jquery 4
irc 4
handle 4
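The comment doesn't show how the three listings above were cut from the sorted data, so the following is a guess at it, written as a self-contained Python 3 sketch. The function name `top_lists` and the parameter `n` are hypothetical; the tie-breaking (by word, descending in the "highest" and "flagged only" lists) is inferred from the order of the output rather than stated anywhere.

```python
def top_lists(flagged, unflagged, n=50):
    """Split word counts into three ranked lists.

    `flagged` and `unflagged` map each word to the number of times it
    appears in flagged and unflagged story titles respectively.
    """
    shared = set(flagged) & set(unflagged)
    # (ratio, word) pairs in ascending order, ties broken by word.
    by_ratio = sorted((flagged[w] / unflagged[w], w) for w in shared)

    lowest = by_ratio[:n]
    highest = by_ratio[::-1][:n]

    # Words that never appear in unflagged titles have no finite ratio,
    # so rank them by how often they appear in flagged titles instead.
    only_flagged = sorted(
        ((flagged[w], w) for w in set(flagged) - set(unflagged)),
        reverse=True,
    )[:n]
    return highest, lowest, only_flagged
```

Words with very small counts dominate both ends of a ratio ranking, so requiring a minimum count before computing a ratio would probably give more meaningful lists.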
no subject
Date: 2014-10-30 04:33 pm (UTC)
Meritocracy
Date: 2014-10-30 07:54 pm (UTC)
It seems grounded in logic to me. The people who are objectively best at their job are the most qualified to make executive decisions pertaining to it. Therefore the best concrete results will be obtained by promoting those people. It seems very practical and inclusive to me, since everybody gets a fair chance at demonstrating their best work.
Re: Meritocracy
Date: 2014-10-30 08:08 pm (UTC)
Also see https://news.ycombinator.com/item?id=8534078 for another issue with focusing only on the highlights of an entire community.
Re: Meritocracy
Date: 2014-11-04 08:30 pm (UTC)
Spinning this forward: if we assume we all have different optimal conditions, whom would you pick and change conditions for so that person can shine, knowing that you may change conditions for others so that they can no longer shine, or not as much as before?
Seriously, the only answer is to try to improve conditions for everybody and see who shines most, which gives you option 1: meritocracy.
Re: Meritocracy
Date: 2014-10-31 02:16 am (UTC)
Those who have been identified to be best at their job are not necessarily the most qualified to make executive decisions pertaining to that. Bias, conscious or unconscious, is a trap that even the best of us can fall into.
If you want to get into some of the theory around meritocracy, Nature had an article earlier this year that attempts to model a true meritocracy. The online version is here.
It's also worth noting that the people who are most likely to believe in meritocracy are young, upper-class, white men. The people who are least likely to believe in meritocracy are older, lower-class, minorities. I couldn't find the full article online; its abstract is here.
Re: Meritocracy
Date: 2014-11-04 08:37 pm (UTC)
What do you suggest as a better alternative?
no subject
Date: 2014-10-30 08:04 pm (UTC)
no subject
Date: 2014-10-30 08:14 pm (UTC)
It also seems likely that such articles would trigger HN's flamewar detector, as well, which has the same effect as flagging.
no subject
Date: 2014-11-01 12:33 am (UTC)
This is the argument to moderation, a popular logical fallacy in the cable news industry. This is the problem with HN; it positions itself as "disrupting" things, as doing these radical improvements using technology, but in reality it is a gigantic performance art piece about capitulating to entrenched power structures.
'Hackers' don't have to be your audience. 'Hackers' are over
Date: 2014-11-04 10:13 pm (UTC)
Everyone: Please spread this extremely important message to all your contacts in the media. If many websites that carry hacker news publish articles on it during the same 24 hours victory is assured.
No? You don’t want to ‘be divisive?’ Who’s being divided, except for people who are okay with an infantilized cultural desert of shitty behavior and people who aren’t? What is there to ‘debate’?