I'm not a huge fan of Hacker News[1]. My impression continues to be that it ends up promoting stories that align with the Silicon Valley narrative (meritocracy, technology will fix everything, regulation is the cancer killing agile startups) while discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves.
But as a good data-driven person[2], wouldn't it be nice to have numbers rather than just handwaving? In the absence of a good public dataset, I scraped Hacker Slide to get just over two months of data in the form of hourly snapshots of stories, their age, their score and their position. I then applied a trivial test:
- If the story is younger than some other story
- and the story has a higher score than that other story
- and the story has a worse ranking than that other story
- and at least one of these two stories is on the front page

then I treat the story as penalised.
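The test above can be sketched in Python roughly as follows. This is a minimal sketch, not the code used for the post: the `Story` fields, the function name and the assumption that the front page is the top 30 positions are all mine.

```python
from dataclasses import dataclass

# Assumption: the front page is the top 30 positions of a snapshot.
FRONT_PAGE_SIZE = 30


@dataclass(frozen=True)
class Story:
    story_id: int      # hypothetical identifier
    age_hours: float   # age at snapshot time
    score: int         # points at snapshot time
    position: int      # 1 = top of the ranking


def is_penalised(story, snapshot, front_page_size=FRONT_PAGE_SIZE):
    """Return True if some other story in the same snapshot is older and
    lower-scored than `story` yet ranked above it, with at least one of
    the two on the front page."""
    for other in snapshot:
        if other.story_id == story.story_id:
            continue
        younger = story.age_hours < other.age_hours
        higher_score = story.score > other.score
        worse_rank = story.position > other.position
        on_front_page = min(story.position, other.position) <= front_page_size
        if younger and higher_score and worse_rank and on_front_page:
            return True
    return False


if __name__ == "__main__":
    a = Story(1, age_hours=2.0, score=100, position=5)
    b = Story(2, age_hours=5.0, score=50, position=1)
    print(is_penalised(a, [a, b]), is_penalised(b, [a, b]))
```

Each hourly snapshot would be checked on its own, since age, score and position are only comparable within one snapshot.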
(note: "penalised" can have several meanings. It may be due to explicit flagging, or it may be due to an automated system deciding that the story is controversial or appears to be supported by a voting ring. There may be other reasons. I haven't attempted to separate them, because for my purposes it doesn't matter. The algorithm is discussed here.)
Now, ideally I'd classify my dataset based on manual analysis and classification of stories, but I'm lazy (see [2]) and so just tried some keyword analysis:
| Keyword | Penalised | Unpenalised |
| --- | --- | --- |
| Women | 13 | 4 |
| Harass | 2 | 0 |
| Female | 5 | 1 |
| Intel | 2 | 3 |
| x86 | 3 | 4 |
| ARM | 3 | 4 |
| Airplane | 1 | 2 |
| Startup | 46 | 26 |
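The post doesn't say exactly how the keyword matching was done. A minimal sketch, assuming a case-insensitive substring match against story titles (so "Harass" would also match "harassment"); the function names and sample titles are hypothetical.

```python
def count_keyword(titles, keyword):
    """Count titles containing `keyword`, ignoring case (substring match)."""
    needle = keyword.lower()
    return sum(1 for title in titles if needle in title.lower())


def keyword_table(penalised_titles, unpenalised_titles, keywords):
    """Map each keyword to a (penalised count, unpenalised count) pair."""
    return {
        kw: (count_keyword(penalised_titles, kw),
             count_keyword(unpenalised_titles, kw))
        for kw in keywords
    }


if __name__ == "__main__":
    # Made-up titles, purely for illustration.
    penalised = ["Women in tech face harassment", "New ARM server chips"]
    unpenalised = ["Intel ships x86 update"]
    for kw, counts in keyword_table(penalised, unpenalised,
                                    ["Women", "Harass", "Intel", "ARM"]).items():
        print(kw, counts)
```

A plain substring match is crude: "ARM" also matches "alarm" and "Intel" matches "intelligence", so a more careful version would probably match on word boundaries.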
A few things to note:
- Lots of stories are penalised. Of the front page stories in my dataset, I count 3240 stories that have some kind of penalty applied, against 2848 that don't. The default seems to be that some kind of detection will kick in.
- Stories containing keywords that suggest they refer to issues around social justice appear more likely to be penalised than stories that refer to technical matters.
- There are other topics that are also disproportionately likely to be penalised. That's interesting, but not really relevant - I'm not necessarily arguing that social issues are penalised out of an active desire to make them go away, merely that the existing ranking system tends to result in it happening anyway.
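One way to read the keyword counts against the overall base rate (3240 penalised out of 6088 front page stories) is to compare each keyword's penalised share with that rate. The sketch below does that and adds a one-sided binomial tail as a rough sanity check. The helper names are mine, and treating each story as an independent draw at the base rate is an assumption the post does not make.

```python
from math import comb

PENALISED_TOTAL = 3240
UNPENALISED_TOTAL = 2848
BASE_RATE = PENALISED_TOTAL / (PENALISED_TOTAL + UNPENALISED_TOTAL)

# (penalised, unpenalised) counts from the keyword table in the post.
KEYWORD_COUNTS = {
    "Women": (13, 4), "Harass": (2, 0), "Female": (5, 1), "Intel": (2, 3),
    "x86": (3, 4), "ARM": (3, 4), "Airplane": (1, 2), "Startup": (46, 26),
}


def penalised_share(penalised, unpenalised):
    """Fraction of matching stories that were penalised."""
    return penalised / (penalised + unpenalised)


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    penalised stories out of n if each were penalised at rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    print(f"base rate: {BASE_RATE:.3f}")
    for keyword, (pen, unpen) in KEYWORD_COUNTS.items():
        share = penalised_share(pen, unpen)
        tail = binomial_tail(pen, pen + unpen, BASE_RATE)
        print(f"{keyword:10s} share={share:.2f} tail={tail:.3f}")
```

With counts as small as 2 versus 0 ("Harass"), the tail probability is just the base rate squared, about 0.28, which is one way to see why the sample is too small to be rigorous.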
This clearly isn't an especially rigorous analysis, and in future I hope to do a better job. But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues. An interesting next step would be to attempt to infer whether the reasons for the penalties are similar between different categories of penalised stories[3], but I'm not sure how practical that is with the publicly available data.
(Raw data is here, penalised stories are here, unpenalised stories are here)
[1] Moving to San Francisco has resulted in it making more sense, but really that just makes me even more depressed.
[2] Ha ha like fuck my PhD's in biology
[3] Perhaps stories about startups tend to get penalised because of voting ring detection from people trying to promote their startup, while stories about social issues tend to get penalised because of controversy detection?
Maybe better data
Date: 2014-10-30 03:23 pm (UTC)
You're doing god's work. You may find https://github.com/sytelus/HackerNewsData a better source of data.
no subject
Date: 2014-10-30 04:00 pm (UTC)
You said: "But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues." However, the conclusions you're drawing seem to suggest that your innate prejudice is that such stories are *disproportionately* penalized compared to other types of off-topic stories. I'd suggest the alternate hypothesis that there's a rough set of on-topic topics that tend to make the cut, and most other things get penalized.
Based on analysis of the data you posted, I don't see anything obvious to support your conclusion that social justice issues are drastically *more* likely to get flagged than other off-topic stories. I'd certainly agree that they get flagged right along with piles of other off-topic stories, and apparently a very large fraction of *on-topic* stories.
In other words: HN's algorithm tends to not show its readers stories they don't want to see, and tends to show its readers stories they do want to see. You could reasonably argue about how that creates an echo chamber; see also DuckDuckGo's various discussions of "filter bubbles".
In short, many hackers are not necessarily interested in hearing about "discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves" in every source of information that they consume. Personally, I don't watch the TV news for vaguely similar reasons. I simply don't have the spoons to deal with it 100% of the time.
no subject
Date: 2014-10-30 04:06 pm (UTC)
Preprocessing:
wget http://www.codon.org.uk/~mjg59/hn_data/{flagged,unflagged}
sed 's/ (https\?:.*)$//g;s/ (item?id.*)$//g' flagged > flagged-nourl
sed 's/ (https\?:.*)$//g;s/ (item?id.*)$//g' unflagged > unflagged-nourl
grep -o '[A-Za-z]*' unflagged-nourl | tr A-Z a-z | sort | uniq -c | sort > unflagged-words
grep -o '[A-Za-z]*' flagged-nourl | tr A-Z a-z | sort | uniq -c | sort > flagged-words
# Note: Should probably filter out the top N English words
Python:
def parse(n):
    # Read "count word" lines (the uniq -c output) into a {word: count} dict
    d = {}
    with open(n) as f:
        for line in f:
            c, w = line.split()
            d[w] = int(c)
    return d
unflagged = parse("unflagged-words")
flagged = parse("flagged-words")
# Ratio of flagged to unflagged (true division, so the ratios are floats):
ratios = {w: flagged[w] / unflagged[w] for w in set(flagged) & set(unflagged)}
# Sorted by ratio:
l = sorted((r, w) for (w, r) in ratios.items())
# Top 50 words with highest flagged/unflagged ratio:
tesla 15.0
workers 12.0
understanding 11.0
netflix 9.0
name 9.0
calls 9.0
natural 8.0
months 8.0
hour 8.0
top 7.0
sell 7.0
scaling 7.0
runs 7.0
products 7.0
market 7.0
links 7.0
ipad 7.0
chat 7.0
adds 7.0
vulnerability 6.0
streaming 6.0
stock 6.0
scale 6.0
rethinkdb 6.0
nyc 6.0
low 6.0
leak 6.0
her 6.0
emails 6.0
comcast 6.0
clojurescript 6.0
cash 6.0
list 5.5
vr 5.0
very 5.0
valuation 5.0
turned 5.0
spy 5.0
shut 5.0
september 5.0
runtime 5.0
recognition 5.0
quality 5.0
prison 5.0
path 5.0
opens 5.0
needs 5.0
nd 5.0
n 5.0
middle 5.0
# Top 50 words with lowest flagged/unflagged ratio:
reality 0.125
paper 0.142857142857
reverse 0.142857142857
simulator 0.142857142857
spreadsheet 0.142857142857
architecture 0.166666666667
crash 0.166666666667
extreme 0.166666666667
jpmorgan 0.166666666667
lock 0.166666666667
mozilla 0.166666666667
simply 0.166666666667
mit 0.181818181818
advertising 0.2
bring 0.2
doctors 0.2
earn 0.2
example 0.2
execution 0.2
forever 0.2
hit 0.2
ii 0.2
improve 0.2
manual 0.2
others 0.2
parallel 0.2
pointer 0.2
printed 0.2
scottish 0.2
taking 0.2
take 0.214285714286
aws 0.222222222222
ever 0.222222222222
face 0.230769230769
emacs 0.235294117647
airport 0.25
algorithm 0.25
anonymous 0.25
asked 0.25
boy 0.25
bugs 0.25
campaign 0.25
comparing 0.25
crowd 0.25
died 0.25
dsl 0.25
economic 0.25
effort 0.25
encrypted 0.25
error 0.25
# Top 50 words that only appear in flagged stories, by number of flagged stories:
generator 10
dns 9
bill 9
aims 8
table 6
storm 6
stealth 6
moto 6
mini 6
latest 6
customer 6
whisper 5
rd 5
pricing 5
plugin 5
physical 5
ipo 5
introduces 5
epic 5
ello 5
currency 5
cto 5
winners 4
watson 4
versioning 4
unlimited 4
trust 4
trouble 4
takedown 4
split 4
sized 4
side 4
session 4
seize 4
scotland 4
rule 4
returns 4
rethinking 4
restart 4
researcher 4
reports 4
policies 4
operation 4
officially 4
mark 4
managing 4
leaks 4
jquery 4
irc 4
handle 4
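A rough sketch of how the three listings above could be produced from the two word-count dictionaries. The function names and the toy counts are illustrative, not the commenter's code, and the tie-breaking (reverse alphabetical within equal values for the "highest" and "only flagged" lists) is inferred from the order of the listings.

```python
def ratio_pairs(flagged, unflagged):
    """Sorted (ratio, word) pairs for words present in both sets."""
    shared = set(flagged) & set(unflagged)
    return sorted((flagged[w] / unflagged[w], w) for w in shared)


def highest_ratios(flagged, unflagged, n=50):
    """Words with the highest flagged/unflagged ratio, largest first."""
    return list(reversed(ratio_pairs(flagged, unflagged)))[:n]


def lowest_ratios(flagged, unflagged, n=50):
    """Words with the lowest flagged/unflagged ratio, smallest first."""
    return ratio_pairs(flagged, unflagged)[:n]


def only_flagged(flagged, unflagged, n=50):
    """Words that never appear in unflagged titles, by flagged count."""
    words = set(flagged) - set(unflagged)
    return sorted(((flagged[w], w) for w in words), reverse=True)[:n]


if __name__ == "__main__":
    # Toy counts, purely for illustration.
    flagged = {"tesla": 15, "paper": 1, "dns": 9}
    unflagged = {"tesla": 1, "paper": 7}
    print(highest_ratios(flagged, unflagged))
    print(lowest_ratios(flagged, unflagged))
    print(only_flagged(flagged, unflagged))
```

Ratios built from very small counts are noisy, so a minimum count threshold (and the stop-word filtering suggested in the preprocessing note) would likely make the listings more meaningful.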
Meritocracy
Date: 2014-10-30 07:54 pm (UTC)
It seems grounded in logic to me. The people who are objectively best at their job are the most qualified to make executive decisions pertaining to it. Therefore the best concrete results will be obtained by promoting those people. It seems very practical and inclusive to me, since everybody gets a fair chance at demonstrating their best work.
'Hackers' don't have to be your audience. 'Hackers' are over
Date: 2014-11-04 10:13 pm (UTC)
Everyone: Please spread this extremely important message to all your contacts in the media. If many websites that carry hacker news publish articles on it during the same 24 hours, victory is assured.
No? You don’t want to ‘be divisive?’ Who’s being divided, except for people who are okay with an infantilized cultural desert of shitty behavior and people who aren’t? What is there to ‘debate’?