mjg59 | Hacker News metrics (first rough approach)

I'm not a huge fan of Hacker News[1]. My impression continues to be that it ends up promoting stories that align with the Silicon Valley narrative of meritocracy, technology will fix everything, regulation is the cancer killing agile startups, and discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves.

But as a good data-driven person[2], wouldn't it be nice to have numbers rather than just handwaving? In the absence of a good public dataset, I scraped Hacker Slide to get just over two months of data in the form of hourly snapshots of stories, their age, their score and their position. I then applied a trivial test:

If the story is younger than any other story
and the story has a higher score than that other story
and the story has a worse ranking than that other story
and at least one of these two stories is on the front page

then the story is considered to have been penalised.

(note: "penalised" can have several meanings. It may be due to explicit flagging, or it may be due to an automated system deciding that the story is controversial or appears to be supported by a voting ring. There may be other reasons. I haven't attempted to separate them, because for my purposes it doesn't matter. The algorithm is discussed here.)
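As a rough illustration, the four-part test above could be written along the following lines. This is a minimal sketch, not the script actually used for the post: the `Story` fields, the `is_penalised` name and the front page size of 30 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumption: the front page holds the 30 best-ranked stories.
FRONT_PAGE_SIZE = 30


@dataclass
class Story:
    """One story as seen in a single hourly snapshot (hypothetical layout)."""
    story_id: int
    age_hours: float
    score: int
    rank: int  # 1 is the top of the front page


def is_penalised(story: Story, snapshot: Iterable[Story],
                 front_page_size: int = FRONT_PAGE_SIZE) -> bool:
    """True if some other story in the snapshot is older and lower-scored
    yet ranked better, with at least one of the pair on the front page."""
    for other in snapshot:
        if other.story_id == story.story_id:
            continue
        younger = story.age_hours < other.age_hours
        higher_score = story.score > other.score
        worse_rank = story.rank > other.rank
        on_front_page = min(story.rank, other.rank) <= front_page_size
        if younger and higher_score and worse_rank and on_front_page:
            return True
    return False
```

The test is pairwise within one snapshot, so how a story is classified across many snapshots (for example, "penalised in any snapshot") is a separate choice that this sketch leaves open.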

Now, ideally I'd classify my dataset based on manual analysis and classification of stories, but I'm lazy (see [2]) and so just tried some keyword analysis:

Keyword	Penalised	Unpenalised
Women	13	4
Harass	2	0
Female	5	1
Intel	2	3
x86	3	4
ARM	3	4
Airplane	1	2
Startup	46	26

A few things to note:

Lots of stories are penalised. Of the front page stories in my dataset, I count 3240 stories that have some kind of penalty applied, against 2848 that don't. The default seems to be that some kind of detection will kick in.
Stories containing keywords that suggest they refer to issues around social justice appear more likely to be penalised than stories that refer to technical matters.
There are other topics that are also disproportionately likely to be penalised. That's interesting, but not really relevant - I'm not necessarily arguing that social issues are penalised out of an active desire to make them go away, merely that the existing ranking system tends to result in it happening anyway.
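To put the keyword counts next to the overall figures, one option is to compare each keyword's penalised share with the front page baseline of 3240 penalised against 2848 unpenalised. The sketch below only does that arithmetic on the numbers quoted above; the variable and function names are made up for the example, and it assumes the keyword counts and the baseline were gathered the same way. With counts this small the shares are indicative at best.

```python
# Counts copied from the keyword table: keyword -> (penalised, unpenalised).
KEYWORD_COUNTS = {
    "Women": (13, 4),
    "Harass": (2, 0),
    "Female": (5, 1),
    "Intel": (2, 3),
    "x86": (3, 4),
    "ARM": (3, 4),
    "Airplane": (1, 2),
    "Startup": (46, 26),
}

# Front page totals quoted in the post.
BASELINE_PENALISED = 3240
BASELINE_UNPENALISED = 2848


def penalised_share(penalised: int, unpenalised: int) -> float:
    """Fraction of stories in a group that were classed as penalised."""
    return penalised / (penalised + unpenalised)


baseline = penalised_share(BASELINE_PENALISED, BASELINE_UNPENALISED)
shares = {kw: penalised_share(p, u) for kw, (p, u) in KEYWORD_COUNTS.items()}

# Print keywords from most to least penalised, with the gap to the baseline.
for keyword, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{keyword:10s} {share:6.1%}  ({share - baseline:+.1%} vs baseline)")
```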

This clearly isn't an especially rigorous analysis, and in future I hope to do a better job. But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues. An interesting next step would be to attempt to infer whether the reasons for the penalties are similar between different categories of penalised stories[3], but I'm not sure how practical that is with the publicly available data.

(Raw data is here, penalised stories are here, unpenalised stories are here)

[1] Moving to San Francisco has resulted in it making more sense, but really that just makes me even more depressed.
[2] Ha ha like fuck my PhD's in biology
[3] Perhaps stories about startups tend to get penalised because of voting ring detection triggered by people trying to promote their startup, while stories about social issues tend to get penalised because of controversy detection?


From: (Anonymous)

Hey Matthew,

You're doing god's work. You may find https://github.com/sytelus/HackerNewsData a better source of data.

From: mjg59

Hm. It looks like that dataset doesn't provide snapshots - I need to be able to look at a story's ranking over time in order to infer whether penalties have been applied.

From: (Anonymous)

You say that "the Hacker News ranking algorithm tends to penalise stories that address social issues". Wouldn't it be fairer to say that "Hacker News' users and/or moderators tend to penalize stories that address social issues"?

From: mjg59

Manual moderation is only one of the inputs into the scoring. There are at least two automated systems that can also apply penalties. I'm not currently able to distinguish between any of these things.

From: (Anonymous)

I would tend to suggest that stories that refer to *anything* other than technical matters are more likely to be penalized than stories that refer to technical matters. That doesn't seem drastically different than, for instance, penalizing stories about social justice posted to StackOverflow, or to a technical mailing list, or to /r/mylittlepony/. (And conversely, I'd expect stories primarily about hacking to get penalized on a forum about social justice issues or feminism, unless they're specifically about social justice / feminist aspects of hacking.)

You said: "But for now the evidence appears consistent with my innate prejudice - the Hacker News ranking algorithm tends to penalise stories that address social issues." However, the conclusions you're drawing seem to suggest that your innate prejudice is that such stories are *disproportionately* penalized compared to other types of off-topic stories. I'd suggest the alternate hypothesis that there's a rough set of on-topic topics that tend to make the cut, and most other things get penalized.

Based on analysis of the data you posted, I don't see anything obvious to support your conclusion that social justice issues are drastically *more* likely to get flagged than other off-topic stories. I'd certainly agree that they get flagged right along with piles of other off-topic stories, and apparently a very large fraction of *on-topic* stories.

In other words: HN's algorithm tends to not show its readers stories they don't want to see, and tends to show its readers stories they do want to see. You could reasonably argue about how that creates an echo chamber; see also DuckDuckGo's various discussions of "filter bubbles".

In short, many hackers are not necessarily interested in hearing about "discouraging stories that suggest that the world of technology is, broadly speaking, awful and we should all be ashamed of ourselves" in every source of information that they consume. Personally, I don't watch the TV news for vaguely similar reasons. I simply don't have the spoons to deal with it 100% of the time.

From: mjg59

Your assertion that these stories are off-topic isn't supported by the guidelines.

From: (Anonymous)

The guidelines list some samples of what's on-topic and off-topic; they're not an exhaustive list, as suggested by "guidelines". In general, what's on-topic and off-topic is defined precisely by what gets promoted and what gets flagged, like almost any user-curated forum.

What I'm trying to ask, with my previous comment, is: are you opposed in general to user-curated news sites, or more generally to any algorithm designed to show people the subset of information that they want to see?

Yes, people could often do with significantly more exposure to things that make them uncomfortable, and to social justice topics in particular. We need many more people working to fix such issues, and awareness of such issues helps in fixing them. However, people don't tend to *want* to be made uncomfortable, and I don't think it's reasonable to expect that every news site, social media site, or other source of information should do so.

From: mjg59

I'm opposed to sites that are designed in such a way that it's easy to hide high-quality content that challenges the existing narrative.

From: (Anonymous)

The primary job of a news site is to filter a massive amount of content down into a quantity that people can usefully consume (to varying degrees; for instance, providing a list of headlines short enough to skim and then read a subset of).

Short of completely manual curation by a small group of people (selected for a range of view points), which can lead to its own forms of bias in addition to scaling poorly, it's not at all obvious how you could provide curation and filtration *without* that property.

"content that challenges the existing narrative" includes the types of social justice stories you're pushing for, but it also includes things like anti-FOSS screeds, pro-software-patent stories, or the latest on TempleOS from its...interesting...author. I'd expect all three of those to be both upvoted and flagged on HN too, even though all three of them are technology-related. (And I don't want to read any of those three.)

For that matter, one common failing of news sites (common on sites that intentionally try to show everything that people find interesting without any equivalent of "flagging"; also common in TV news): automatically giving all viewpoints equal time. To use social justice as an example: whenever people post hateful or anti-social-justice content on HN (which does happen, both as stories and especially as comments), it tends to get flagged off the site incredibly fast, and often the poster ends up banned. I've certainly had rather positive results flagging such content myself.

It isn't the job of *every* news site to show a selection of stories that intentionally includes every viewpoint. Some news sites are specifically designed to show a subset of stories. News sites that do a worse job of showing people what they want to read get replaced by sites that do a better job of showing people what they want to read. The set of people who actively want to be shown stories that challenge them visit other news sites for that content, and I would *hope* that many people get their news from multiple sites. I certainly do.

Out of curiosity, what news sites would you suggest that provide a primarily technology focus but include more of the kind of content you want to read? (I don't mean sites whose specific purpose is to include such content, but rather, sites that include high-quality examples of such content alongside various other high-quality content.) I'm always interested in finding better news sites.

From: (Anonymous)

A clarification, by the way: I personally *want* to see high-quality stories about social justice in technology. I get most of that content from Twitter, rather than HN. I simply don't expect to get that portion of my news from HN, any more than I expect to get news about interesting programming languages and operating systems from the Wall Street Journal.

From: (Anonymous)

Reposting this further analysis here from Twitter:

Preprocessing:

wget http://www.codon.org.uk/~mjg59/hn_data/{flagged,unflagged}
sed 's/ (https\?:.*)$//g;s/ (item?id.*)$//g' flagged > flagged-nourl
sed 's/ (https\?:.*)$//g;s/ (item?id.*)$//g' unflagged > unflagged-nourl
grep -o '[A-Za-z]*' unflagged-nourl | tr A-Z a-z | sort | uniq -c | sort > unflagged-words
grep -o '[A-Za-z]*' flagged-nourl | tr A-Z a-z | sort | uniq -c | sort > flagged-words

# Note: Should probably filter out the top N English words
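One possible way to act on that note is to drop common English words from the per-word counts before computing any ratios. The sketch below assumes the counts are already in a `{word: count}` dict; the `STOPWORDS` set is a tiny hand-picked sample for illustration, not a real "top N" list, and `drop_stopwords` is a hypothetical helper name.

```python
# Hypothetical sample only; a real run would load a proper list of the
# most frequent English words.
STOPWORDS = {
    "a", "an", "and", "for", "how", "in", "is", "of", "on",
    "the", "to", "why", "with",
}


def drop_stopwords(counts, stopwords=STOPWORDS, min_length=2):
    """Return a copy of counts without stopwords or very short tokens.

    min_length=2 removes single-letter fragments such as "n"; raising it
    would also remove short but meaningful tokens such as "vr".
    """
    return {
        word: count
        for word, count in counts.items()
        if word not in stopwords and len(word) >= min_length
    }
```

For example, `drop_stopwords({"the": 120, "tesla": 15, "n": 5})` keeps only the `tesla` entry.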

Python:

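def parse(n):
    """Read `uniq -c` output ("count word" per line) into a {word: count} dict."""
    d = {}
    with open(n) as f:
        for line in f:
            c, w = line.split()
            d[w] = int(c)
    return d

unflagged = parse("unflagged-words")
flagged = parse("flagged-words")

# Ratio of flagged to unflagged, for words that appear in both files
# (Python 3 true division, so 11 flagged / 2 unflagged gives 5.5):
ratios = {w: flagged[w] / unflagged[w] for w in set(flagged) & set(unflagged)}

# Sorted by ratio, lowest first:
l = sorted((r, w) for (w, r) in ratios.items())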

# Top 50 words with highest flagged/unflagged ratio:
tesla 15.0
workers 12.0
understanding 11.0
netflix 9.0
name 9.0
calls 9.0
natural 8.0
months 8.0
hour 8.0
top 7.0
sell 7.0
scaling 7.0
runs 7.0
products 7.0
market 7.0
links 7.0
ipad 7.0
chat 7.0
adds 7.0
vulnerability 6.0
streaming 6.0
stock 6.0
scale 6.0
rethinkdb 6.0
nyc 6.0
low 6.0
leak 6.0
her 6.0
emails 6.0
comcast 6.0
clojurescript 6.0
cash 6.0
list 5.5
vr 5.0
very 5.0
valuation 5.0
turned 5.0
spy 5.0
shut 5.0
september 5.0
runtime 5.0
recognition 5.0
quality 5.0
prison 5.0
path 5.0
opens 5.0
needs 5.0
nd 5.0
n 5.0
middle 5.0

# Top 50 words with lowest flagged/unflagged ratio:
reality 0.125
paper 0.142857142857
reverse 0.142857142857
simulator 0.142857142857
spreadsheet 0.142857142857
architecture 0.166666666667
crash 0.166666666667
extreme 0.166666666667
jpmorgan 0.166666666667
lock 0.166666666667
mozilla 0.166666666667
simply 0.166666666667
mit 0.181818181818
advertising 0.2
bring 0.2
doctors 0.2
earn 0.2
example 0.2
execution 0.2
forever 0.2
hit 0.2
ii 0.2
improve 0.2
manual 0.2
others 0.2
parallel 0.2
pointer 0.2
printed 0.2
scottish 0.2
taking 0.2
take 0.214285714286
aws 0.222222222222
ever 0.222222222222
face 0.230769230769
emacs 0.235294117647
airport 0.25
algorithm 0.25
anonymous 0.25
asked 0.25
boy 0.25
bugs 0.25
campaign 0.25
comparing 0.25
crowd 0.25
died 0.25
dsl 0.25
economic 0.25
effort 0.25
encrypted 0.25
error 0.25

# Top 50 words that only appear in flagged stories, by number of flagged stories:
generator 10
dns 9
bill 9
aims 8
table 6
storm 6
stealth 6
moto 6
mini 6
latest 6
customer 6
whisper 5
rd 5
pricing 5
plugin 5
physical 5
ipo 5
introduces 5
epic 5
ello 5
currency 5
cto 5
winners 4
watson 4
versioning 4
unlimited 4
trust 4
trouble 4
takedown 4
split 4
sized 4
side 4
session 4
seize 4
scotland 4
rule 4
returns 4
rethinking 4
restart 4
researcher 4
reports 4
policies 4
operation 4
officially 4
mark 4
managing 4
leaks 4
jquery 4
irc 4
handle 4
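The three lists above could plausibly be produced from the two word-count dicts along these lines. This is a sketch under assumptions: the function names are invented for the example, and the tie-breaking (reverse alphabetical for the descending lists, alphabetical for the ascending one) is inferred from the ordering of the printed output rather than taken from the commenter's actual code.

```python
def top_ratio(flagged, unflagged, n=50, highest=True):
    """Words present in both dicts, ordered by flagged/unflagged ratio."""
    shared = set(flagged) & set(unflagged)
    pairs = [(flagged[w] / unflagged[w], w) for w in shared]
    # Sorting (ratio, word) tuples breaks ties on the word itself.
    pairs.sort(reverse=highest)
    return [(w, r) for r, w in pairs[:n]]


def only_in_flagged(flagged, unflagged, n=50):
    """Words that never occur in unflagged titles, most frequent first."""
    pairs = [(c, w) for w, c in flagged.items() if w not in unflagged]
    pairs.sort(reverse=True)
    return [(w, c) for c, w in pairs[:n]]
```

Words with only a handful of occurrences dominate both ends of the ratio lists, so a minimum-count threshold may be worth adding before reading much into them.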

From: (Anonymous)

Better yet, see the analysis at http://adambernard.net/tmp/differences.tab

From: (Anonymous)

What's wrong with meritocracy?
It seems grounded in logic to me. The people who are objectively best at their job are the most qualified to make executive decisions pertaining to it. Therefore the best concrete results will be obtained by promoting those people. It seems very practical and inclusive to me, since everybody gets a fair chance at demonstrating their best work.

From: (Anonymous)

Leaving aside whether the statistics posted in this article are an issue of meritocracy, the problems with meritocracy itself are quite well documented. Do you want to select for "best" or "would turn out best under optimal conditions"? Meritocracy typically selects for the former, not the latter; it highlights those who shine, and ignores those who have not had an opportunity to shine.

Also see https://news.ycombinator.com/item?id=8534078 for another issue with focusing only on the highlights of an entire community.

From: (Anonymous)

Well, how could you select the "best under optimal conditions" if they have not had "an opportunity to shine", without getting it wrong? Anybody you did not select could later turn out to have been the better pick, no? Also, who decides what the optimal conditions are under which a certain person will shine more than others, if everyone is granted their own optimal conditions?

Spinning this further: if we assume we all have different optimal conditions, whom would you pick and change conditions for so that they can shine, knowing that changing conditions for them may mean others can no longer shine, or not as much as before?

Seriously, the only answer is to try to improve conditions for everybody and see who shines most, which gives you option 1: meritocracy.

From: nadyne

Meritocracy assumes that there is such a thing as "objectively best at their job". "Best at their job" is subjective, and there are many different metrics that could be a part of that. You also make the assertion that "everybody gets a fair chance at demonstrating their best work", which is an interesting one. If we assume that your assertion is true (which I'm frankly not convinced by, and I hope you can provide some citations for this assertion), the next step is critical: how do we know that everyone who has demonstrated their best work is getting evaluated fairly?

Those who have been identified to be best at their job are not necessarily the most qualified to make executive decisions pertaining to that. Bias, conscious or unconscious, is a trap that even the best of us can fall into.

If you want to get into some of the theory around meritocracy, Nature had an article earlier this year that attempts to model a true meritocracy. The online version is here.

It's also worth noting that the people who are most likely to believe in meritocracy are young, upper-class, white men. The people who are least likely to believe in meritocracy are older, lower-class, minorities. I couldn't find the full article online; its abstract is here.

From: (Anonymous)

I think nobody assumes meritocracy to be perfect. It's run and executed by humans, so for sure it's biased, not absolutely objective, has flaws, etc. Like democracy, it may not be optimal, but neither are the other options.

What do you suggest as a better alternative?

From: m50d.wordpress.com

I flag these stories whenever they come up because I know from experience that the resulting threads tend to be unproductive flamefests.

From: (Anonymous)

That's an interesting point of view. The comments on such articles tend to contain a disproportionately high number of the kinds of trollish comments (and concern trolls, and subtle trolls, and so on) that need downvoting and flagging into oblivion. Personally, I think that's an argument for keeping them, and banning the trolls.

It also seems likely that such articles would trigger HN's flamewar detector, which has the same effect as flagging.

From: glyf

In other words: by perpetuating an unrelenting campaign of harassment and sea-lioning, reactionary defenders of the status quo have successfully convinced you that these topics are "controversial" rather than convincing you that they're assholes.

This is the argument to moderation, a popular logical fallacy in the cable news industry. This is the problem with HN; it positions itself as "disrupting" things, as making radical improvements using technology, but in reality it is a gigantic performance art piece about capitulating to entrenched power structures.

From: (Anonymous)

On the evidence of the last few weeks, what we are seeing is the end of hackers, and the viciousness that accompanies the death of an identity.

Everyone: Please spread this extremely important message to all your contacts in the media. If many websites that carry hacker news publish articles on it during the same 24 hours, victory is assured.

No? You don’t want to ‘be divisive?’ Who’s being divided, except for people who are okay with an infantilized cultural desert of shitty behavior and people who aren’t? What is there to ‘debate’?


Profile

Matthew Garrett

About Matthew

Power management, mobile and firmware developer on Linux. Security developer at nvidia. Ex-biologist. Content here should not be interpreted as the opinion of my employer. Also on Mastodon and Bluesky.
