Privacy in ratings aggregation

Main article: Privacy, identity, and fraud in the ratings system

Let’s say our aggregation technique is a straight average. This is a subject we covered before and decided that the privacy-enhanced version of it would include each node only passing on summed values to the next node up. That way, the final node that performed the aggregation (and presumably reported it) would have no knowledge of the individual components making up the average. But this may be scant comfort to our privacy-concerned sources. After all, they had to relay their rating to someone and must then trust that someone to keep it secret. We presume that the next higher node will also have a privacy concern and thus act in good faith to keep every incoming value secret, but you never know.

Before we begin looking at this, let’s make clear that for some cases, privacy is impossible while adhering to a completely accurate aggregate average (we will come to this point again when we discuss differential privacy). But for now, let’s suppose I ask 4 of my peers to rate my financial integrity. Since I’ve never really committed any acts of financial crime, and they don’t know my financial history all that well anyway, they all give me a 10 and call it a day. The average is therefore a 10, the highest possible. But now I know what everyone said since I received the highest possible score. Maybe none of my peers cares about the fact that I know the score they gave me in this instance. But now let’s suppose I tried to get a job as a financial accountant and get a rating of 0 from everyone on my knowledge of that subject. Again, I know what everyone said and it certainly hurts my chances of getting the job. My peers, who did nothing more than rate me accurately, may not want this fact known. For these extreme situations, privacy can never be guaranteed, even when only the aggregate score is reported.

Obviously, in situations like these, the rater may very well know the sample size (or even the other raters) and will certainly know the ratings question. And so, they can choose to abstain because it is clear that the aggregate will lead to an extreme answer (like 0). One way to avoid this is to not report the actual number but a range instead: 0-3 Bad, 4-6 Average, 7-8 Good, 9-10 Excellent. The ratee would still have some information about his raters but it would make it more difficult to specify who said what. A related method would simply round the score to the nearest whole number.

It is unlikely that for a large sampling we will get extreme answers, of course. But it is another reason why we should resist the temptation to grade on a curve (like the Freedom House rating of Finland). When we do that we think of the best or worst example and give them the highest or lowest score. It’s wrong objectively and helps compromise the source.

Aggregate ratings at the high or low end also compromise the sources in that they betray some information about them. If I were to receive a score of 0.25 from 4 of my peers on financial accounting, I know that the highest possible score any one of them gave me was a 1 ((1+0+0+0)/4 = 0.25). I still know that each of them thinks my knowledge of the subject is bad. In fact, as the score tends toward the higher and lower end of the range, the probability is higher that the raters rated me toward the higher or lower end as well. Let’s skip the math on this for now but it seems clear that aggregate information can, in many instances, reveal at least some information about the source.

So let’s do the best we can. Let’s suppose there are 4 sources and I want to know their average opinion of me. None of the sources wants me to know what they think individually but they’re ok with me seeing the aggregate score. The best they can do is report their rating to the aggregator without revealing who they are but only that they are part of a valid group of raters.

So the group of raters establishes a shared key that they use to encrypt their rating with. Each member then anonymously sends their encrypted rating to a common location (ie a collection box). When all the ratings are there, a member of the group (who has the key) decrypts the ratings, adds them all up, counts up the number of raters, and reports these two facts to the aggregator. In a sense, the member is the aggregator but let’s assume we make this distinction because we don’t want the ratings organization to have the private key to the ratings collection box (because we don’t want them to see any individualized data whatsoever).

So it would appear that the raters have effectively kept their information hidden. Even if one of the raters is unfaithful and gives the key to someone, no one will know who is responsible for which rating. In the event that we are doing a check, however, it would be pretty easy to verify that the numbers in the box haven’t been altered by, say, one of the raters. We can simply report the numbers back to each rater and ask them to vouch for one of the numbers as their own. As long as all the numbers have been vouched for, we know that the aggregate is correct.

This method requires that all members share a private key. Even though we have the check noted above, it also requires that we trust one of the members to report the aggregate faithfully, not tamper with the data, etc.

We might be able to make this more foolproof by using Shamir’s Secret Sharing method. Here, instead of trusting just 1 person to decrypt the data, report correctly, etc. we require 2 people to do it. Any two of the four original raters will work.

This method makes use of polynomial interpolation. Let’s do a simple example to illustrate the concept and then refer to a more general method covered in the literature.

Suppose the secret key to the box is 5638 and we want any 2 of the 4 raters to be able to open it. We can encode the secret (5638) in a linear function:

${\textstyle y={a_{0}}+{a_{1}}x}$

Where ${\textstyle a_{0}=5638}$

Now we choose an ${\textstyle a_{1}}$ at random, so say

${\textstyle a_{1}=7}$

So our linear equation can be written as

${\textstyle y=5638+7x}$

Using this equation we can obtain any number of coordinate points (x, y) on the line. Let’s suppose we pick two, (13, 5729) and (26, 5820) and give one person in our group the first coordinate and another person the second coordinate. Now these two people can find the secret key by reconstructing the equation. We can, in the same way, give another 2 coordinate points to the two other members of the group. To do this it might help to recall that any line is more familiarly described as

${\textstyle y=mx+b}$

Where ${\textstyle m}$ is the slope of the line and ${\textstyle b}$ is the y-intercept. Thus

${\textstyle a_{1}=m}$ and ${\textstyle a_{0}=b}$

To find ${\textstyle m=a_{1}}$ (the slope) we simply compute the rise over the run using the two points:

${\textstyle m=a_{1}={{5820-5729} \over {26-13}}=7}$

To find ${\textstyle a_{0}}$ , the secret, we plug this slope into the formula above, along with one of the points,

${\textstyle 5820=a_{0}+7(26)}$

and solve for ${\textstyle a_{0}}$ :

${\textstyle a_{0}=5820-7(26)=5638}$ , which is the secret.

Thus any two members of the group can find the key, decrypt the data, and proceed to aggregate the numbers. However, only one member will not be able to reconstruct the key since he only has one point available to him and an infinite number of lines go through that point.

We can use this technique for any number of people by increasing the order of the polynomial. For a 2nd order polynomial (a parabola) we would need 3 people to have a different point on the parabola to decrypt the data. For a 3rd order polynomial, 4 people, and so on:

${\textstyle n=k+1}$

where

${\textstyle k}$ is the order of the polynomial

${\textstyle n}$ is the number of people required for decryption

Any number of people less than ${\textstyle n}$ will be unable to reconstruct the secret key. So for a 3rd polynomial, say, 4 people are needed and any less than that will very likely fail.

Polynomial interpolation for higher order polynomials isn’t as simple as for a straight line but straightforward, closed-form methods exist for any polynomial degree. These methods use Lagrange polynomials and are described at length here and here:

Furthermore, any standard mathematical library can be found to run these calculations, for instance the interpolation library in Python’s SciPy.