Subscribe:

Monday 28 January 2013

Facebook Writes New Language To Fight Spam




On Thursday, Software Engineer Louis Brandy revealed that the engineering staff at Facebook have created a new programming language to combat spam.

The programming language, called Feature eXtraction Language, or FXL for short, is “a domain-specific language forged in the fires of spam fighting at Facebook”.

Brandy admits that writing a language from scratch is not always the most effective approach to fighting spam. FXL, however, “is not a novel language”.

In reality, FXL is “a narrowly-optimized implementation of a well-chosen subset of Standard ML (with some customized syntax)”, modified to tackle the ever-evolving spam that bombards Facebook’s servers on a daily basis.

FXL was designed to cope with spam threats which change “on a daily, or even hourly, basis”, meaning it is “easy to write”, but still capable of sifting through the enormous amounts of data on Facebook.

To display FXLs superiority as a Facebook spam fighter, Brandy gives a few examples of “contrived spam fighting rules, expressed in FXL, for catching dangerous URLs:”
If (Reputation(SharedUrl) < 0) Then [LogRequest] Else []
If (Reputation(SharedUrl) == MALWARE) Then [BlockAction, LogRequest] Else []
If (Average(Map(Reputation, PreviousSharedUrls(User, 5))) < 0) Then [WarnUser, LogRequest] Else []
Brandy then explains the problem:
These rules retrieve the user’s URL sharing history and fetch data from a URL reputation service. While they coherently express business logic for detecting spam, these rules are poor expressions of the optimal data fetching logic. A conventional implementation would evaluate this code top to bottom, left to right. We would fetch data sequentially, conducting an excessive number of network round trips between the machine executing FXL and the reputation service. This is a classic problem of large computer systems: naively mixing business logic with data fetching logic, resulting in pathologically bad performance. A more sophisticated approach would find a way to batch these data fetches in a single network round trip. FXL was designed to do precisely this and automate these data fetches.
Because FXL is “a "pure" language with no side effects” – i.e. it does not update the data it fetches – there are a number of shortcuts that can be made when combing through data:
All features and functions can be safely memoized...
  1.  F(X) will always be Y, no matter how many times we compute it
  2.  Random() is not pure, therefore not memoizable (and not allowed inFXL) 
or executed lazily... 
  1.  "False && F(X)" can safely skip F(X)
  2.  "If False Then True Else F(X)"  as well 
or safely reordered. 
  1.  "G(F(x), F(y), ...)" will give the same result, no matter which F is executed first.
  2.  "A(x) + B(x) + C(x)" as well”
Brandy then takes a snippet of the code from before to use as an example:
Map(Reputation, PreviousSharedUrls(User, 5))
It would take another URL reputation service up to five attempts to finish this snippet, FXL can do all five simultaneously. “This is not a special property of the Map() function” affirms Brandy, “as this optimization is performed across all expressions of all rules.”
FXL is able to batch requests together because the order in which it evaluates these function calls has no bearing on their results (this follows from their lack of side effects). FXL will actually halt the execution of one function, begin executing a second function, and only later return to complete executing the first function.
In the call to Map() above, FXL makes five calls to Reputation(). FXL begins executing the first call to Reputation(), then halts its execution at the point it would need to fetch data from the URL reputation service. FXL then begins executing the second call to Reputation(), halting again before fetching any data.
FXL repeats this begin-and-halt procedure on the third, fourth, and fifth calls to Reputation() as well. At this point, no functions remain which have not been partially executed. No function can proceed without fetching data, so FXL fetches all the data needed by these functions in a single batch. Having obtained the URL reputation data, it can resume execution of all five calls to Reputation().
Consequentially, FXL’s batched fetching is twenty times faster that a “a naïve execution model that fetches data eagerly.”

The Facebook engineering team also uses memoization to speed up FXL’s process: “all common FXL expressions” are memoized, so that every repeated expression is executed only once.

Despite the frequent spam-related complaints from Facebook users, the amount of data fetching FXL performs compared to what little spam gets through suggests that the language performs its function admirably.

Have you ever encountered spam on Facebook? Do you think FXL performs its task well?

0 comments:

Post a Comment