10 Questions with Facebook Research Engineer – Andrei Alexandrescu

This article was originally posted on serversidemagazine.com, a defunct blog from 2011; it was moved here for posterity.

The article was published on January 29th, 2012.


Today we caught up with Andrei Alexandrescu for a “10 Question” interview. He is a Romanian-born research engineer at Facebook living in the US. You can contact him through his website, erdani.com, or at @incomputable.

We will talk about some of the juicy stuff that’s going on at Facebook, so let’s get started.


Hello Andrei, welcome to Server-Side Magazine.

1. Tell us a little bit about yourself. Who are you? Where do you work, and what do you do?

Who am I? Ah, the coffee breath of one talking about himself. Well let me try. I’m a hacker living in the US, originally from Romania. In 2001 I wrote a book called “Modern C++ Design” that was notable enough to convince me to do anything I could to avoid being typecast as a “C++ guy”.

I don’t want to be an “XYZ guy” forever. Such efforts included eight years spent on a doctorate in Machine Learning for Natural Language Processing, partially overlapping with five years of work on the D programming language.

I currently work at Facebook. The high-brow title of “Research Scientist” is a wildcard covering a variety of tasks, which I’m lucky to have considerable freedom in choosing. Sampling my work week you could find me messing with some bash script, reviewing some code, making an off-by-one error, or pulling my hair over said bug.

I’m currently involved in three lines of work. One is working on our core C++ libraries (I know, typecast!) on an ongoing basis. We’re considering open sourcing some of those. The second is optimizing Hive, a distributed database system we use intensively at Facebook. Finally, my most research-oriented work is actually a continuation of my thesis.

I do research on graph-based machine learning. And guess who has the biggest and most interesting graph in the Universe! Machine learning on the social graph can help a variety of applications in spam detection, site integrity, and much more.

2. What’s your development setup?

The most significant bit would be Windows vs. Unix, right? Facebook being a Unix shop, it uses Unix tools through and through. Let me dive into our tools a bit.

Each Facebook engineer gets a choice of MacBook or Windows laptop, plus the invariable 30-inch monitor (yum). But development is not really happening on the laptop itself; to get any work done, engineers connect to a remote Linux machine (each engineer has one assigned) using a variety of protocols over ssh (plain terminal, nx, vnc, and probably more).

There’s freedom in choosing editors, so the usual suspects – emacs and vim – are quite popular, with some Eclipse and others here and there. I personally prefer emacs via nx; the combo works quite swimmingly even over a slow connection.

We also have a lot of cool organizational tools, many developed in house. That sounds a bit NIHish, but the history behind it is that we tried hard to make off-the-shelf tools work at the scale and quality we need them to, failed, and had to write our own.

The tools use our own technologies (talk about dog food) so they work, look, and integrate beautifully. Best part, if someone doesn’t like something, well, they can just fix it. (To wit, our email and calendar software is off-the-shelf and is the most unpleasant tool to deal with. Get this – we have a few people “specialized” in sending large meeting invites out, because there are bugs that require peculiar expertise to work around. Not to mention that such invites come with “Do not accept from an iPhone lest you corrupt the invite for everyone!”)

Anyway, back to our tool chain. Once an engineer makes a code change that passes unit tests and lint, they submit a so-called “diff” for review via our Phabricator system, which we open sourced.

The reviewers are selected partly manually, partly automatically; virtually not one line of code is committed without having been inspected by at least one (other) engineer. Phabricator is great at this flow, making diff analysis, comment exchange, and revision updates very handy. I’d recommend it.

Once the diff has been approved, the author uploads it to our central git repository. We love git; when I joined two years ago, we were just starting to migrate from svn to git, and today we virtually all use git. Some of us (including myself) wrote a few popular git scripts that integrate with our workflow.

To build C++ code we have our own build system driving a build farm. I don’t do front-end work, so I don’t know many details in that area; in broad strokes, we use the recently-released HipHop Virtual Machine (HHVM) for development, and the static HipHop compiler for the production site.

We have quite a few more browser-based tools for improving workflow, such as task management, discussions, wiki, peer review, recruiting and interviewing, analytics, systems management, and many many more. Really for pretty much any typical need “there’s an app for that”. And if there isn’t, there’s a vast infrastructure allowing you to build one quickly.

3. How did you get started coding in C++?

I think for many of us, starting with a language has a lot of happenstance associated with it – much like picking a favorite sports team or getting a crush on someone. For me, much of it was the buzz in the college. Pascal was the prevalent language, but I recall the jocks talking “C is the real thing, dude” and “C++ is cool like C, but it has objects, encapsulation, the works… that’s good if you want to build large applications!”

I thought, heck, large applications are the place to be. A stolen electronic copy of Bjarne’s opus and a great book by Romanian C++ celebrity Ionut Muslea (whom I incidentally got to know years later) sealed the deal.

One funny thing – I worked as a programmer for a while in Romania before moving to the US, and I was convinced my understanding of C++ was mediocre at best. Imagine my surprise when coworkers at my first job in the US quickly came to consider me quite the C++ connoisseur – and, as I mentioned, that’s what I became no matter what else I tried.

4. What do you think of PHP as a language, given that Facebook was initially written in PHP and then translated to C++ using HipHop for PHP? What are the pros and cons of using C++ over PHP at Facebook?

I was afraid you were going to ask that 😮

Let me first clarify that HipHop for PHP (HPHP) is a PHP to C++ translator, but the generated code is not meant for human consumption; it’s just compiled. We still write PHP code for our front-end.

Probably I won’t bruise any ego by saying I don’t consider PHP a well-designed language; it makes a lot of the classic mistakes and even invents a couple of new ones. Rasmus Lerdorf is a self-confessed dilettante in language design. But let’s not forget that Rasmus is a brilliant hacker who likes to get things done.

So he connected and integrated PHP with everything worth connecting and integrating, and made every typical task in web design no further than a few hacks away. I think that’s where the power of PHP lies. For example, I’d say D is a better-designed language, but it would be a major undertaking for me to integrate it with Apache the way PHP is.

Facebook’s outlook on PHP is largely passionless; yes, engineers understand it is far from perfect, and people occasionally rant or show some WTF code sample. At the same time, at Facebook we love doing cool things, and PHP is simply a means to an end. With our extensive framework and libraries, it’s also often the simplest means to an end.

We also added important improvements to PHP that help a lot. XHP is an extension to PHP that allows embedded XML in PHP. Our front-end engineers swear by it.

We’ve also added a Python-like yield. One topic that’s been discussed several times is adding a “real” array type; PHP’s arrays are a rather odd concoction, and not efficient enough for our needs.

That being said, one cannot stop thinking about costs and liabilities. HPHP is a major Facebook project to which many engineers have contributed. We also have a newer project, HHVM, which is a fast virtual machine for PHP.

That’s great for our testing and debugging cycle because it obviates the expensive steps of translating to C++ and compiling it. All this work raises the classic question: are we spending too much on fixing the car to afford a new one?

At this point, probably not. There’s no language that we could switch to without a major disruption, and the incremental benefits would not be all that large; you see, PHP is far from perfect but our engineers are very good, and with a combination of talent, discipline, and tools, our PHP code base is in good shape.

The same goes for our C++ code base. We use C++ heavily on our back-end systems, and there has actually been a visible net increase in C++ use compared to PHP and others at Facebook over the past two years.

To compare C++ with PHP (and with this I’m finally getting to your exact question), in a way we have more incentive to write good C++ than good PHP, because the penalty is so much harsher: the typical C++ bug is more difficult to fix than the typical PHP one, and the impact of a back-end issue is often larger than that of a front-end issue. So I’d say we have a better C++ code base just on account of survival needs.

Compilation times (and generally managing dependencies and physical design) remain a large issue with C++. We have a build farm driven by sophisticated software, which reduces the problem’s size from unbearable to just major. Deployment also has its issues due to the sheer scale of our systems, but we have much more control there.

At the end of the day, the performance is there, and we need every single ounce of it. We have concurrent hash tables with millions and even billions of elements, so we’re careful about our hash functions; we know the vast distinctions between sorting and stable sorting; we know and care about the cost of memory allocation, locking, or atomic increment. C++ allows us control over all of that, whereas PHP and most other languages wouldn’t.
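To make the distinction between sorting and stable sorting concrete, here is a minimal sketch using the C++ standard library. The `Post` type, its fields, and the `rankByLikes` function are hypothetical names invented for this illustration; nothing here reflects Facebook’s actual code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record type, used only for this illustration.
struct Post {
    std::string author;
    int likes;
};

// Sorts by likes (descending) while keeping posts with equal likes in
// their original relative order. std::stable_sort guarantees that;
// std::sort makes no such promise and may reorder equal elements.
std::vector<Post> rankByLikes(std::vector<Post> posts) {
    std::stable_sort(posts.begin(), posts.end(),
                     [](const Post& a, const Post& b) {
                         return a.likes > b.likes;
                     });
    return posts;
}
```

The stability guarantee is not free: std::stable_sort may need additional memory, or extra comparisons when that memory is unavailable, which is one reason the choice between the two matters on large inputs.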

5. Currently, what kind of research do you conduct at Facebook? (or is this confidential?)

I can discuss it in broad strokes.

Well, I was lucky to be able to continue my thesis work (in addition to a few other directions). This carries a risk – a common mistake of PhD graduates is to try to apply their thesis everywhere. In my case there was some serendipity involved; let me explain.

My research topic was machine learning using similarity graphs. In brief, say you have a lot of data points (users, servers, posts, clicks, you name it) and you want to figure out some interesting properties about them. In a graph-based approach you throw all data points in a graph and you link them by similarity edges, after which you look at how they cluster together.

For example, let’s say we want to figure out whether a public status update is a meme or a joke versus original content. In that case, you need to figure out a “similarity” between any two posts. That’s not too difficult (but not trivial, either) as there are plenty of approximate string comparison metrics. If there are many similar posts from unrelated persons, those are likely to form a meme – the kind that is copied and pasted over and over again.
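The idea can be sketched on a toy scale as follows. This is only an illustration under stated assumptions: the interview names no specific metric or clustering method, so Levenshtein edit distance, the normalized `similarity` score, the `threshold` parameter, and the use of connected components as “clusters” are all choices made here, and every function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Levenshtein edit distance, one common approximate string metric
// (an assumption; the interview does not say which metric is meant).
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    std::iota(prev.begin(), prev.end(), std::size_t{0});
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Similarity in [0, 1]: 1 means identical strings.
double similarity(const std::string& a, const std::string& b) {
    std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(editDistance(a, b)) / longest;
}

// Union-find root lookup with path halving.
std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// Adds a similarity edge between every pair of posts scoring at least
// `threshold`, then returns one label per post; posts that share a label
// are connected in the similarity graph.
std::vector<std::size_t> clusterPosts(const std::vector<std::string>& posts,
                                      double threshold) {
    std::vector<std::size_t> parent(posts.size());
    std::iota(parent.begin(), parent.end(), std::size_t{0});
    for (std::size_t i = 0; i < posts.size(); ++i) {
        for (std::size_t j = i + 1; j < posts.size(); ++j) {
            if (similarity(posts[i], posts[j]) >= threshold) {
                std::size_t ri = findRoot(parent, i);
                std::size_t rj = findRoot(parent, j);
                parent[ri] = rj;
            }
        }
    }
    std::vector<std::size_t> label(posts.size());
    for (std::size_t i = 0; i < posts.size(); ++i) {
        label[i] = findRoot(parent, i);
    }
    return label;
}
```

The all-pairs loop is quadratic in the number of posts, so this only suits small samples; it says nothing about how a graph at the scale discussed in this interview would be built or processed.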

Then, what’s meme detection for? Well it has plenty of good uses. It allows better content grouping so you don’t see near-identical things all over your news feed. It also allows identifying chaff, spam, and possibly even malevolent content. (Note that this is only a hypothetical project; we do have attacks on these issues, but they currently use other methods.)

One really cool thing about doing machine learning at Facebook is the sheer size of data involved; most researchers are happy to put together a graph with some hundreds of thousands of nodes. At Facebook samples of interest are in the hundreds of millions and more. It took a few of us quite a while to figure out how to distribute graph processing on many machines, but we finally did it. We should be able to publish the method soon.

6. Tell us a little bit about the D programming language, in contrast to C, PHP, Ruby and others. In what fields can someone apply D? (is it mostly research / scientific applications, web environments?)

D is aiming to occupy a surprisingly large and underpopulated area on the landscape of programming languages, lying between the convenience of scripting languages, the safety and modeling power of full-featured type systems, and the control and performance of system-level languages.

Conventional wisdom asserts these needs are in tension, and people have been advised to compromise depending on the most pressing requirements. But recent progress in programming language theory and practice has slowly but surely reduced these gaps.

We designed D as a modern language that replies with “and”s wherever people would have otherwise needed to answer with “or”s. In reply to your question – research? scientific applications? web environments? – I’d say: all of the above, and many more.

In the web space in particular, D could do great but doesn’t benefit from extensive libraries and frameworks like Ruby or PHP. I wish I could convince a serious hacker to bring things to the point where

<?d writeln("Hello, world!"); ?>

could be inserted in a web page.

Even today people write lean and mean web apps in D. Consider Vladimir Panteleev’s NNTP-HTTP proxy. It converts a large Usenet newsgroup to HTML format in real time, and is written exclusively in D. I may be subjective, but the thing is very snappy, unlike similar server-side software written in other languages.

There would be a lot more to say, but I won’t get into details that can be easily found. Let me just say one surprising thing we learned. It’s difficult to keep a language in good balance. It’s beneficial and tempting to restrict the design space: everything is a list, hash table, or object; everything is mutable or immutable; everything is eager or lazy; and so on.

Doing too much of that robs the language of expressive power (lists are great, except when you need a contiguous array). Doing too little leads to non-committal wish-wash. We tried hard to not fall into either of these extremes, and I feel we haven’t done too badly at it. We still have work to do on bringing implementation quality on par with the ambition of the design, and of course adding library facilities.

7. What advice would you give to a beginner server-side developer and what to an open-source author / contributor?

Meh, advice-shmadvice. Who am I to give advice? What’s experience good for in our field? I used to know how to write a keyboard driver and floppy disc formatters with non-standard densities. I’d be all but obsolete today if I hadn’t learned continuously.

Oh, so maybe this would be one piece of advice: learn how to learn, and stick with principles; mastering individual technologies will follow.

Technology is a great servant but a terrible master. The best people in our métier are those who know how to quickly become experts in some particular field.

8. Also, what kind of advice can you give to developers who are considering applying to Facebook? What kind of skills is Facebook looking for in a potential candidate? Is it really important to be a CS graduate? What kind of skills do the majority of Facebook employees possess?

First and foremost, the ideal Facebook engineering candidate should be good at coding. I know this sounds silly, but a surprising fraction of the people we interview could improve their coding skills instead of focusing on mastering specific languages or technologies. We look for good generalists able to move freely within the organization.

Feel free to choose your favorite programming language when interviewing. The typical Facebook interviewer accepts a choice of 3 – 6 languages, subject to the ones she’s comfortable assessing coding ability in.

The candidates I personally interviewed used Java, C, C++, C#, Javascript, Python, Ruby, PHP, Pascal, and even pseudo-code. (I recall I recommended “hire” for the pseudo-code guy.)

We do care about one’s ability to code; code is quintessential for us, for engineers from front-end to back-end to system configurators to researchers. You know how you figure someone can play basketball, or act, in a mere few seconds? Same goes for coding – you can tell whether one can code by seeing how they approach implementing some simple algorithm.

Consider for example someone who chooses C or C++. Then a good warmup question I might ask is “Implement strstr()”. It has a simple spec (so we don’t waste time with explanations) and allows the candidate to show they can code.

Just do a clean implementation of the brute force algorithm – no need to memorize Boyer-Moore and others. The canonical solution looks somewhat like this:

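char* strstr(char* haystack, char* needle) {
    // Try each starting position in haystack in turn.
    for (;; ++haystack) {
        char* h = haystack;
        for (char* n = needle;; ++n, ++h) {
            if (!*n) return haystack;  // reached the end of needle: match
            if (*h != *n) break;       // mismatch: try the next start
        }
        if (!*h) return NULL;          // haystack ended before needle did: no match
    }
}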

Using indexes is fine, too, as are myriads of alternatives. Ten people implement it in eleven ways, which is fine as long as things don’t get too complicated or plain broken. The code above is a good baseline because it’s simple – no special casing for “the empty string is a substring of any string including the empty string”, “a longer string cannot be a substring of a shorter string” etc.

The function organizes computation systematically towards achieving the result. Compare with e.g. the more complex solutions at http://goo.gl/U4poN (the short function above implements the optimized algorithm given there).
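For comparison, here is one possible index-based variant of the same brute-force search, as a sketch rather than a reference answer. The name `findSubstring` is hypothetical, chosen to avoid clashing with the standard library’s strstr, and the const-qualified signature is a choice made here.

```cpp
#include <cstddef>

// Index-based brute-force substring search.
// Returns a pointer to the first occurrence of needle in haystack,
// or nullptr if there is none. An empty needle matches at the start.
const char* findSubstring(const char* haystack, const char* needle) {
    for (std::size_t start = 0;; ++start) {
        std::size_t i = 0;
        // Advance while the needle keeps matching at this start offset.
        while (needle[i] != '\0' && haystack[start + i] == needle[i]) {
            ++i;
        }
        if (needle[i] == '\0') return haystack + start;   // whole needle matched
        if (haystack[start + i] == '\0') return nullptr;  // haystack exhausted
    }
}
```

As with the pointer version, the empty-string and needle-longer-than-haystack cases need no special handling; they fall out of the two checks after the inner loop.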

This question is a Facebook interview classic, known all over the Net (http://goo.gl/glnAz). I ask much more complicated questions when interviewing (some I don’t even know the answer to), but a candidate unable to lift strstr (or similar) off the ground cannot be a Facebook engineer. It is surprising how many fail at it or get it wrong.

I focused above on the coding part because it’s a gating factor. We have other important criteria, such as cultural fit and design / architectural abilities. Even specialists (in e.g. machine learning) must impress when it comes to coding and other general skills, in addition to being great at their specialty (our recruiters pair such candidates with engineers particularly strong in the same area).

The ideal Facebook engineer is a great hacker, a strong generalist (possibly with exceptional depth in some area in addition), and an adaptable person comfortable working in small, fluid teams.

Regarding degrees, I asked Sean Murphy, lead recruiter with Facebook. His exact answer was:

“It’s helpful to have a strong CS background but not a requirement. We have many engineers who are doing really well with degrees in math/physics/symbolic systems and other areas or no degree at all. We have some really smart people and some really creative people with no CS background. For foreign candidates, it’s much easier to get a work visa with a degree but we’ve hired some engineers with stellar experience and no degree.”

9. Where can we catch you this year?

Coming up is Microsoft’s GoingNative 2012 conference on Feb 2-3, where I’ll give two talks and attend a panel (in a star-studded company that will probably put me to shame: Bjarne Stroustrup, Herb Sutter, Hans Boehm). Nevertheless this is very exciting.

Later in the year I’ll be at the C++ and Beyond event on Aug 5-8 in Asheville, NC, and at the LASER Summer School on Software Engineering.

I’ve submitted proposals to the Strange Loop Conference and OSCON, and I’ll announce the results on my website. A couple of us at Facebook also plan to work on a Machine Learning paper, which we’ll probably submit to NIPS.

Last but not least, I hope to attend and possibly participate in some of Facebook’s tech talks, and I warmly invite like-minded hackers to join.

10. Do you have any future projects you wish to share with us?

At Facebook we’re working on some quite cool performance improvements to Hadoop / Hive. That work is not exactly “server-side” in the sense that it’s oriented more towards offline storage and retrieval than traditional online server-side data access. We hope to open source the project some time this year.

This year may also see the launch of some of Facebook’s core C++ library code. We’re quite excited about that; there is some really cool stuff in there, most of which is directly aimed at high-performance server-side computing. Definitely something to watch for.

Outside Facebook, my copious free time is dedicated to working on the D programming language. I think D is very relevant to server-side software and will become more relevant this year.

The issue I see at this point is that we don’t have a strong champion to put together last-mile integration libraries (e.g. MySQL, Apache, CGI, and such), and push them into the standard library. George, if you know someone just let me know :o.

