A fresh scrape from Glassdoor gives us a good idea about what applicants are asked during a data scientist interview at some of the top companies. Unfortunately for us, almost every company has their interviewees sign NDAs. Since Glassdoor allows anonymity, a few brave souls have given us some fantastic examples of what they were asked during the interview process at top companies like Facebook, Google, and Microsoft.


General Questions 


  1. Suppose you’re given millions of users that each have hundreds of transactions and these millions of transactions are for tens of thousands of products. How would you group the users together in meaningful segments?



  1. Describe a project you’ve worked on and how it made a difference.


  2. How would you approach a categorical feature with high cardinality?


  3. What would you do to summarize a Twitter feed?


  4. What are the steps for wrangling and cleaning data before applying machine learning algorithms?


  5. How do you measure distance between data points?


  6. Define variance.


  7. Describe the differences between and use cases for box plots and histograms.



  1. What features would you use to build a recommendation algorithm for users?



  1. Pick any product or app that you really like and describe how you would improve it.


  2. How would you find an anomaly in a distribution?


  3. How would you go about investigating if a certain trend in a distribution is due to an anomaly?


  4. How would you estimate the impact Uber has on traffic and driving conditions?


  5. What metrics would you consider using to track if Uber’s paid advertising strategy to acquire new customers actually works? How would you then approach figuring out an ideal customer acquisition cost?



  1. (Big Data Engineer) Can you explain what REST is?

Machine Learning Questions 


  1. Why do you use feature selection?


  2. What is the effect on the coefficients of logistic regression if two predictors are highly correlated? What are the confidence intervals of the coefficients?


  3. What’s the difference between Gaussian Mixture Model and K-Means?


  4. How do you pick k for K-Means?


  5. How do you know when Gaussian Mixture Model is applicable?


  6. Assuming a clustering model’s labels are known, how do you evaluate the performance of the model?



  1. What’s an example of a machine learning project you’re proud of?


  2. Choose any machine learning algorithm and describe it.


  3. Describe how Gradient Boosting works.


  4. (Data Mining) Describe the decision tree model.


  5. (Data Mining) What is a neural network?


  6. Explain the Bias-Variance Tradeoff.

  7. How do you deal with unbalanced binary classification?


  8. What’s the difference between L1 and L2 regularization?



  1. What sort of features could you give an Uber driver to predict if they will accept a ride request or not? What supervised learning algorithm would you use to solve the problem and how would you compare the results of the algorithm?


  1. Name and describe three different kernel functions and in what situation you would use each.


  2. Describe a method used in machine learning.


  3. How do you deal with sparse data?



  1. How do you prevent overfitting?


  2. How do you deal with outliers in your data?


  3. How do you analyze the performance of the predictions generated by regression models versus classification models?


  4. How do you assess logistic regression versus simple linear regression models?


  5. What’s the difference between supervised learning and unsupervised learning?


  6. What is cross-validation and why would you use it?


  7. What’s the name of the matrix used to evaluate predictive models?


  8. What relationships exist between a logistic regression’s coefficient and the Odds Ratio?


  9. What’s the relationship between Principal Component Analysis (PCA) and Linear & Quadratic Discriminant Analysis (LDA & QDA)?


  10. If you had a categorical dependent variable and a mixture of categorical and continuous independent variables, what algorithms, methods, or tools would you use for analysis?


  11. (Business Analytics) What’s the difference between logistic and linear regression? How do you avoid local minima?



  1. What data and models would you use to measure attrition/churn? How would you measure the performance of your models?


  2. Explain a machine learning algorithm as if you’re talking to a non-technical person.


Capital One

  1. How would you build a model to predict credit card fraud?


  2. How do you handle missing or bad data?


  3. How would you derive new features from features that already exist?


  4. If you’re attempting to predict a customer’s gender, and you only have 100 data points, what problems could arise?


  5. Suppose you were given two years of transaction history. What features would you use to predict credit risk?


  6. Design an AI program for Tic-tac-toe.



  1. Explain overfitting and what steps you can take to prevent it.


  2. Why does SVM need to maximize the margin between support vectors?




  1. How would you use Map/Reduce to split a very large graph into smaller pieces and parallelize the computation of edges according to the fast/dynamic change of data?


  2. (Data Engineer) Given a list of followers in the format: (123, 345), (234, 678), (345, 123), … where column one is the ID of the follower and column two is the ID of the followee, find all mutual following pairs (the pair 123, 345 in the example above). How would you use Map/Reduce to solve the problem when the list does not fit in memory?
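One possible approach to the mutual-follow question above is to key every edge by its unordered pair, so that both directions of a mutual follow arrive at the same reducer. The sketch below simulates the map, shuffle, and reduce phases in plain Python on the three example rows. The function names are hypothetical, and on a real cluster the framework would do the grouping rather than an in-memory dictionary.

```python
from collections import defaultdict


def map_edge(follower, followee):
    """Emit the unordered pair as the key and the edge direction as the value."""
    key = (min(follower, followee), max(follower, followee))
    return key, (follower, followee)


def reduce_pair(key, directions):
    """A pair is mutual when both directions were seen for the same key."""
    return key if len(set(directions)) == 2 else None


def mutual_pairs(edges):
    # Shuffle phase: group mapped values by key (the framework does this in practice).
    groups = defaultdict(list)
    for follower, followee in edges:
        key, value = map_edge(follower, followee)
        groups[key].append(value)
    # Reduce phase: keep only keys that saw both directions.
    results = (reduce_pair(key, values) for key, values in groups.items())
    return sorted(pair for pair in results if pair is not None)


edges = [(123, 345), (234, 678), (345, 123)]
print(mutual_pairs(edges))  # [(123, 345)]
```

Because each reducer only ever holds the values for one pair, the full list never needs to fit in memory on a single machine.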

Capital One

  1. (Data Engineer) What is Hadoop serialization?

  2. Explain a simple Map/Reduce problem.




  1. (Data Engineer) Write a Hive UDF that returns a sentiment score. For example, if good = 1, bad = -1, and average = 0, then for a restaurant review that states “Good food, bad service,” your score might be 1 – 1 = 0.


Capital One

  1. (Data Engineer) Explain how RDDs work with Scala in Spark.

Statistics & Probability Questions


  1. Explain Cross-validation as if you’re talking to a non-technical person.


  2. Describe a non-normal probability distribution and how to apply it.



  1. (Data Mining) Explain what heteroskedasticity is and how to solve it.



  1. Given Twitter user data, how would you measure engagement?



  1. What are some different Time Series forecasting techniques?


  2. Explain Principal Component Analysis (PCA) and the equations PCA uses.

  3. How do you solve Multicollinearity?


  4. (Analyst) Write an equation that would optimize the ad spend between Twitter and Facebook.


  1. What’s the probability you’ll draw two cards of the same suit from a single deck?



  1. What are p-values and confidence intervals?


Capital One

  1. (Data Analyst) If you have 70 red marbles, and the ratio of green to red marbles is 2 to 7, how many green marbles are there?

  2. What would the distribution of daily commutes in New York City look like?


  3. Given a die, would it be more likely to get a single 6 in six rolls, at least two 6s in twelve rolls, or at least one-hundred 6s in six-hundred rolls?

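The dice question above can be checked numerically with the binomial distribution. The sketch below assumes a fair die and reads "a single 6 in six rolls" as "at least one 6", which is an interpretation rather than something the question states. The helper name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p=1 / 6):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of P(X < k)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below


for k, n in [(1, 6), (2, 12), (100, 600)]:
    print(f"at least {k} six(es) in {n} rolls: {prob_at_least(k, n):.4f}")
```

Under that reading, the six-roll case is the most likely of the three, and the probability falls as the number of rolls grows.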


  1. What’s the Central Limit Theorem, and how do you prove it? What are its applications?


Programming & Algorithms


  1. (Data Analyst) Write a program that can determine the height of an arbitrary binary tree.



  1. Create a function that checks if a word is a palindrome.



  1. Build a power set.


  2. How do you find the median of a very large dataset?
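For the power set question above, one compact option in Python is to chain together the combinations of every size. This is a minimal sketch, not the only acceptable answer; an interviewer may equally expect a recursive or bitmask solution.

```python
from itertools import chain, combinations


def power_set(items):
    """Return every subset of items, from the empty set up to the full set."""
    items = list(items)
    sizes = range(len(items) + 1)
    return list(chain.from_iterable(combinations(items, r) for r in sizes))


print(power_set([1, 2, 3]))
# [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
```

A set of n items has 2^n subsets, so both time and space grow exponentially with the input size.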



  1. (Data Engineer) Code a function that calculates the square root (2-point precision) of a given number. Follow up: Avoid redundant calculations by now optimizing your function with a caching mechanism.



  1. Suppose you’re given two binary strings. Write a function that adds them together without using any built-in string-to-int conversion or parsing tools. For example, if you give your function binary strings 100 and 111, it should return 1011. What’s the space and time complexity of your solution?

  2. Write a function that accepts two already sorted lists and returns their union in a sorted list.
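For the binary-string addition question above, a straightforward approach is schoolbook addition from the rightmost digit with a carry. The sketch below compares characters directly instead of calling `int()`, and the function name `add_binary` is hypothetical.

```python
def add_binary(a, b):
    """Add two binary strings digit by digit, carrying as in long addition."""
    i, j = len(a) - 1, len(b) - 1
    carry = 0
    digits = []
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += 1 if a[i] == "1" else 0
            i -= 1
        if j >= 0:
            total += 1 if b[j] == "1" else 0
            j -= 1
        digits.append("1" if total % 2 else "0")
        carry = total // 2
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(digits)) or "0"


print(add_binary("100", "111"))  # 1011
```

Each input digit is visited once, so time and space are both O(max(m, n)) for inputs of length m and n.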



  1. (Data Engineer) Write some code that will determine if brackets in a string are balanced.


  2. How do you find the second largest element in a Binary Search Tree?


  3. Write a function that takes two sorted vectors and returns a single sorted vector.


  4. If you have an incoming stream of numbers, how would you find the most frequent numbers on-the-fly?


  5. Write a function that raises one number to another number, i.e. the pow() function.


  6. Split a large string into valid words and store them in a dictionary. If the string cannot be split, return false. What’s your solution’s complexity?

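The balanced-brackets question in the list above is commonly answered with a stack. The sketch below assumes the three bracket types `()`, `[]`, and `{}` and ignores every other character; the question itself does not specify which brackets count.

```python
def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    pairs = {")": "(", "]": "[", "}": "{"}
    openers = set(pairs.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


print(is_balanced("a(b[c]{d})"))  # True
print(is_balanced("(]"))          # False
```

The scan is a single pass, so it runs in O(n) time with O(n) extra space in the worst case.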


  1. What’s the computational complexity of finding a document’s most frequently used words?


  2. If you’re given 10 TB of unstructured customer data, how would you go about extracting valuable information from it?

Capital One

  1. (Data Engineer) How would you ‘disjoin’ two arrays (like JOIN for SQL, but the opposite)?

  2. Create a function that does addition where the numbers are represented as two linked lists.


  3. Create a function that calculates matrix sums.


  4. How would you use Python to read a very large tab-delimited file of numbers to count the frequency of each number?

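For the question above about counting numbers in a very large tab-delimited file, the key idea is to stream the file line by line so that only the counts, not the data, are held in memory. The sketch below keeps tokens as strings and skips empty fields; the function name and the file path in the usage comment are hypothetical.

```python
from collections import Counter


def count_numbers(lines):
    """Count how often each tab-separated token appears, one line at a time."""
    counts = Counter()
    for line in lines:
        counts.update(token for token in line.rstrip("\n").split("\t") if token)
    return counts


# Usage with a file on disk (the path is hypothetical):
# with open("numbers.tsv") as handle:
#     counts = count_numbers(handle)

print(count_numbers(["1\t2\t2\n", "3\t1\n"]).most_common(2))
```

If the number of distinct values is itself too large for memory, this approach no longer suffices and the counting would have to be partitioned, for example by hashing values to separate files.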


  1. Write a function that takes a sentence and prints out the same sentence with each word backwards in O(n) time.


  2. Write a function that takes an array, splits the array into every possible set of two arrays, and prints out the max difference between the two arrays’ minima in O(n) time.

  3. Write a program that does merge sort.


SQL Questions


  1. (Data Analyst) Define and explain the differences between clustered and non-clustered indexes.


  2. (Data Analyst) What are the different ways to return the rowcount of a table?



  1. (Data Engineer) If you’re given a raw data table, how would you perform ETL (Extract, Transform, Load) with SQL to obtain the data in a desired format?

  2. How would you write a SQL query to compute a frequency table of a certain attribute involving two joins? What changes would you need to make if you want to ORDER BY or GROUP BY some attribute? What would you do to account for NULLS?



  1. (Data Engineer) How would you improve ETL (Extract, Transform, Load) throughput?

Brain Teasers & Word Problems


  1. Suppose you have ten bags of marbles with ten marbles in each bag. If one bag weighs differently than the other bags, and you could only perform a single weighing, how would you figure out which one is different?



  1. You are about to hop on a plane to Seattle and want to know if you should carry an umbrella. You call three friends of yours that live in Seattle and ask each, independently, if it’s raining. Each of your friends will tell you the truth ⅔ of the time and mess with you by lying ⅓ of the time. If all three friends answer “Yes, it’s raining,” what is the probability that it is actually raining in Seattle?
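The Seattle question above is an application of Bayes' rule, and the answer depends on the prior probability of rain, which the question does not give. The sketch below takes the prior as a parameter and uses 1/2 in the example purely as an assumption; the function name is hypothetical.

```python
from fractions import Fraction


def posterior_rain(prior, truth=Fraction(2, 3), yes_answers=3):
    """P(rain | all friends say yes) by Bayes' rule, with independent friends."""
    says_yes_if_rain = truth**yes_answers         # every friend tells the truth
    says_yes_if_dry = (1 - truth) ** yes_answers  # every friend lies
    numerator = says_yes_if_rain * prior
    return numerator / (numerator + says_yes_if_dry * (1 - prior))


print(posterior_rain(Fraction(1, 2)))  # 8/9
```

With a different prior the result changes, so stating the assumed prior out loud is part of a complete answer.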


  1. Imagine you are working with a hospital. Patients arrive at the hospital in a Poisson Distribution, and the doctors attend to the patients in a Uniform Distribution. Write a function or code block that outputs the patient’s average wait time and total number of patients that are attended to by doctors on a random day.



  1. Imagine there is an ant in each corner of an equilateral triangle (three ants in total), and each ant randomly picks a direction and starts traversing the edge of the triangle. What’s the probability that none of the ants collide? What about if there are N ants sitting in the N corners of an equilateral polygon?

  2. How many trailing zeros are in 100 factorial (i.e. 100!)?

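For the trailing-zeros question above, each trailing zero comes from a factor of 10, and factors of 5 are scarcer than factors of 2 in a factorial, so it is enough to count the multiples of 5, 25, 125, and so on. The sketch below does that and cross-checks the result against the actual digits of 100!.

```python
from math import factorial


def trailing_zeros(n):
    """Count factors of 5 in n!, which equals the number of trailing zeros."""
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count


print(trailing_zeros(100))  # 24

# Cross-check against the actual digits of 100!
digits = str(factorial(100))
assert len(digits) - len(digits.rstrip("0")) == trailing_zeros(100)
```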


  1. Imagine you’re climbing a staircase that contains n stairs, and you can take any number of steps k at a time. How many distinct ways can you reach the top of the staircase? (This is a modification of the original stair step problem.)
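The staircase question above can be read in more than one way. The sketch below assumes that each move climbs between 1 and k stairs and that the order of moves matters; under that assumption a short dynamic program counts the ways. The function name `count_ways` is hypothetical.

```python
def count_ways(n, k):
    """Number of ordered move sequences that climb n stairs, 1..k stairs per move."""
    ways = [0] * (n + 1)
    ways[0] = 1  # one way to stand at the bottom: make no moves
    for stair in range(1, n + 1):
        # The last move could have started from any of the previous k stairs.
        ways[stair] = sum(ways[max(0, stair - k):stair])
    return ways[n]


print(count_ways(4, 2))  # 5
print(count_ways(4, 4))  # 8
```

When k is at least n, every stair except the last can independently be a landing point or not, which gives 2^(n-1) ways and matches the second call.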

Source: WeChat platform