We are now at 86 questions.

These are mostly open-ended questions, to assess the technical horizontal knowledge of a senior candidate for a rather high-level position, e.g. director.

1. What is the biggest data set that you processed, and how did you process it, what were the results?
2. Tell me two success stories about your analytic or computer science projects? How was lift (or success) measured?
3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
6. How would you come up with a solution to identify plagiarism?
7. How to detect individual paid accounts shared by multiple users?
8. Should click data be handled in real time? Why? In which contexts?
9. What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?
10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation?
11. How do you handle missing data? What imputation techniques do you recommend?
12. What is your favorite programming language / vendor? Why?
13. Tell me 3 things positive and 3 things negative about your favorite statistical software.
14. Compare SAS, R, Python, Perl.
15. What is the curse of big data?
16. Have you been involved in database design and data modeling?
17. Have you been involved in dashboard creation and metric selection? What do you think about Birt?
18. What features of Teradata do you like?
19. You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)

20. Toad or Brio or any other similar clients are quite inefficient to query Oracle databases. Why? What would you do to increase speed by a factor 10, and be able to handle far bigger outputs?
21. How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?
22. What are hash table collisions? How are they avoided? How frequently do they happen?
23. How to make sure a mapreduce application has good load balance? What is load balance?
24. Examples where mapreduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC's solution offering a hybrid approach - both internal and external cloud - to mitigate the risks and offer other advantages (which ones)?
25. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
26. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
27. Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)
28. What is star schema? Lookup tables?
29. Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data) Would the result be good? (Excel has numerical issues, but it's very interactive)
30. Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc. How, and by how much?
31. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
32. Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.
33. What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?
34. Do you think 50 small decision trees are better than a large one? Why?
35. Is actuarial science not a branch of statistics (survival analysis)? If not, how so?
36. Give examples of data that does not have a Gaussian distribution, nor lognormal. Give examples of data that has a very chaotic distribution?

37. Why is mean square error a bad measure of model performance? What would you suggest instead?
38. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
39. What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
40. Compare logistic regression w. decision trees, neural networks. How have these technologies been vastly improved over the last 15 years?
41. Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or sample?
42. How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)
43. Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
44. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.
45. How would you define and measure the predictive power of a metric?
46. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set - the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
47. How to create a keyword taxonomy?
48. What is a Botnet? How can it be detected?
49. Any experience with using API's? Programming API's? Google or Amazon API's? AaaS (Analytics as a service)?
50. When is it better to write your own code than using a data science software package?
51. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?
52. What is POC (proof of concept)?

53. What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?
54. Are you familiar with software life cycle? With IT project life cycle - from gathering requests to maintenance?
55. What is a cron job?
56. Are you a lone coder? A production guy (developer)? Or a designer (architect)?
57. Is it better to have too many false positives, or too many false negatives?
58. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
59. How does Zillow's algorithm work? (to estimate the value of any home in US)
60. How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
61. How would you create a new anonymous digital currency?
62. Have you ever thought about creating a startup? Around which idea / concept?
63. Do you think that typed login / password will disappear? How could they be replaced?
64. Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?
65. Which data scientists do you admire most? Which startups?
66. How did you become interested in data science?
67. What is an efficiency curve? What are its drawbacks, and how can they be overcome?
68. What is a recommendation engine? How does it work?
69. What is an exact test? How and when can simulations help us when we do not use an exact test?
70. What do you think makes a good data scientist?
71. Do you think data science is an art or a science?
72. What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points - each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?
73. Give a few examples of "best practices" in data science.
74. What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?
75. Do you know a few "rules of thumb" used in statistical or computer science? Or in business analytics?
76. What are your top 5 predictions for the next 20 years?
77. How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?
78. Testing your analytic intuition: look at these three charts. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatter-plots? Are there other ways to visually represent this type of data?
79. You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations?
80. More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).
81. How many "useful" votes will a Yelp review receive? My answer: Eliminate bogus accounts (read this article), or competitor reviews (how to detect them: use taxonomy to classify users, and location - two Italian restaurants in same Zip code could badmouth each other and write great comments for themselves). Detect fake likes: some companies (e.g. FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything, those who hate everything. Have a blacklist of keywords to filter fake reviews. See if IP address or IP block of reviewer is in a blacklist such as "Stop Forum Spam". Create honeypot to catch fraudsters. Also watch out for disgruntled employees badmouthing their former employer. Watch out for 2 or 3 similar comments posted the same day by 3 users regarding a company that receives very few reviews. Is it a brand new company? Add more weight to trusted users (create a category of trusted users). Flag all reviews that are identical (or nearly identical) and come from same IP address or same user. Create a metric to measure distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
82. What did you do today? Or what did you do this week / last week?
83. What/when is the latest data mining book / article you read? What/when is the latest data mining conference / webinar / class / workshop / training you attended? What/when is the most recent programming skill that you acquired?
84. What are your favorite data science websites? Who do you admire most in the data science community, and why? Which company do you admire most?
85. What/when/where is the last data science blog post you wrote?
86. In your opinion, what is data science? Machine learning? Data mining?
87. Who are the best people you recruited and where are they today?
88. Can you estimate and forecast sales for any book, based on Amazon public data? Hint: read this article.
89. What's wrong with this picture?
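
The hint in the question about permutations and the factorial number system can be made concrete with a short sketch. This is only one possible answer, not the reference solution the question links to. It assumes integers are numbered from 0 to n! - 1 (rather than 1 to n!) and that permutations are lists over 0..n-1; the function names int_to_permutation and permutation_to_int are made up for this illustration.

```python
from math import factorial


def int_to_permutation(k, n):
    """Encode an integer 0 <= k < n! as a permutation of 0..n-1 (assumed convention)."""
    if not 0 <= k < factorial(n):
        raise ValueError("k must satisfy 0 <= k < n!")
    # Factorial-base digits of k, least significant first:
    # the digit for base b lies in 0..b-1.
    digits = []
    for base in range(1, n + 1):
        k, d = divmod(k, base)
        digits.append(d)
    digits.reverse()  # most significant digit first
    # Each digit selects one of the elements that have not been used yet.
    pool = list(range(n))
    return [pool.pop(d) for d in digits]


def permutation_to_int(perm):
    """Decode a permutation of 0..n-1 back into its integer."""
    n = len(perm)
    pool = list(range(n))
    k = 0
    for i, p in enumerate(perm):
        d = pool.index(p)  # rank of p among the unused elements
        pool.pop(d)
        k += d * factorial(n - 1 - i)
    return k


if __name__ == "__main__":
    for k in range(factorial(3)):
        perm = int_to_permutation(k, 3)
        print(k, perm, permutation_to_int(perm))
```

Decoding with list.index makes this sketch quadratic in n, which is fine for illustration. For the sampling part of the previous question, decoding a uniformly drawn integer below n! yields a uniform random permutation, although shuffling a list directly is usually simpler.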
