By Peadar Coyle

Interview with a Data Scientist – Cameron Davidson Pilon

Saturday, January 26, 2019 12:28

(Reprint from a few years ago)

Cameron is an open source contributor, a pythonista and a data geek – he’s developed various cool libraries. His blog is worth a read, and I personally recommend his screencasts.

He has a strong mathematical background, like myself, and he currently works in data science at Shopify as Lead Data Analyst. He’s possibly most famous in the Python community for his excellent Bayesian Methods for Hackers. I also had the honour of contributing to that project.

1. What project that you have worked on do you wish you could go back to and do better?

1. For sure, it was my projects during 2012, when I first started to enter Kaggle competitions. The two in particular I wish I could redo were the Twitter Psychopaths challenge and the US Census Return Rate challenge. In both challenges I made some serious high-level errors (but that’s the point of these challenges: to discover mistakes before they happen when it really matters!). I’ve detailed my mistake in the US Census challenge in my latest PyData presentation, “Mistakes I’ve Made”. Basically, I ignored population variance and replaced it with machine learning egotism. Oh, I also remembered another project I would really love to go back to. In 2011, when I was doing research into stochastic processes, I started my first Python library (if you could even call it that), called PyProcess. You can still see it here: https://github.com/camdavidsonpilon/pyprocess.

Notice that it is, embarrassingly, one large file filled with Python classes. The first iteration didn’t even use NumPy! I would love to go back and redo the entire thing, but two things hold me back: 1) it was a lot of work to test each stochastic process and make sure it was doing the right thing, and 2) I’m so far out of the field now.

(Editor note: I personally used PyProcess during some of my Financial Mathematics coursework and always meant to try to add to the project, but never did)

2. What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?

2. If you’re not already learning and using Python or Scala, do that. Similarly, if you’re not already learning some software engineering, do that. What are some examples of data science software engineering?

– writing (close to) professional-level code
– thinking about proper abstractions, writing testable pieces, thinking about reusability
– having code reviewed, and reviewing code yourself
– writing tests

Why do I emphasize programming and software development so much? At a high level, data science is about using computers to do statistics for you. If you can’t properly use the former, then the most important tool in your toolbox is missing.

3. What do you wish you knew earlier about being a data scientist?

3. I wish I, and the rest of the field, knew about data cleaning. This is an important part of the whole data story and is glossed over. Specifically, the ETL pipeline (extract-transform-load). What I used to do was use SQL for the T part, but this caused too many problems (untestable, unmaintainable, unscalable). Now the transformation is done before I even use the data for anything remotely complicated. This saves me time later, and allows the entire team to scale and benefit from my work (yes, I am still writing ETLs, and I expect all my team members to, too). The problem is, you can’t really teach ETLs until you have the data problem. Small companies (I mean really small companies) and tutorials online can assume data is fine. Not until one is submerged in changing data does the ETL process start to make sense. So, though I wish I knew this earlier, I probably couldn’t have learned it anyway!
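
(Editor note: to illustrate what a testable "T" step might look like outside SQL, here is a minimal sketch in Python. It is not Cameron's pipeline; the record fields `order_id`, `country`, `amount` and `created_at`, and the function names `transform_order` and `transform`, are hypothetical and chosen only for illustration.)

```python
from datetime import datetime


def transform_order(raw):
    """Clean one raw order record (a dict of strings) into typed fields.

    Returns None for rows that cannot be repaired, so the load step can
    skip them. All field names here are hypothetical.
    """
    try:
        return {
            "order_id": int(raw["order_id"]),
            "country": raw["country"].strip().upper(),
            # Store money as integer cents to avoid float drift downstream.
            "amount_cents": round(float(raw["amount"]) * 100),
            "created_at": datetime.strptime(raw["created_at"].strip(), "%Y-%m-%d"),
        }
    except (KeyError, ValueError, TypeError, AttributeError):
        return None


def transform(rows):
    """Apply transform_order to every row and drop the unrepairable ones."""
    cleaned = (transform_order(row) for row in rows)
    return [row for row in cleaned if row is not None]


raw_rows = [
    {"order_id": "7", "country": " ca ", "amount": "19.99", "created_at": "2015-03-01"},
    {"order_id": "oops", "country": "US", "amount": "5", "created_at": "2015-03-02"},
]
clean_rows = transform(raw_rows)  # the second row is dropped: bad order_id
```

Because each step is a plain function from input to output, it can be unit-tested and code-reviewed like any other piece of software, which is hard to do with a long SQL script.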

4. How do you respond when you hear the phrase ‘big data’?

4. Sure, “Big Data” is a buzzword, but I think the issue with the name “Big data” comes down to two camps: are you seeing “Big data” as a solution (probably wrong) or as a problem (probably right)? For example, two common questions an organization might have are 1) find the number of unique visitors to our site in the past month, and 2) find the median of this dataset. If your data is simply too big for memory, which is a good definition of big data, then we can’t solve either of these problems naively. What is really interesting about big data as a problem is the abundance of cool new algorithms and data structures being invented to solve these problems. For example, HyperLogLog estimates the number of unique values in a set of data too big for memory. And TDigest estimates the percentiles of data too big for memory (and hence can’t be sorted).
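
(Editor note: to make the HyperLogLog idea concrete, here is a minimal sketch in plain Python. It is a teaching toy, not a production implementation and not necessarily what Cameron uses; the class name `HyperLogLog`, the choice of 12 index bits, and the use of SHA-1 as the hash function are assumptions made for illustration.)

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog sketch: fixed memory, approximate distinct count."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take a reproducible 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for visitor_id in range(100_000):
    hll.add(visitor_id)
    hll.add(visitor_id)  # a repeat visit never changes a register
estimate = hll.count()   # approximate number of distinct visitors
```

The sketch keeps only 4,096 small counters however many visitors are added. The typical relative error of HyperLogLog is about 1.04 divided by the square root of the number of registers, so roughly 1.6% at this size.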

5. What is the most exciting thing about your field?

5. I’ve already mentioned the interesting new algorithms for big data problems, so I won’t go over them again, but I do think they are very exciting. Another exciting thing is the new problems being discovered, and the solutions being used. For example, the problem of what to recommend to visitors to a site is a new problem that has massive impact, and is being solved by data. I can’t imagine Fisher or Pearson ever asking the question “what should I recommend next to this user?”. In a similar vein, we *are* seeing the reemergence of classical statistics. Classical techniques like survival analysis, clinical trials, and logistic regression are seeing a major comeback because new problems have been identified.

6. How do you go about framing a data problem?

6. Honestly, I try to turn it into a binomial problem. I use the beta-binomial model as a large crutch far too often, but it’s a really good initial model of a problem. If I can turn the problem into a binomial problem, then I have lots of tools I can work with: Bayesian analysis, sample-size appropriate ranking techniques, Bayesian Bandits, etc. If I can’t turn it into a binomial problem, I go through the rest of my toolbox: survival analysis, lifetime value, Bayesian modeling, classification, association analysis, etc. If I still can’t find an appropriate solution, then I have to expand my scope (and often learn a new tool while doing that).
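
(Editor note: for readers who have not met the beta-binomial model, here is a minimal sketch in Python of the conjugate update and of one Bayesian bandit step via Thompson sampling. The function names, the uniform Beta(1, 1) prior, and the example counts are assumptions for illustration, not Cameron's code.)

```python
import random


def beta_binomial_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior after observing binomial data.

    A Beta(prior_a, prior_b) prior plus `successes` out of `trials`
    gives a Beta(prior_a + successes, prior_b + failures) posterior.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return prior_a + successes, prior_b + (trials - successes)


def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def thompson_pick(arms, rng=random):
    """One Bayesian bandit step: sample each posterior, pick the largest draw.

    `arms` maps an arm name to a (successes, trials) pair.
    """
    draws = {
        name: rng.betavariate(*beta_binomial_posterior(s, n))
        for name, (s, n) in arms.items()
    }
    return max(draws, key=draws.get)


# Hypothetical counts: arm A has little data, arm B has a lot.
arms = {"A": (3, 10), "B": (40, 200)}
a, b = beta_binomial_posterior(*arms["A"])   # Beta(4, 8) under a uniform prior
mean_a = posterior_mean(a, b)                # 4 / 12
next_arm = thompson_pick(arms, random.Random(0))
```

The appeal of this model as a first attempt is that the update is just addition: the posterior carries both the estimated rate and, through its width, how much data backs it up.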

Source: https://peadarcoyle.com/2019/01/26/interview-with-a-data-scientist-cameron-davidson-pilon/
