In this tutorial, I illustrate how to build a dataset from text. As an example, I consider a birth register that contains the following text:
On August 21 1826 a son was born to John Bon and named him Francis.
On June 11 1813 a daughter was born to James Donne naming her Mary Sarah.
On January 1 1832 a son was born to his father David Borne and named him John.
Each row of the document contains a birth registration. All the birth registrations have almost the same structure, although they differ in some details. …
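As a minimal sketch of how the three sample rows above could be turned into structured records, the snippet below uses a single regular expression. The pattern, the field names (birth_date, sex, father, child_name) and the helper parse_register are illustrative assumptions, not the tutorial's own method, and the pattern only covers the variations visible in these three rows.

```python
import re
from datetime import datetime

# The three sample registrations quoted above.
TEXT = """On August 21 1826 a son was born to John Bon and named him Francis.
On June 11 1813 a daughter was born to James Donne naming her Mary Sarah.
On January 1 1832 a son was born to his father David Borne and named him John."""

# Hypothetical pattern: it tolerates the optional "his/her father" phrase
# and both "and named him/her" and "naming him/her".
PATTERN = re.compile(
    r"On (?P<date>\w+ \d{1,2} \d{4}) a (?P<child>son|daughter) was born to "
    r"(?:(?:his|her) father )?(?P<father>[A-Z]\w+(?: [A-Z]\w+)*) "
    r"(?:and named|naming) (?:him|her) (?P<name>[A-Z]\w+(?: [A-Z]\w+)*)\."
)


def parse_register(text):
    """Return one dict per registration line that matches PATTERN."""
    records = []
    for line in text.splitlines():
        match = PATTERN.fullmatch(line.strip())
        if match is None:
            continue  # skip rows whose structure the pattern does not cover
        birth = datetime.strptime(match["date"], "%B %d %Y").date()
        records.append({
            "birth_date": birth.isoformat(),
            "sex": "M" if match["child"] == "son" else "F",
            "father": match["father"],
            "child_name": match["name"],
        })
    return records


records = parse_register(TEXT)
for record in records:
    print(record)
```

Rows that do not match are skipped rather than guessed at, so a real register would likely need the pattern extended as new variations turn up.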
As data scientists, we are constantly told that we need to understand machine learning because it is one of the tools that lets us do our job. I understand that many newcomers to the field learn machine learning without a deeper understanding of the concepts and equations, relying only on using the algorithms.
The most important foundation for understanding machine learning is math knowledge. When you hear "math", it will inevitably remind you of high school lessons: hard, confusing, and theoretical. Machine learning math surely is similar, but in this modern era, it’s different from the…
When coding in real life, I sometimes forget syntax and need to resort to Google. Sadly, this luxury isn’t available during coding interviews. To address this, I’ve been reviewing common syntax patterns in Python for coding interviews. Syntax isn’t as important as understanding core algorithms and data structure concepts in the first place, but for me, reviewing syntax instills confidence in my code and saves me invaluable time. I hope it does the same for you.
sorted(numbers) will return the sorted numbers in ascending order and leave the original numbers unchanged. You can also use
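A quick way to check the behavior of sorted described here is the small sketch below. The list contents are made-up sample values.

```python
numbers = [4, 1, 3, 2]  # sample values, chosen only for illustration

# sorted() builds and returns a new list in ascending order.
ascending = sorted(numbers)

print(ascending)  # [1, 2, 3, 4]
print(numbers)    # [4, 1, 3, 2] (the original list is left unchanged)
```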
There are a lot of engineers who have never been involved in statistics or data science. But when they build data science pipelines or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. For those data/ML engineers and novice data scientists, I am writing this series of posts. I’ll try to explain some basic approaches in plain English and, based on them, explain some basic data science concepts.
The whole series:
Defining the type of variable you are working with is always the first step…
WeightWatcher is based on theoretical research (done jointly with UC Berkeley) into Why Deep Learning Works, based on our Theory of Heavy-Tailed Self-Regularization (HT-SR). It uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.
Are your models over-trained? The weightwatcher tool can detect the signatures of overtraining in specific layers of pre/trained Deep Neural Networks.
In the Figure above, fig (a) is well trained, whereas fig (b) may be over-trained. That orange spike on the far right is the tell-tale clue; it’s what we call a Correlation Trap.
Weightwatcher can detect the signatures of overtraining…
“I thought AlphaGo was based on probability calculations, and it was merely a machine. But when I saw this movie, I changed my mind. Surely AlphaGo is creative. The move was really creative and beautiful” (AlphaGo documentary, 52:10–52:40).
In this quote, Lee Sedol, the greatest player to ever touch the game of Go, reacted to the infamous move 37 in one of his games against the reinforcement learning agent AlphaGo.
This highlights the kind of magical aura that surrounds machine learning and especially deep learning. …
Google Sheets is a very powerful (and free) tool for creating spreadsheets. I’ve almost replaced LibreOffice Calc with Sheets because it’s very comfortable to work with. Sometimes, a data scientist has to pull some data from a Google Sheet into a Python notebook. In this article, I’ll show you how to do it using just Pandas.
The first thing to do is to create a Google Sheet. For this example, it will contain just 2 columns, one of which (the Age) has one missing value.
This is the dataset we’re going to work with.
Now we have to make it…
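One common way to read a sheet with just Pandas is to request its CSV export and pass that URL to pandas.read_csv. The sketch below assumes the sheet is shared so that anyone with the link can view it, and that the export endpoint takes this form. The helper names sheet_csv_url and load_sheet and the SHEET_ID placeholder are hypothetical, and this may differ from the steps the article goes on to describe.

```python
from urllib.parse import urlencode


def sheet_csv_url(sheet_id, gid=0):
    """Build the CSV export URL for one tab (gid) of a Google Sheet.

    Assumption: the sheet is viewable by anyone with the link.
    """
    query = urlencode({"format": "csv", "gid": gid})
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?{query}"


def load_sheet(sheet_id, gid=0):
    """Read the sheet into a DataFrame; empty cells become NaN."""
    import pandas as pd  # imported lazily so the URL helper works without pandas

    return pd.read_csv(sheet_csv_url(sheet_id, gid))


# SHEET_ID is a placeholder; copy the real ID from the sheet's address bar.
url = sheet_csv_url("SHEET_ID")
print(url)
```

Calling load_sheet with a real sheet ID should return the two columns, with the missing Age cell showing up as NaN.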
Every data scientist spends most of their time on data visualization, preprocessing, and model tuning based on the results. These are the toughest tasks for every data scientist, because you get a good model only when you perform all three steps precisely. There are 10 very helpful Jupyter notebook extensions to help in these circumstances.
Qgrid is a Jupyter notebook widget that uses SlickGrid to render pandas DataFrames within a Jupyter notebook. This allows you to explore your DataFrames with intuitive scrolling, sorting and filtering controls, as well as edit your DataFrames by double-clicking cells.
If you work in a scientific field, you should try to build a deep and unbiased understanding of that field. This not only educates you in the best possible way but also helps you envision the opportunities in your space.
A research paper is often the culmination of a wide range of deep and authentic practices surrounding a topic. When writing a research paper, the author thinks critically about the problem, performs rigorous research, evaluates their processes and sources, organizes their thoughts, and then writes. These genuinely-executed practices make for a good research paper.
If you’re struggling to build a…
Data structures and algorithms are some of the most essential topics for programmers, both to get a job and to do well on a job. Good knowledge of data structures and algorithms is the foundation of writing good code.
If you are familiar with essential data structures (e.g. array, string, linked list, tree, map) and advanced data structures like tries and self-balancing trees such as AVL trees, you’ll know when to use which data structure and how to estimate the CPU and memory cost of your code.
Even though you don’t need to write your own array, linked list, or hashtable, given…
Developer, Data Analyst & Trying to explore the Best Version of Myself ❤