Recently, I have been building a data pipeline to transfer data from a relational database management system (RDBMS) to the Hadoop ecosystem (HDFS). With limited time and resources, I had to build the pipeline from scratch within seven days. It was also suggested that I should complete the entire thing within three days.
At that point, I knew almost nothing about Hadoop ecosystem configuration or Docker. I understood the concept of a distributed system, but configuring one by myself seemed almost impossible. It is easy to be scared of what…
Reporting is the foundation of any business. Every day, you have to take in new data from a report to decide where to go next. A report can come in various formats, such as a Microsoft Excel file, a web application, or an export from an enterprise resource planning (ERP) system.
I recently got a request to build a dashboard that replicates the business numbers in a hand-crafted report. The finance team creates this report manually every month. The most tedious step is exporting the source files from the SAP system and manually placing them in Excel. …
As a data scientist/analyst, your job is to produce reports that contain insights for business decisions. A report can be made with tools such as Microsoft Excel or SAP, or customized with a programming language such as SAS, R, or Python. The result can be sent to stakeholders by internal email or published through a centralized dashboard.
Like many others, I am a data analyst who uses Python daily to make reports and presentations. My usual assignment is to produce an ad hoc analysis within 2–3 hours and present it to the management team.
We all have limited time in our life.
Twenty-four hours a day is relatively short if you have many things to achieve. We all dream of a productive life in which we get everything we want done with ease.
However, life is not that easy and throws so many distractions at you, especially in 2020.
We all have social media, entertainment platforms, online publications, etc., in our hands.
We can spend a whole day on them without getting bored. This is quite different from ten to twenty years ago.
The more time we spend on those distractions, the less number…
A machine learning pipeline is an essential part of a data application. We build it to transform raw data into insightful predictions. The pipeline contains many steps, such as data ingestion, data preprocessing, feature engineering, model fitting, and performance evaluation.
When data scientists start developing an ML pipeline, they try to build the whole pipeline quickly and then iterate by changing hyper-parameters to get the best result. There are many hyper-parameters to tweak in this process.
It would be best if we could track the variation of those hyper-parameters. We will gain a deeper understanding of our ML…
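To make the idea of tracking hyper-parameter variations concrete, here is a minimal sketch using only the Python standard library. It is not the tool or method from the article; the `toy_score` function, the parameter names `learning_rate` and `max_depth`, and the grid values are all hypothetical stand-ins for a real fit-and-evaluate step.

```python
import itertools
import json


def toy_score(learning_rate, max_depth):
    """Hypothetical stand-in for model fitting plus performance evaluation."""
    # A real pipeline would train and evaluate a model here.
    return 1.0 - abs(learning_rate - 0.1) - 0.01 * abs(max_depth - 5)


def run_experiments(grid):
    """Try every hyper-parameter combination and record what was tried."""
    runs = []
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # Keep the parameters next to the score so each run can be compared later.
        runs.append({"params": params, "score": toy_score(**params)})
    return runs


# Assumed search grid, for illustration only.
grid = {"learning_rate": [0.01, 0.1, 0.5], "max_depth": [3, 5, 7]}
runs = run_experiments(grid)
best = max(runs, key=lambda run: run["score"])
print(json.dumps(best))
```

Keeping each run as a plain dictionary means the full history can be written to a file or loaded into a table later, which is the kind of record that makes it possible to see how each hyper-parameter affects the result.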
Learning new trends from watching a Korean Netflix series.
Spoiler alert: this article may contain information about this drama. Please feel free to skip it for now if you have not watched it yet. But, if you don’t mind, let’s dive in!
Recently, I watched the Netflix series called STARTUP. It’s a Korean drama that airs every Saturday and Sunday at 9 PM. The story is about a group of people who dream of establishing their own startup business.
Seems straightforward and not that interesting, right?
But the exciting part is that the main character of this series is…
Data analytics, data science, and data engineering have grown much more popular in the last few years, creating a new standard for the industry. Every company needs to invest in or establish a data office within its organization.
In 2020, it has become standard to have a prediction model for marketing leads, to improve your check-in method with facial recognition, or to consult an elegant dashboard when making a business decision.
Exceptional use cases always come first to build the momentum of the analytics trend. Executives want to see results before investing massive funds in a new direction.
I pointed out the importance of data quality and its common issues in the previous article.
The quicker you realize the problem with your data, the better you can deliver a valid conclusion to drive the business.
When you have limited time for analysis, I hope this tutorial serves as a checklist for verifying the condition of your data before you present to an audience.
Today I will show you code snippets for checking the data condition. The topics cover units of analysis, missing values, duplicated records, whether your data makes sense, and truth changing over time.
The tutorial will be…
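As a rough illustration of three of the checks listed above (units of analysis, missing values, and whether the data makes sense), here is a minimal standard-library sketch. It is not the article's own code; the `records` sample, the column names, and the rule that sales should not be negative are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical records; in practice these would come from your own data source.
records = [
    {"customer_id": 1, "month": "2020-01", "sales": 120.0},
    {"customer_id": 2, "month": "2020-01", "sales": None},
    {"customer_id": 1, "month": "2020-01", "sales": 120.0},
    {"customer_id": 3, "month": "2020-02", "sales": -5.0},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def duplicated_keys(rows, key_columns):
    """Return key tuples that appear more than once (unit-of-analysis check)."""
    seen = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in seen.items() if n > 1]


def out_of_range(rows, column, minimum):
    """Return rows whose value is present but below a sanity threshold."""
    return [r for r in rows if r[column] is not None and r[column] < minimum]


print(missing_counts(records))
print(duplicated_keys(records, ["customer_id", "month"]))
print(out_of_range(records, "sales", 0))
```

Each helper returns plain Python objects, so the results can be asserted on in a quick pre-presentation check or printed into a short data-quality summary.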
Hi, everybody who is reading this right now.
If you clicked on this story, I would like to thank you for reaching this page.
My name is Pathairush Seeda, or you can call me PAT. I was born in Thailand, and I’m now 28 years old. I’m the younger brother in a Thai family.
I’m now working as a data scientist/engineer at one of Thailand's top 50 listed companies.
I have also been writing on Medium since October 2020. …
Time is limited; you have to spend it wisely
In the working world, everyone is in a rush. A company's high-level executives have calendars filled with important meetings.
Your one-month project has to be wrapped up and presented to them in 30 minutes or less. You have to give them all the information they need to make a decision.
Everything has to be well prepared.
There is no room for struggle, confusion, or ambiguity. The presentation deck needs to be clear and precise enough to move forward with action.
How could I make…