Ask Ghassem - Recent questions tagged data-science

How to analyse an imbalanced categorical column in a dataset

Sat, 24 Jun 2023 17:55:23 +0000

Hello,

I have a dataset with a categorical column that contains three categories. One of the categories represents 98% of the data, while the remaining 2% are distributed between the other two categories, with a few (maybe around 50) in each. It is worth mentioning that the output for these 50 rows is the same, which suggests that these data points may be important.

However, the data is obviously imbalanced, and I am unable to perform any analysis. Should I drop the entire column, or perform a chi-square test on the data as-is?

Can you verify the validity of this chart comparing the review scores for Marvel Phase 4?

Mon, 09 Jan 2023 16:29:14 +0000

I have some skepticism about the validity of the charts below comparing the critic and audience reviews for Phase 4 of the MCU to the previous 3 phases. There are over 18 movies and TV shows in Phase 4, compared to the 6 movies in Phases 1 & 2 and the 11 movies in Phase 3. Also, there are far fewer critic reviews for the Phase 4 TV shows than for the Phase 4 movies. For example, on Rotten Tomatoes there are only 40 critic reviews for The Falcon and the Winter Soldier and 452 critic reviews for Black Widow. Could this uneven and inconsistent number of reviews between TV shows and movies in Phase 4 be inaccurately making the overall averages higher than they should be? Or do you agree with the conclusions presented in the charts?

https://cdn.discordapp.com/attachments/997145183172964435/1059948060194652230/image.png

https://cdn.discordapp.com/attachments/997145183172964435/1049356020469739520/image.png

Creating tables from unstructured texts about the stock market

Tue, 02 Aug 2022 00:47:49 +0000

I am trying to extract information such as profits and revenues, along with their corresponding dates and quarters, from unstructured text about the stock market and convert it into a report in table form. Because the input text has no fixed format, it is hard to know which entity belongs to which date and quarter, and which value belongs to which entity. Chunking works on a few documents, but not on enough of them. Is there an unsupervised way to link entities with their corresponding dates, values, and quarters?

How do I compare the count of a value in each year when the sample size differs each year?

Wed, 08 Jun 2022 10:32:33 +0000

How do I accurately compare the number of something a survey measures from my employees each year, given varying survey engagement and a varying number of employees?

Suppose I measure the satisfaction of my employees over the years by surveying them each year, asking whether they are satisfied or not, and then comparing the number of yeses over the years. The number of employees who answer is not the same each year, and the number of employees increases every year. How do I correctly compare the results from year to year?

In other words, how do I remove the effect of the survey engagement rate when calculating the results?
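
One common way to make the years comparable is to compare the share of satisfied respondents rather than the raw count of yeses, and to attach a confidence interval so that years with fewer responses show more uncertainty. The sketch below assumes made-up yearly counts and a hypothetical helper named satisfaction_rate; the interval only reflects sampling noise and does not correct for which employees choose to answer.

```python
import math

def satisfaction_rate(yes_count, respondents, z=1.96):
    """Return (rate, low, high): share of 'yes' answers with a Wilson score interval."""
    if respondents <= 0:
        raise ValueError("respondents must be positive")
    p = yes_count / respondents
    denom = 1 + z**2 / respondents
    centre = (p + z**2 / (2 * respondents)) / denom
    half = z * math.sqrt(p * (1 - p) / respondents + z**2 / (4 * respondents**2)) / denom
    return p, centre - half, centre + half

# Hypothetical survey results: year -> (yes answers, total respondents)
surveys = {2020: (60, 80), 2021: (90, 150), 2022: (130, 200)}

for year, (yes, n) in surveys.items():
    rate, low, high = satisfaction_rate(yes, n)
    print(f"{year}: {rate:.1%} satisfied (95% CI {low:.1%} to {high:.1%}, n={n})")
```

If the intervals of two years overlap heavily, the difference between those years may be no more than noise from the differing response counts.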

Is it possible to make a forecast of a future value of Air Temperature using Fast Fourier Transform?

Thu, 02 Jun 2022 16:10:26 +0000

Is it possible to forecast a future value of air temperature using the Fast Fourier Transform? If yes, what would the process be, and how would I go about it? Thank you!
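
One possible approach is to take the FFT of the history, keep only the strongest frequency components, and evaluate those sinusoids at future time steps. The sketch below assumes evenly spaced readings and a temperature series that is close to periodic; the function name fft_forecast, the parameter n_components, and the synthetic data are all hypothetical. An extrapolation of this kind can only repeat the cycles it found, so it will not capture trends or weather events.

```python
import numpy as np

def fft_forecast(series, n_future, n_components=3):
    """Extend a series by repeating its strongest Fourier components."""
    x = np.asarray(series, dtype=float)
    n = x.size
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n)          # frequencies in cycles per sample
    # Indices of the largest-magnitude components (bin 0 carries the mean).
    strongest = np.argsort(np.abs(spectrum))[::-1][:n_components]
    t = np.arange(n, n + n_future)      # future time steps
    forecast = np.zeros(n_future)
    for k in strongest:
        # Bin 0 and, for even n, the Nyquist bin occur once in the full
        # spectrum; every other bin has a mirrored twin, hence the factor 2.
        once = k == 0 or (n % 2 == 0 and k == n // 2)
        scale = 1.0 if once else 2.0
        amplitude = scale * np.abs(spectrum[k]) / n
        phase = np.angle(spectrum[k])
        forecast += amplitude * np.cos(2 * np.pi * freqs[k] * t + phase)
    return forecast

# Synthetic hourly temperatures with a 24-hour cycle (made-up data).
hours = np.arange(96)
temps = 20 + 5 * np.sin(2 * np.pi * hours / 24)
next_day = fft_forecast(temps, n_future=24)
print(np.round(next_day[:6], 2))
```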

forecast log transformed fitted values for 2 years using ARMA model

Wed, 04 May 2022 20:31:44 +0000

Input is a stock price in exponential transformation. We are asked to forecast using ARMA results for 2 years.

Bankruptcy prediction and credit card

Sun, 10 Apr 2022 05:50:14 +0000

Hello everyone, newbie data scientist here.
I'm working on a project to predict companies' bankruptcy probability (probability of default) and to assign them a credit rating/score based on that.
For example, a probability below 50% is good and above is bad (just for the example).
I have a dataset that contains financial ratios and a class label indicating whether the company went bankrupt or not (0 or 1).
I'm planning to use these models:
Logistic regression, linear discriminant analysis, decision trees, random forest, ANN, AdaBoost, SVM.

The question is, and I know it is a dumb question:
Do those models return a probability that I can transform into labels? I saw that in a thesis and I'm not sure about it.

Otherwise, any guidance, tips, or anything else will be appreciated.

I cannot get this code to work. please help.

Mon, 21 Mar 2022 05:59:53 +0000

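from keras.models import Sequential
from keras.layers import Dense, LSTM
from sklearn.model_selection import train_test_split

# One LSTM layer followed by a linear output layer, for regression.
model = Sequential()
model.add(LSTM(10, input_shape=(1, 1)))  # expects X shaped (samples, 1, 1)
model.add(Dense(1, activation="linear"))
model.compile(loss="mse", optimizer="adam")

# get_data() is not shown in the post; it must return the features X and targets y.
X, y = get_data()

# Hold out 20% for testing, then 25% of the remainder for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train_2, X_val, y_train_2, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

# Train on the reduced split so the validation rows are not also used for fitting.
model.fit(X_train_2, y_train_2, epochs=800, validation_data=(X_val, y_val), shuffle=False)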

Battery data projects

Wed, 02 Mar 2022 18:11:57 +0000

Where can I find projects related to battery data?

How can you build dynamic pricing model with data only from rigid pricing?

Fri, 21 Jan 2022 06:44:31 +0000

I want to build a dynamic pricing model. That means that if a product is too expensive for a client and there is a risk that we might lose the client, we lower the price for them, but if the client doesn't care that much about the price, we might increase the price a little.

All the articles I've seen describe some kind of A/B testing for the pricing and then create a model.

I want to build a model only on the existing rigid pricing data. So I have prices offered to customers and I know who bought the product and who went to other company.

How can I do the increasing price part?

Do you usually collect your own data, or is there always a resource available for you? Or does it depend on the company?

Sun, 09 Jan 2022 22:13:34 +0000

When dealing with categorical values, should the 'year' column be encoded using OHE or OrdinalEncoder?

Sat, 18 Dec 2021 18:46:07 +0000

It's a car prices dataset, and so I'm assuming that the more recent the more value a car should have. The values in the 'year' column simply consist of years from 1995 to 2020.
I am trying to predict the selling price of the car.

I'm a bit new to ML, currently still doing my undergraduate so any help / tips are appreciated. Thank you.

How do I know which encoder to use to convert from categorical variables to numerical?

Mon, 29 Nov 2021 04:09:06 +0000

So say I have a column with categorical data like different styles of temperature: 'Lukewarm', 'Hot', 'Scalding', 'Cold', 'Frostbite',... etc.

I know that we can use pd.get_dummies to convert the column to numerical data within the dataframe, but I also know that there are other 'converters' (not sure if that's the correct terminology) that we can use, i.e. OneHotEncoder from Sk-learn (like I could use the pipeline module to make a nice pipeline and feed my dataframe through the pipeline to also get my categorical data encoded to numerical).

How do I know which to use? Does it matter? If it does matter, when does it matter the most (i.e. what types of problems? When there are lots of categorical variables, or few?) If anyone can give me any pointers on this type of stuff I'd greatly appreciate it.

Text Mining, Artificial Neural Networks, Speech Processing, Cloud Computing in DS? Essential for a good Data Scientist ?

Wed, 27 Oct 2021 19:15:16 +0000

Should I start as a data analyst then data science?

Mon, 21 Jun 2021 20:31:04 +0000

should I start as a data analyst then data science?

I am a second-year Bachelor's student in Computer Science and want to become a Data Scientist.

However, when I try to apply for internships/jobs, most of them require a Master's/Ph.D.

But a Data Analyst role has fewer requirements.

Do you recommend starting off as a Data Analyst and then changing to Data Science?

How to calculate average with deviating sensors?

Tue, 04 May 2021 14:39:14 +0000

Three sensors each report a large number of values, and one sensor might be off. The average of the two trustworthy sensors should be reported; the third, which needs recalibration, should be ignored. I need an (Excel) formula that looks at three columns and, row by row, detects a significant deviation of one value from the others and calculates the average of the most trustworthy values.
Example:
48.1 ; 45.2 ; 45.4 => 45.3, as sensor 1 is way off.
36.0 ; 37.0 ; 45.0 => 36.5, as sensor 3 is way off.
36.0 ; 36.5 ; 37.0 => 36.5, as the deviation is too small to be considered an anomaly, so all values are valid for the average.

Over long periods of time, the readings from a sensor might be trustworthy for a few weeks but defective from moment X onwards, so simply ruling out one sensor is not really an option either. What is the best way forward?
Please help. Highly appreciated.
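
One way to express the row-by-row rule is to average only the readings that lie close to the row's median, so a single deviating sensor is dropped without being ruled out permanently. The sketch below is in Python rather than Excel, and the tolerance of 2.0 is an assumption chosen to fit the three example rows; the function name trusted_average is hypothetical.

```python
from statistics import mean, median

def trusted_average(readings, tolerance=2.0):
    """Average the readings that lie within `tolerance` of the row median."""
    mid = median(readings)
    trusted = [r for r in readings if abs(r - mid) <= tolerance]
    return mean(trusted)

# The three example rows from the post.
rows = [
    (48.1, 45.2, 45.4),
    (36.0, 37.0, 45.0),
    (36.0, 36.5, 37.0),
]
for row in rows:
    print(row, "=>", round(trusted_average(row), 1))
```

Because the decision is made per row, a sensor that drifts from moment X onwards is ignored only in the rows where it deviates. The same idea should carry over to a spreadsheet by combining a median with a conditional average.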

design a computer-based system that will encourage autistic children to communicate and express themselves better.

Thu, 01 Apr 2021 07:04:59 +0000

a) A company has been asked to design a computer-based system that will encourage autistic children to communicate and express themselves better.

b) What type of interaction would be appropriate to use at the interface for this particular user group?

guidance on sequencing data science courses below

Fri, 20 Mar 2020 13:55:49 +0000

Hello,
my name is Lutaaya Mudathiru.

I am planning to start online professional data science courses at Harvard University, but I don't know which course I should begin with. I request help in sequencing the courses below so that I can benefit more:

1. Principles, Statistical and Computational Tools for Reproducible Science
2. Data Science: Inference and Modeling
3. Data Science: Productivity Tools
4. Data Science: Wrangling
5. Data Science: Linear Regression
6. Data Science: Machine Learning
7. Data Science: Capstone
8. Data Science: R Basics
9. Data Science: Visualization
10. Data Science: Probability
11. High-Dimensional Data Analysis
12. Introduction to Linear Models and Matrix Algebra
13. Data Science: Statistics and R
14. Fat Chance: Probability from the Ground Up
15. Introduction to Probability (on edX)

What are the differences among Data Science, Artificial Intelligence and Machine Learning?

Thu, 05 Mar 2020 03:02:31 +0000

What are the differences among Data Science, Artificial Intelligence and Machine Learning?

What are the most common data types in data science?

Wed, 19 Feb 2020 17:28:54 +0000

What are the main data types?

How can I prove that the following is a tautology (using laws of logical equivalences)?

Sat, 15 Feb 2020 17:54:02 +0000

$(p → q) ∧ (q → r) → (p → r)$

Without Truth Tables!
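
One possible derivation using only equivalence laws is sketched below. It assumes that $∧$ binds tighter than $→$, so the formula is read as $((p → q) ∧ (q → r)) → (p → r)$, and it writes $T$ for a statement that is always true.

```latex
\begin{align*}
& ((p \to q) \land (q \to r)) \to (p \to r) \\
&\equiv \lnot\big((\lnot p \lor q) \land (\lnot q \lor r)\big) \lor (\lnot p \lor r)
  && \text{implication law} \\
&\equiv \big((p \land \lnot q) \lor (q \land \lnot r)\big) \lor (\lnot p \lor r)
  && \text{De Morgan, double negation} \\
&\equiv \big((p \land \lnot q) \lor \lnot p\big) \lor \big((q \land \lnot r) \lor r\big)
  && \text{associativity, commutativity} \\
&\equiv \big((p \lor \lnot p) \land (\lnot q \lor \lnot p)\big) \lor \big((q \lor r) \land (\lnot r \lor r)\big)
  && \text{distributivity} \\
&\equiv \big(T \land (\lnot q \lor \lnot p)\big) \lor \big((q \lor r) \land T\big)
  && \text{negation law} \\
&\equiv (\lnot q \lor \lnot p) \lor (q \lor r)
  && \text{identity law} \\
&\equiv (\lnot q \lor q) \lor (\lnot p \lor r)
  && \text{associativity, commutativity} \\
&\equiv T \lor (\lnot p \lor r)
  && \text{negation law} \\
&\equiv T
  && \text{domination law}
\end{align*}
```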

How to filter a dataframe?

Wed, 25 Dec 2019 05:56:14 +0000

Consider the Pandas DataFrame df below. Filter it appropriately so that it outputs the shown results.

     gh owner language      repo  stars
0  pandas-dev   python    pandas  17800
1   tidyverse        R     dplyr   2800
2   tidyverse        R   ggplot2   3500
3      has2k1   python  plotnine   1450

Expected Output

     gh owner language    repo  stars
0  pandas-dev   python  pandas  17800
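
Several filters would produce that single row, and the post does not say which rule is intended. The sketch below rebuilds the frame and assumes the rule is "Python repositories with more than 10,000 stars"; the threshold is a guess made only for illustration.

```python
import pandas as pd

# Rebuild the DataFrame shown in the question.
df = pd.DataFrame({
    "gh owner": ["pandas-dev", "tidyverse", "tidyverse", "has2k1"],
    "language": ["python", "R", "R", "python"],
    "repo": ["pandas", "dplyr", "ggplot2", "plotnine"],
    "stars": [17800, 2800, 3500, 1450],
})

# Assumed rule: Python repositories with more than 10,000 stars.
# Each condition is a boolean Series; & combines them element-wise.
result = df[(df["language"] == "python") & (df["stars"] > 10000)]
print(result)
```

Each condition needs its own parentheses because & binds more tightly than the comparison operators in Python.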

Can anyone please solve Q7 pg 329 in Induction?

Thu, 05 Dec 2019 22:02:15 +0000

Individual and group relative strength in a fixed pool of players: How to approach the problem?

Tue, 29 Oct 2019 20:00:28 +0000

I apologize in advance if my question sounds too basic to be worthy of anyone's time, but statistics are not part of my curriculum.

I am developing a proof of concept of a web application modeling the contribution of individual soccer players with respect to the different teams they've played with throughout their careers. In particular, I am looking into a way of ranking both individuals and groups of players as follows:

teammates relative strength: the best/worst combinations of players when playing in the same team in the same matches;
opponents relative strength: the best/worst combinations of players when playing in opposite teams in the same matches, i.e. which tuples of teammates are the best/worst against which;

I must admit I don't quite know how to approach the problem (as I said I have no formal education in statistics or data science). I would be very grateful if anyone could give me some directions. How should I frame this particular problem and what resources in statistics or machine learning (if indeed this is a task fit for machine learning, perhaps I am mistaken on this) would be appropriate to tackle it?

I am eager to learn, so both practical examples or theoretical references (book chapters, online articles, etc) would be very welcome.

Thanks in advance!

ideas and opinion on what kind of analyses needs to be done

Fri, 26 Jul 2019 18:38:55 +0000

Hello Guys,

I have been given a task where I need to analyse the impact on customers and sales of all the products that were discontinued last year. I work for a retail company which has hundreds of stores in North America, and my analysis needs to be "Qualitatively better than the regular BI".

I quote that because that's what I was asked to do. I am struggling to come up with ideas. What kind of hypotheses should I test? Are there any Bayesian analyses that can be done? Any ideas within the realms of data science and machine learning are welcome.

I generally do my analysis in Jupyter Notebook, and my skill set consists of SQL and Python, including ML libraries like sklearn.

What are the most important Python libraries for data science?

Mon, 08 Jul 2019 04:42:28 +0000

Is using aggregate data to generate observation-level data statistically sound?

Tue, 11 Jun 2019 22:04:01 +0000

Context: I work in Paid Search Marketing. Current reporting does not provide event-level data, only aggregate totals with different segments. I want to compare distributions and test the statistical significance of A/B test results. I did not want to assume that the data follows a normal distribution, and I do not know the standard deviation of the data, so I came up with this approach.

My question: I am going to take the average "CPA" or "CTR" for a date range and generate one observation per conversion from that average. Is this a statistically sound way to generate raw data? Would I get wonky distributions because of the repeated averages? I just want a gut check on whether I'm completely off base.

My Aggregate data looks like below:

Day	Cost	Acquisition	CPA or CTR
1	40	2	$20
2	75	3	$25

Observation data I generate looks like below:

Day	Acquisition
1	$20
1	$20
2	$25
2	$25
2	$25

I really appreciate your help with this question! An important project to me at work.
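
The expansion described above can be sketched in a few lines, which also makes one consequence visible: every generated row for a day equals that day's mean, so the spread within a day is zero. The helper name expand is hypothetical, and the numbers are the two aggregate rows from the post.

```python
from statistics import pstdev

# Aggregate rows from the post: (day, cost, acquisitions)
aggregate = [(1, 40, 2), (2, 75, 3)]

def expand(rows):
    """Repeat each day's average CPA once per acquisition."""
    observations = []
    for day, cost, acquisitions in rows:
        cpa = cost / acquisitions
        observations.extend((day, cpa) for _ in range(acquisitions))
    return observations

observations = expand(aggregate)
for day, cpa in observations:
    print(day, f"${cpa:.0f}")

# Every generated row for a day is that day's mean, so nothing varies within a day.
day_two = [cpa for day, cpa in observations if day == 2]
print("within-day standard deviation:", pstdev(day_two))
```

Any test run on data built this way would see only the variation between days, not the variation between individual conversions.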

What are the available libraries for continuous-time hidden Markov models?

Fri, 07 Jun 2019 13:27:05 +0000

Is digital marketing and marketing internships worth it for a data science student?

Thu, 04 Apr 2019 22:58:54 +0000

it's about data science career

How do I reduce RMSE in a Random Forest Regressor?

Wed, 03 Apr 2019 12:24:09 +0000

I preprocessed the data, normalized the numerical features, and did one hot encoding for the categorical ones. I end up with a model with R^2=0.7 and RMSE which is 15% of the range of values.
I'm okay with the accuracy but I was wondering if there's a way to reduce RMSE to maybe ~7%?

Let me know please.

Thanks!

Determine weights on the paths that connect to the different data points in a neural network?

Mon, 18 Mar 2019 23:35:25 +0000

How do you determine the weight values that connect to the other data points when solving for our output in neural networks?

I fit many different models to my data set and I still get horrible accuracy. Any suggestions?

Thu, 07 Mar 2019 17:53:00 +0000

I'm trying to create a regression with a data set that does not contain any nulls and has very few outliers. I fit a linear regression, a random forest, and a GBM model, but they all have terrible accuracy.

Any suggestions on how to move on from this point? Feeling like I've hit a road block.

Thanks!

How do you visualize/present the predicted output of a random forest model with large trees?

Thu, 28 Feb 2019 17:10:03 +0000

I’ve heard that it’s hard to visualize the output of random forest models with large trees/forest but I’m finding it hard to understand what the use case for the model is, if you can’t visualize the outputs? How do you use the predictions then? Is there a way to visualize this?

Do you have a cheatsheet for Data Science?!

Wed, 27 Feb 2019 05:51:00 +0000

What are degrees of freedom when calculating a confidence interval?

Fri, 07 Dec 2018 22:27:27 +0000

What is 'Degrees of Freedom'?

Thu, 29 Nov 2018 18:49:06 +0000

What is 'Degrees of Freedom'?

What is the difference between univariate and multivariate analysis?

Tue, 30 Oct 2018 11:39:08 +0000

How will you create a classification to identify key customer trends in unstructured data?

Sun, 28 Oct 2018 11:45:43 +0000

Explain the typical data analysis process.

Sun, 28 Oct 2018 11:43:46 +0000

What is the difference between Data Mining and Data Analysis?

Sun, 28 Oct 2018 11:42:45 +0000

Which scenarios among the following are valid reasons to use regularization?

Sat, 27 Oct 2018 17:31:43 +0000

A. To drop the least useful variables of a model

B. To reduce over-fitting

C. To reduce the bias of a model

D. To decrease p-value

What are the basic steps for treating missing values?

Fri, 19 Oct 2018 04:08:48 +0000

What are the general steps in data cleaning?

Fri, 19 Oct 2018 03:54:38 +0000

What is the k-means algorithm and how can we select K for it?

Mon, 15 Oct 2018 05:49:41 +0000

Explain recommender systems and give an example.

Mon, 15 Oct 2018 05:47:14 +0000

What are the main steps in making a decision tree?

Fri, 12 Oct 2018 02:17:19 +0000

Please explain Linear Regression with an example?

Fri, 12 Oct 2018 02:14:00 +0000

How to return the outliers from a list of numbers?

Mon, 08 Oct 2018 12:19:22 +0000
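
The question does not say how an outlier is defined. A minimal sketch, assuming the common interquartile-range rule (values more than 1.5 times the IQR beyond the quartiles); the function name iqr_outliers and the sample list are made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Made-up readings with one unusually high and one unusually low value.
data = [10, 12, 11, 13, 12, 11, 95, 10, 12, -40]
print(iqr_outliers(data))
```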

What is Natural Language Processing (NLP) and what are its applications?

Mon, 08 Oct 2018 11:59:52 +0000

What is the TF-IDF algorithm?

Mon, 08 Oct 2018 11:57:39 +0000