Ask Ghassem - Recent questions in Data Science

How to analyse an imbalanced categorical column in a dataset

Sat, 24 Jun 2023 17:55:23 +0000

Hello,

I have a dataset with a categorical column that contains three categories. One of the categories represents 98% of the data, while the remaining 2% are distributed between the other two categories, with a few (maybe around 50) in each. It is worth mentioning that the output for these 50 rows is the same, which suggests that these data points may be important.

However, the data is obviously imbalanced, and I am unable to perform any analysis. Should I drop the entire column, or perform a chi-square test on the data as-is?

Can you verify the validity of this chart comparing the review scores for Marvel Phase 4?

Mon, 09 Jan 2023 16:29:14 +0000

I have some skepticism about the validity of the charts below comparing the critic and audience reviews for Phase 4 of the MCU to the previous 3 phases. There are over 18 movies and tv shows in Phase 4 compared to the 6 movies in Phases 1 & 2 and the 11 movies in Phase 3. Also, there are far fewer critic reviews for the Phase 4 tv shows than the Phase 4 movies. For example, on Rotten Tomatoes there are only 40 critic reviews for The Falcon and the Winter Soldier and 452 critic reviews for Black Widow. Could this uneven and inconsistent number of reviews between tv shows and movies in Phase 4 be inaccurately making the overall averages higher than they should be? Or do you agree with the conclusions presented in the charts?

https://cdn.discordapp.com/attachments/997145183172964435/1059948060194652230/image.png

https://cdn.discordapp.com/attachments/997145183172964435/1049356020469739520/image.png

How do I compare the count of a value in each year when the sample size differs each year?

Wed, 08 Jun 2022 10:32:33 +0000

How do I accurately compare a survey measure across years when both the number of employees and the survey response rate vary from year to year?

Suppose I measure employee satisfaction each year with a survey asking whether they are satisfied or not, and then compare the number of "yes" answers over the years. The number of employees who answer is not the same each year, and the total number of employees increases every year. How do I correctly compare the results across years?

In other words, how do I remove the effect of the survey engagement rate when calculating the results?
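One common way to take the response count out of the comparison is to compare the proportion of "yes" answers per year rather than the raw count, and to attach an interval that widens when fewer people respond. A minimal sketch, assuming made-up yearly counts (the numbers and the `wilson_interval` helper name are illustrative, not from the question):

```python
import math

def wilson_interval(yes, n, z=1.96):
    """Return (proportion, low, high) using the Wilson score interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = yes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half, centre + half

# Hypothetical data: year -> (number of "yes" answers, number of respondents)
survey = {2020: (60, 80), 2021: (90, 150), 2022: (130, 200)}

for year, (yes, n) in survey.items():
    p, low, high = wilson_interval(yes, n)
    print(f"{year}: {p:.1%} satisfied (95% interval {low:.1%} to {high:.1%}, n={n})")
```

This only adjusts for sample size. If the employees who answer differ systematically from those who do not, the proportion can still be biased, and that needs separate treatment.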

Is it possible to make a forecast of a future value of Air Temperature using Fast Fourier Transform?

Thu, 02 Jun 2022 16:10:26 +0000

Is it possible to make a forecast of a future value of Air Temperature using Fast Fourier Transform, if yes, what should be the process or how you'll be able to do it. Thank you!
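It is possible in a limited sense: the FFT can pick out the dominant periodic components (for example daily or yearly cycles), and those components can be extended past the end of the record. A minimal sketch with NumPy, assuming an evenly sampled series with no trend; the `fft_forecast` name, the number of harmonics, and the synthetic hourly data are all assumptions for illustration:

```python
import numpy as np

def fft_forecast(series, n_ahead, n_harmonics=3):
    """Extend a series using its mean plus its strongest Fourier harmonics."""
    y = np.asarray(series, dtype=float)
    n = y.size
    mean = y.mean()
    spectrum = np.fft.rfft(y - mean)
    freqs = np.fft.rfftfreq(n)  # frequencies in cycles per sample
    # Indices of the harmonics with the largest amplitude.
    keep = np.argsort(np.abs(spectrum))[-n_harmonics:]
    t_future = np.arange(n, n + n_ahead)
    forecast = np.full(n_ahead, mean)
    for k in keep:
        amp = np.abs(spectrum[k]) / n
        # Bins other than DC and Nyquist stand for a conjugate pair.
        if k != 0 and not (n % 2 == 0 and k == n // 2):
            amp *= 2
        phase = np.angle(spectrum[k])
        forecast += amp * np.cos(2 * np.pi * freqs[k] * t_future + phase)
    return forecast

# Synthetic hourly temperature: 20 degrees on average with a 24-hour cycle.
t = np.arange(240)
temps = 20 + 5 * np.sin(2 * np.pi * t / 24)
print(fft_forecast(temps, n_ahead=6).round(2))
```

Because the extension is purely periodic, it repeats past patterns and cannot anticipate trends or weather events. A trend would have to be modelled separately and added back.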

forecast log transformed fitted values for 2 years using ARMA model

Wed, 04 May 2022 20:31:44 +0000

Input is a stock price in exponential transformation. We are asked to forecast using ARMA results for 2 years.

Battery data projects

Wed, 02 Mar 2022 18:11:57 +0000

Where can I find projects related to battery data?

How can you build dynamic pricing model with data only from rigid pricing?

Fri, 21 Jan 2022 06:44:31 +0000

I want to build a dynamic pricing model, which means that if the product is too expensive for a client and there is a risk that we might lose the client, we lower the price for them, but if the client doesn't care that much about the price, we might increase the price a little.

All the articles I've seen describe some kind of A/B testing for the pricing and then create a model.

I want to build a model only on the existing rigid pricing data. So I have prices offered to customers and I know who bought the product and who went to other company.

How can I do the increasing price part?

What analytical software would be good for a company to use?

Fri, 14 Jan 2022 16:46:38 +0000

This would be for a company that is just now looking into using a software to track data for wine making.

How do I know which encoder to use to convert from categorical variables to numerical?

Mon, 29 Nov 2021 04:09:06 +0000

So say I have a column with categorical data like different styles of temperature: 'Lukewarm', 'Hot', 'Scalding', 'Cold', 'Frostbite',... etc.

I know that we can use pd.get_dummies to convert the column to numerical data within the dataframe, but I also know that there are other 'converters' (not sure if that's the correct terminology) that we can use, i.e. OneHotEncoder from Sk-learn (like I could use the pipeline module to make a nice pipeline and feed my dataframe through the pipeline to also get my categorical data encoded to numerical).

How do I know which to use? Does it matter? If it does matter, when does it matter the most (i.e. what types of problems? When there are lots of categorical variables, or few?) If anyone can give me any pointers on this type of stuff I'd greatly appreciate it.
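One distinction that often matters is whether the categories have a natural order. A minimal pandas-only sketch, assuming the temperature labels are ordered from coldest to hottest (that ordering is my assumption, not something stated in the question):

```python
import pandas as pd

df = pd.DataFrame({"temp": ["Lukewarm", "Hot", "Scalding", "Cold", "Frostbite", "Hot"]})

# One-hot encoding: one indicator column per category, no order implied.
one_hot = pd.get_dummies(df["temp"], prefix="temp")

# Ordinal encoding: a single integer column that preserves the assumed order.
order = ["Frostbite", "Cold", "Lukewarm", "Hot", "Scalding"]
ordinal = pd.Categorical(df["temp"], categories=order, ordered=True).codes

print(one_hot.shape)
print(list(ordinal))
```

As for `pd.get_dummies` versus scikit-learn's `OneHotEncoder`: the former is convenient for one-off exploration, while the latter is fitted on the training data and remembers its categories, so a pipeline produces the same columns for new data.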

ValueError: Length mismatch: Expected axis has 60 elements, new values have 2935849 elements

Fri, 26 Nov 2021 06:09:16 +0000

I'm creating a new data frame with the most used items grouped together. But I got the following error when grouping through ID and items. ValueError: Length mismatch: Expected axis has 60 elements, new values have 2935849 elements.

df = sales_df[sales_df['shop_id'].duplicated(keep=False)]
df['Grouped'] = sales_df.groupby('shop_id')['item_name'].transform(lambda x: ','.join(x))
df2 = df[['shop_id', 'Grouped']].drop_duplicates()

In the aforementioned code, I'm making a data frame with respect to shop id and then grouping through shop items. My objective here is to group items with similar ID.
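Since the goal appears to be one row per shop, one way to sidestep the length mismatch is to aggregate instead of using `transform`. A minimal sketch, assuming a tiny stand-in for `sales_df` with the same two columns (the sample rows are made up):

```python
import pandas as pd

# Hypothetical stand-in for sales_df with the two columns used in the question.
sales_df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "item_name": ["apple", "pear", "apple", "plum", "fig"],
})

# Keep only shops that appear more than once, as in the original filter.
multi = sales_df[sales_df["shop_id"].duplicated(keep=False)]

# agg returns one joined string per shop_id, so nothing has to be
# aligned back to the row-level frame.
df2 = (
    multi.groupby("shop_id")["item_name"]
         .agg(",".join)
         .reset_index(name="Grouped")
)
print(df2)
```

Aggregating this way also makes the `drop_duplicates` step unnecessary.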

Text Mining, Artificial Neural Networks, Speech Processing, Cloud Computing in DS? Essential for a good Data Scientist ?

Wed, 27 Oct 2021 19:15:16 +0000

Classification of data object might be incorrect

Mon, 25 Oct 2021 15:26:46 +0000

I am learning a new Salesforce product (Evergage) for the company I work for. In the program's documentation they have listed a set of data objects as an example. It appears to me that the classification might be incorrect. Their system makes a division between 'catalog objects' and 'profile objects' and the example they have given is a banking institution. They classified Customer Credit Card as a profile object and Credit Card Level as a catalog object. Seems to me that it should be the other way i.e Customer Credit Card = catalog object and Credit Card Level = profile object. Maybe I am not reading the context correctly?

Here is a link to an image with the complete classification: https://drive.google.com/file/d/1nG4aX4Ty_NoHxm04AQo1Ow61m3MZ3pXm/view?usp=sharing

Can Data Science solve this problem?

Sun, 24 Oct 2021 15:43:11 +0000

So, I live in Brazil, and I have a college assignment for which I don't know what data science method to use, if any. My idea is the following: we Brazilians have the Real (BRL) as our currency, and of course we track the dollar exchange rate to see "how many Reais a dollar is worth". I want to research whether national news has any influence on this price. For example, if Bolsonaro, our president, says some dumb stuff, the dollar goes up in price, and vice versa. I want to collect the dollar values and their variance over a set time interval, and use web scraping to gather news from some economy sites. Here's my question, then: how can I correlate the news with the dollar variance over a set time? Can data science do that? How do I preprocess this, if at all? Do I need to use bag-of-words? At least that's what I heard... Please help, and thank you for reading.

Should I start as a data analyst and then move to data science?

Mon, 21 Jun 2021 20:31:04 +0000

I am a second-year Bachelor's student in Computer Science and want to become a Data Scientist.

However, when I try to apply for internships/jobs, most of them require a Master's/Ph.D.

But a Data Analyst role has fewer requirements.

Do you recommend starting off as a Data Analyst and then changing to Data Science?

How best to ensure data quality?

Tue, 08 Jun 2021 22:02:23 +0000

Searching for movie dataset containing movie synopses/plots?

Thu, 27 May 2021 09:57:31 +0000

Hello,
To build a hybrid recommendation system, I used the MovieLens 1M dataset for the collaborative filtering part. Now I'm looking for a database/dataset that contains descriptions/summaries/details/synopses/plots of movies for the content-based recommendation part.
Could someone tell me where I can find such a dataset?
Thank you in advance.

How to calculate average with deviating sensors?

Tue, 04 May 2021 14:39:14 +0000

Three sensors each report a large number of values, and one sensor might be off. The average of the two trustworthy sensors should be reported, and the third, which needs recalibration, should be ignored. I need an (Excel) formula that looks at three columns and, row by row, detects a significant deviation of one value from the others and calculates the average of the most trustworthy values.
Example:
48.1 ; 45.2 ; 45.4 => 45.3, as sensor 1 is way off.
36.0 ; 37.0 ; 45.0 => 36.5, as sensor 3 is way off.
36.0 ; 36.5 ; 37.0 => 36.5, as the deviation is too small to be considered an anomaly, so all values are valid for the average.

Over long periods of time, a sensor's readings might be trustworthy for a few weeks but defective from moment X onwards, so simply ruling out one sensor is not really an option either. What is the best way forward?
Please help. Highly appreciated.
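One row-by-row approach is to take the median of the three readings as the reference and average only the readings that lie close to it. Because the decision is made per row, a sensor that drifts from moment X onwards is excluded only from then on. A minimal Python sketch of that logic, assuming a tolerance of 1.5 units (the threshold and the `robust_average` name are my assumptions and would need tuning to the real sensors):

```python
from statistics import mean, median

def robust_average(readings, tolerance=1.5):
    """Average the readings that lie within `tolerance` of the row median."""
    mid = median(readings)
    trusted = [r for r in readings if abs(r - mid) <= tolerance]
    return mean(trusted)

rows = [(48.1, 45.2, 45.4), (36.0, 37.0, 45.0), (36.0, 36.5, 37.0)]
for row in rows:
    print(row, "=>", round(robust_average(row), 2))
```

The same idea should carry over to a spreadsheet by combining MEDIAN with a conditional average such as AVERAGEIFS, although I have not tested a formula here.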

Terminology clarification in Spark

Sat, 06 Feb 2021 18:03:32 +0000

I have a hard time distinguishing the terminology of SparkSQL. While SparkSQL is quite flexible in terms of abstraction layers, it's really difficult for a beginner to navigate those options.

1. When we say " using SparkSQL to perform .....", does it mean that we can use any API/abstraction layers such as Scala, Python, HiveQL to query? As long as the core dataframe is in spark, we should be fine?

2. Can we manipulate data in both PySpark and Scala sequentially?

For example, may I clean up the data in Scala, then perform follow up manipulation in PySpark, then go back to Scala?

3. As demonstrated in the tutorial, we can query with a SQL command by using the API spark.sql("My SQL command"). Does it count as SQL or Spark?

My GloVe word embeddings contain sentiment?

Sun, 03 Jan 2021 14:09:37 +0000

I've been researching sentiment analysis with word embeddings. I read papers that state that word embeddings ignore sentiment information of the words in the text. One paper states that among the top 10 words that are semantically similar, around 30 percent of words have opposite polarity e.g. happy - sad.

So, I computed word embeddings on my dataset (Amazon reviews) with the GloVe algorithm in R. Then, I looked at the most similar words with cosine similarity and I found that actually every word is sentimentally similar. (E.g. beautiful - lovely - gorgeous - pretty - nice - love). Therefore, I was wondering how this is possible since I expected the opposite from reading several papers. What could be the reason for my findings?

Two of the many papers I read:

Yu, L. C., Wang, J., Lai, K. R. & Zhang, X. (2017). Refining Word Embeddings Using Intensity Scores for Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(3), 671-681.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T. & Qin, B. (2014). Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1: Long Papers, 1555-1565

is it possible to derive a new 95% CI from two separate 95% CIs?

Mon, 23 Nov 2020 14:45:19 +0000

Probability of a bus arriving at its destination on time based on weather conditions

Mon, 09 Nov 2020 13:06:47 +0000

A bus is making its way to a destination. If the weather conditions are favorable today, the likelihood of delay is 3%. If the weather conditions are not favorable today, the likelihood of delay is 50%. The forecast predicts that it is 20% likely that the weather conditions will be favorable today.

1. What is the likelihood that the bus will be delayed?

2. The bus has arrived, but it was delayed. Given that the bus was delayed, what is the likelihood that the weather conditions were favorable?
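Both parts follow from the law of total probability and Bayes' theorem. A short sketch of the arithmetic, using only the numbers given in the question (the variable names are mine):

```python
p_fav = 0.20                # P(favorable weather)
p_delay_given_fav = 0.03    # P(delay | favorable)
p_delay_given_unfav = 0.50  # P(delay | not favorable)

# Part 1: law of total probability.
p_delay = p_fav * p_delay_given_fav + (1 - p_fav) * p_delay_given_unfav

# Part 2: Bayes' theorem.
p_fav_given_delay = p_fav * p_delay_given_fav / p_delay

print(f"P(delay) = {p_delay:.3f}")
print(f"P(favorable | delay) = {p_fav_given_delay:.4f}")
```

So the bus is delayed with probability 0.406, and given a delay the weather was favorable with probability 0.006 / 0.406, roughly 1.5%.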

How to remove unwanted Jupyter notebook kernels?

Fri, 30 Oct 2020 17:15:17 +0000

Whenever I run Jupyter Notebook, some kernels are listed that do not exist on the system and generate errors. How can I remove them?

How can this data be structured for mongodb

Sun, 19 Jul 2020 19:08:50 +0000

https://prnt.sc/tkr2g7 Hello, I have a PFE about determining risks to pedestrians, and I have to build a simulator to generate related data. This is my first time working on this. I would like to know how the data should be structured. I will be working with MongoDB, so I would love to see an example in JSON.

guidance on sequencing data science courses below

Fri, 20 Mar 2020 13:55:49 +0000

Hello,
my name is Lutaaya Mudathiru.

I am planning to start the online data science professional courses at Harvard University, but I don't know which course I should begin with. I request help in sequencing the courses below so that I can benefit more:

1. Principles, Statistical and Computational Tools for Reproducible Science

2. Data Science: Inference and Modeling

3. Data Science: Productivity Tools

4. Data Science: Wrangling

5. Data Science: Linear Regression

6. Data Science: Machine Learning

7. Data Science: Capstone

8. Data Science: R Basics

9. Data Science: Visualization

10. Data Science: Probability

11. High-Dimensional Data Analysis

12. Introduction to Linear Models and Matrix Algebra

13. Data Science: Statistics and R

14. Fat Chance: Probability from the Ground Up

15. Introduction to Probability (on edX)

What are the differences among Data Science, Artificial Intelligence and Machine Learning?

Thu, 05 Mar 2020 03:02:31 +0000

What are the differences among Data Science, Artificial Intelligence and Machine Learning?

How to convert Jupyter Notebook or a webpage to PDF using Chrome?

Thu, 27 Feb 2020 18:39:19 +0000

Please show us how to print Google Colab notebooks, or any other webpage, to PDF in Google Chrome.

How to share a Jupyter Notebook document on Google Colab?

Thu, 27 Feb 2020 16:17:13 +0000

How can I give access to others to view, comment or edit my Jupyter Notebook on Google Colab?

What are the most common data types in data science?

Wed, 19 Feb 2020 17:28:54 +0000

What are the main data types?

How can I prove that following is a tautology (using laws of logical equivalences)

Sat, 15 Feb 2020 17:54:02 +0000

$(p → q) ∧ (q → r) → (p → r)$

Without Truth Tables!
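One possible derivation, assuming the formula is read as $[(p → q) ∧ (q → r)] → (p → r)$ and using the standard equivalence laws (the law names on the right may differ slightly from those in your textbook):

```latex
\begin{align*}
& [(p \to q) \land (q \to r)] \to (p \to r) \\
\equiv{} & \lnot[(\lnot p \lor q) \land (\lnot q \lor r)] \lor (\lnot p \lor r)
  && \text{implication law} \\
\equiv{} & (p \land \lnot q) \lor (q \land \lnot r) \lor \lnot p \lor r
  && \text{De Morgan, double negation} \\
\equiv{} & [(p \land \lnot q) \lor \lnot p] \lor [(q \land \lnot r) \lor r]
  && \text{commutative, associative} \\
\equiv{} & [(p \lor \lnot p) \land (\lnot q \lor \lnot p)] \lor [(q \lor r) \land (\lnot r \lor r)]
  && \text{distributive} \\
\equiv{} & (\lnot q \lor \lnot p) \lor (q \lor r)
  && \text{negation, identity} \\
\equiv{} & (\lnot q \lor q) \lor (\lnot p \lor r)
  && \text{commutative, associative} \\
\equiv{} & T \lor (\lnot p \lor r) \equiv T
  && \text{negation, domination}
\end{align*}
```

Since the formula reduces to $T$ by equivalences alone, it is a tautology.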

can someone send me an online link to discrete mathematics 8th edition textbook?

Thu, 13 Feb 2020 00:22:28 +0000

How to install Matplotlib

Sun, 02 Feb 2020 21:30:41 +0000

Hi Everyone,

How do I install Matplotlib for Python IDLE 3.7.4?

I'm a Data Science student and I want to simulate the Birthday Paradox, which requires me to import matplotlib.pyplot.

I'm using a Mac, and in the terminal I tried to run:

(base) MacBook-Pro:~ nasir$ sudo apt install python3-matplotlib

Password: **************

I then received the below error:

Unable to locate an executable at "/Library/Java/JavaVirtualMachines/jdk-11.0.1.jdk/Contents/Home/bin/apt" (-1)

(base) MacBook-Pro:~ nasir$

Understanding symbolic language of problem, quantificational logic

Sun, 26 Jan 2020 11:18:05 +0000

Hi, I am having trouble interpreting the information contained in the relation R, and how it should be applied to the Ps in this problem:

Consider the formula

∃x∃y∃z(P(x,y)∧P(z,y)∧P(x,z)∧¬P(z,x))

Under each of these interpretations, is this formula true? In each case, R is the relation corresponding to P.

(a) U = N, R = {<x,y> : x<y}.

(b) U = N, R = {<x,x+1> : x≥0}.

Does <x,y> refer to the variables x,y or z in each P(a,b), and the :x<y refer to what the relation between these two should be?

I tried something like this for (a) and got:

∃x∃y∃z((x<y)∧(z<y)∧(x<z)∧¬(z<x))

However I'm not sure if this is correct, and I'm not sure how I would do it for (b)
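On the reading that P(a,b) is true exactly when the pair <a,b> belongs to R, the x and y inside the set-builder notation are just placeholders for the first and second argument, so the translation for (a) looks right. A small brute-force search can then look for witnesses. This is a sketch over a finite slice of N (the cut-off of 20 and the helper name are my assumptions), so finding a witness shows the formula is true, while finding none is only suggestive:

```python
from itertools import product

def find_witness(P, universe):
    """Search for (x, y, z) with P(x,y), P(z,y), P(x,z) and not P(z,x)."""
    for x, y, z in product(universe, repeat=3):
        if P(x, y) and P(z, y) and P(x, z) and not P(z, x):
            return (x, y, z)
    return None

N_PREFIX = range(20)  # finite slice of the natural numbers

witness_a = find_witness(lambda a, b: a < b, N_PREFIX)       # R = {<x,y> : x < y}
witness_b = find_witness(lambda a, b: b == a + 1, N_PREFIX)  # R = {<x,x+1> : x >= 0}
print("(a)", witness_a)
print("(b)", witness_b)
```

For (b) the absence of a witness can be argued directly: P(x,y) and P(z,y) both force y = x+1 = z+1, so x = z, but P(x,z) requires z = x+1, which is a contradiction. So the formula is false under (b).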

What are the values stored in tc8 format for the following numbers?

Thu, 23 Jan 2020 19:27:01 +0000

Q1) Calculate tc8 of -117, -127, 127, 0

Q2) What values are stored in the following tc8 registers:

10010001
00010001
01111111
0000000
11111111
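Assuming "tc8" means 8-bit two's complement, the conversion in both directions can be sketched as follows (the function names are mine). The 7-digit entry in Q2 is left out here because it is not a full 8-bit pattern as written:

```python
def to_tc8(value):
    """Encode an integer in [-128, 127] as an 8-bit two's complement string."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in 8 bits")
    return format(value & 0xFF, "08b")

def from_tc8(bits):
    """Decode an 8-bit two's complement string to an integer."""
    if len(bits) != 8:
        raise ValueError("expected exactly 8 bits")
    raw = int(bits, 2)
    # A leading 1 means the value is negative: subtract 2**8.
    return raw - 256 if raw >= 128 else raw

for v in (-117, -127, 127, 0):
    print(v, "->", to_tc8(v))
for b in ("10010001", "00010001", "01111111", "11111111"):
    print(b, "->", from_tc8(b))
```

By hand, a negative number is encoded by writing its absolute value in binary, inverting all eight bits, and adding 1.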

How to calculate $g[h(3)]$ if $g(x)= 2x+3$ and $h(x) =4x+5$ ?

Fri, 06 Dec 2019 00:47:26 +0000

Need help solving this question
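Composition works from the inside out: first evaluate $h(3)$, then feed the result into $g$. A quick check in Python, using the two functions exactly as given:

```python
def g(x):
    return 2 * x + 3

def h(x):
    return 4 * x + 5

inner = h(3)      # 4*3 + 5
outer = g(inner)  # 2*17 + 3
print(inner, outer)
```

So $h(3) = 17$ and $g[h(3)] = g(17) = 37$.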

How do I find the inverse of the function?

Thu, 05 Dec 2019 23:10:55 +0000

If $h(x) = 4x+5$, how do I find $h^{-1}(x)$?
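The usual approach is to set $y = h(x)$, solve for $x$, and then rename the variable. A short worked derivation:

```latex
\begin{align*}
y &= 4x + 5 \\
y - 5 &= 4x \\
x &= \frac{y - 5}{4} \\
\text{so}\quad h^{-1}(x) &= \frac{x - 5}{4}
\end{align*}
```

As a check, $h(h^{-1}(x)) = 4 \cdot \frac{x - 5}{4} + 5 = x$.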

Can anyone please solve Q7 pg 329 in Induction?

Thu, 05 Dec 2019 22:02:15 +0000

Individual and group relative strength in a fixed pool of players: How to approach the problem?

Tue, 29 Oct 2019 20:00:28 +0000

I apologize in advance if my question sounds too basic to be worthy of anyone's time, but statistics are not part of my curriculum.

I am developing a proof of concept of a web application modeling the contribution of individual soccer players with respect to the different teams they've played with throughout their careers. In particular, I am looking into a way of ranking both individuals and groups of players as follows:

teammates relative strength: the best/worst combinations of players when playing in the same team in the same matches;
opponents relative strength: the best/worst combinations of players when playing in opposite teams in the same matches, i.e. which tuples of teammates are the best/worst against which;

I must admit I don't quite know how to approach the problem (as I said I have no formal education in statistics or data science). I would be very grateful if anyone could give me some directions. How should I frame this particular problem and what resources in statistics or machine learning (if indeed this is a task fit for machine learning, perhaps I am mistaken on this) would be appropriate to tackle it?

I am eager to learn, so both practical examples or theoretical references (book chapters, online articles, etc) would be very welcome.

Thanks in advance!

ideas and opinion on what kind of analyses needs to be done

Fri, 26 Jul 2019 18:38:55 +0000

Hello Guys,

I have been given a task where I need to analyse the impact on customers and sales of all the products that were discontinued last year. I work for a retail company that has hundreds of stores in North America, and my analysis needs to be "qualitatively better than the regular BI".

I quote that because that's what I was asked to do. I am struggling to come up with ideas. What kind of hypotheses should I test? Are there any Bayesian analyses that could be done? Any ideas within the realm of data science and machine learning are welcome.

I generally do my analyses in Jupyter Notebook, and my skill set consists of SQL and Python, including ML libraries like sklearn.

What are the most important Python libraries for data science?

Mon, 08 Jul 2019 04:42:28 +0000

Using aggregate data to generate observation-level data statistically sound?

Tue, 11 Jun 2019 22:04:01 +0000

Context: In the realm of Paid Search Marketing. Current reporting does not provide event level data only aggregate totals with different segments. Want to compare distributions/test statistical significance of A/B test results. Did not want to assume that data followed normal distribution or know STDEV for data so came with this approach.

My Question: I am going to use the average "CPA" or "CTR" for a date range, and generate an observation for each conversion based off the average for a time range. Is this statistically sound way if I want to generate raw data? Would I have wonky distributions because of the multiple averages? Just want a gutcheck if I'm completely off base.

My Aggregate data looks like below:

Day	Cost	Acquisition	CPA or CTR
1	40	2	$20
2	75	3	$25

Observation data I generate looks like below:

Day	Acquisition
1	$20
1	$20
2	$25
2	$25
2	$25

I really appreciate your help with this question! An important project to me at work.

What are the available libraries for continuous time hidden markov models ?

Fri, 07 Jun 2019 13:27:05 +0000

How to reshape in pandas dataframe?

Fri, 05 Apr 2019 13:41:30 +0000

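I have a dataframe with the columns 날짜 (date), 역번호 (station number), 역명 (station name), 구분 (category), and value columns a through t. I want to collapse columns a~t into a single column, as in reshape(a~t, 1).

I want to reshape the dataframe as shown below (the values of columns b~t go under column a):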

날짜 역번호 역명 구분 a

2018-01-01 150 서울역 승차 379

2018-01-01 150 서울역 승차 287

2018-01-01 150 서울역 승차 371

2018-01-01 150 서울역 승차 876

2018-01-01 150 서울역 승차 965

....

2008-01-01 152 종각 승차 2920

2008-01-01 152 종각 승차 2290

2008-01-01 152 종각 승차 802

2008-01-01 152 종각 승차 1559

like df = df.reshape(len(data2)*a~t, 1)

I tried pd.melt, but it does not work well.

df2 = pd.melt(df, id_vars=["날짜", "역번호", "역명", "구분"], value_name="t")

This removes b~t, but I want the b~t values inserted below a.

The dataset is at https://drive.google.com/file/d/1Upb5PgymkPB5TXuta_sg6SijwzUuEkfl/view?usp=sharing
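One way to get the row-by-row layout is to melt while keeping the original row index and then do a stable sort on that index, so each row's values stay together in column order. A minimal sketch, assuming a small stand-in frame with only three value columns instead of a~t (the sample rows and the temporary column names `source_col` and `value` are made up):

```python
import pandas as pd

id_cols = ["날짜", "역번호", "역명", "구분"]

# Hypothetical stand-in: columns a, b, c play the role of a~t.
df = pd.DataFrame({
    "날짜": ["2018-01-01", "2018-01-01"],
    "역번호": [150, 152],
    "역명": ["서울역", "종각"],
    "구분": ["승차", "승차"],
    "a": [379, 2920],
    "b": [287, 2290],
    "c": [371, 802],
})

long_df = (
    df.melt(id_vars=id_cols, var_name="source_col", value_name="value",
            ignore_index=False)
      .sort_index(kind="mergesort")    # stable: keeps a, b, c order per row
      .drop(columns="source_col")
      .rename(columns={"value": "a"})  # final single value column named a
      .reset_index(drop=True)
)
print(long_df)
```

The temporary `value` name is used because recent pandas versions reject a `value_name` that matches an existing column, which may also explain why `value_name="t"` misbehaved in the attempt above. `ignore_index` requires pandas 1.1 or later.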

Is digital marketing and marketing internships worth it for a data science student?

Thu, 04 Apr 2019 22:58:54 +0000

it's about data science career

How to open Jupyter notebook files on Windows or Mac without web browser?

Sat, 16 Mar 2019 23:36:01 +0000

Passing variable length sentences to Tensorflow LSTM

Mon, 11 Feb 2019 05:06:27 +0000

I have a TensorFlow LSTM model for predicting sentiment. I built the model with a maximum sequence length of 150 (the maximum number of words). To make predictions, I have written the code below:

import numpy as np

batchSize = 32
maxSeqLength = 150

def getSentenceMatrix(sentence):
    # cleanSentences and wordsList are defined elsewhere in the project.
    # Only row 0 of the batch is filled; the remaining rows stay zero.
    sentenceMatrix = np.zeros([batchSize, maxSeqLength], dtype='int32')
    cleanedSentence = cleanSentences(sentence)
    # Truncate to the first 150 words; anything beyond that is dropped.
    cleanedSentence = ' '.join(cleanedSentence.split()[:150])
    split = cleanedSentence.split()
    for indexCounter, word in enumerate(split):
        try:
            sentenceMatrix[0, indexCounter] = wordsList.index(word)
        except ValueError:
            sentenceMatrix[0, indexCounter] = 399999  # Vector for unknown words
    return sentenceMatrix

input_text = "example data"
inputMatrix = getSentenceMatrix(input_text)

In the code I'm truncating my input text to 150 words and ignoring the remaining data. Because of this, my predictions are wrong.

cleanedSentence = ' '.join(cleanedSentence.split()[:150])

I know that if the input is shorter than the sequence length, we can pad with zeros. What should we do if it is longer? Can you suggest the best way to handle this? Thanks in advance.
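One option that avoids retraining is to split a long text into consecutive windows of at most 150 words, score each window with the existing model, and combine the scores, for example by averaging. A minimal sketch of the chunking logic, assuming a hypothetical `predict_chunk` callable that wraps the model (the toy scorer below is only a stand-in so the example runs):

```python
def chunk_tokens(tokens, max_len=150, stride=150):
    """Split a token list into windows of at most `max_len` tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if not tokens:
        return [[]]
    return [tokens[i:i + max_len] for i in range(0, len(tokens), stride)]

def predict_long_text(tokens, predict_chunk, max_len=150):
    """Average a per-chunk score over all chunks of a long text."""
    chunks = chunk_tokens(tokens, max_len=max_len, stride=max_len)
    scores = [predict_chunk(chunk) for chunk in chunks]
    return sum(scores) / len(scores)

# Toy stand-in for the model: fraction of tokens equal to "good".
def toy_model(chunk):
    return sum(t == "good" for t in chunk) / max(len(chunk), 1)

tokens = ["good"] * 200 + ["bad"] * 200
print([len(c) for c in chunk_tokens(tokens)])
print(predict_long_text(tokens, toy_model))
```

Averaging treats every window equally. Weighting by window length, or training with a longer maximum sequence length, are alternatives worth comparing on your own data.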

What is the easiest way to distinguish whether to use a z value test or a t value test?

Tue, 18 Dec 2018 04:33:27 +0000

nonpooled independent samples t-interval method

Fri, 07 Dec 2018 23:06:04 +0000

Computing the Confidence Interval for a Difference Between Two Means

Fri, 07 Dec 2018 23:02:45 +0000

How to find the strength of a P-value against a null hypothesis?

Fri, 07 Dec 2018 22:31:33 +0000

What is degree of Freedom while calculating confidence interval?

Fri, 07 Dec 2018 22:27:27 +0000