<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Ask Ghassem - Recent questions tagged data</title>
<link>https://ask.ghassem.com/tag/data</link>
<description>Powered by Question2Answer</description>
<item>
<title>How do I know which encoder to use to convert from categorical variables to numerical?</title>
<link>https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</link>
<description>Say I have a column with categorical data, like different levels of temperature: &amp;#039;Lukewarm&amp;#039;, &amp;#039;Hot&amp;#039;, &amp;#039;Scalding&amp;#039;, &amp;#039;Cold&amp;#039;, &amp;#039;Frostbite&amp;#039;, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I know that we can use pd.get_dummies to convert the column to numerical data within the dataframe, but I also know there are other &amp;#039;converters&amp;#039; (not sure if that&amp;#039;s the correct terminology), e.g. OneHotEncoder from scikit-learn, which I could put in a Pipeline and feed my dataframe through so that the categorical data gets encoded to numerical as well.&lt;br /&gt;
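To show what I mean, here is a toy comparison of the two routes (the data is made up by me):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up example column
df = pd.DataFrame({"temp": ["Lukewarm", "Hot", "Cold"]})

# pandas route: one indicator column per category
dummies = pd.get_dummies(df["temp"])

# scikit-learn route: an encoder object that remembers the categories
# it saw at fit time, so it can be reused inside a Pipeline
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df[["temp"]])  # sparse matrix, 3 rows x 3 categories
```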
&lt;br /&gt;
&lt;br /&gt;
How do I know which one to use? Does it matter? If it does, when does it matter most (i.e. for what types of problems? When there are many categorical variables, or few?) If anyone can give me pointers on this, I&amp;#039;d greatly appreciate it.</description>
<category>Exploratory Data Analysis</category>
<guid isPermaLink="true">https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</guid>
<pubDate>Mon, 29 Nov 2021 04:09:06 +0000</pubDate>
</item>
<item>
<title>ValueError: Length mismatch: Expected axis has 60 elements, new values have 2935849 elements</title>
<link>https://ask.ghassem.com/1005/valueerror-length-mismatch-expected-elements-2935849-elements</link>
<description>&lt;p&gt;I&#039;m creating a new data frame with the most-used items grouped together, but I get the following error when grouping by ID and items: ValueError: Length mismatch: Expected axis has 60 elements, new values have 2935849 elements.&lt;/p&gt;

&lt;pre class=&quot;prettyprint lang-python&quot; data-pbcklang=&quot;python&quot; data-pbcktabsize=&quot;4&quot;&gt;
df = sales_df[sales_df[&#039;shop_id&#039;].duplicated(keep=False)]
df[&#039;Grouped&#039;] = sales_df.groupby(&#039;shop_id&#039;)[&#039;item_name&#039;].transform(lambda x: &#039;,&#039;.join(x))
df2 = df[[&#039;shop_id&#039;, &#039;Grouped&#039;]].drop_duplicates()&lt;/pre&gt;
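A self-contained toy version of what I am attempting (note: here I group the filtered df rather than sales_df, and take an explicit .copy() to avoid the chained-assignment warning; I am not sure whether that relates to the error):

```python
import pandas as pd

# Toy stand-in for my real sales data
sales_df = pd.DataFrame({
    "shop_id": [1, 1, 2],
    "item_name": ["apple", "banana", "cherry"],
})

# Keep only shops that appear more than once, on an explicit copy
df = sales_df[sales_df["shop_id"].duplicated(keep=False)].copy()
df["Grouped"] = df.groupby("shop_id")["item_name"].transform(lambda x: ",".join(x))
df2 = df[["shop_id", "Grouped"]].drop_duplicates()
```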

&lt;p&gt;In the code above, I&#039;m building a data frame restricted to duplicated shop IDs and then grouping the shop items. My objective is to group together the items that share the same shop ID.&lt;/p&gt;</description>
<category>Exploratory Data Analysis</category>
<guid isPermaLink="true">https://ask.ghassem.com/1005/valueerror-length-mismatch-expected-elements-2935849-elements</guid>
<pubDate>Fri, 26 Nov 2021 06:09:16 +0000</pubDate>
</item>
<item>
<title>Classification of data object might be incorrect</title>
<link>https://ask.ghassem.com/1003/classification-of-data-object-might-be-incorrect</link>
<description>&lt;p&gt;I am learning a new Salesforce product (Evergage) for the company I work for. In the product&#039;s documentation they list a set of data objects as an example, and it appears to me that the classification might be incorrect. Their system distinguishes between &#039;catalog objects&#039; and &#039;profile objects&#039;, and the example they give is a banking institution. They classified &lt;em&gt;Customer Credit Card&lt;/em&gt; as a &lt;em&gt;profile object&lt;/em&gt; and &lt;em&gt;Credit Card Level&lt;/em&gt; as a &lt;em&gt;catalog object&lt;/em&gt;. It seems to me that it should be the other way around, i.e. &lt;em&gt;Customer Credit Card&lt;/em&gt; = &lt;em&gt;catalog object&lt;/em&gt; and &lt;em&gt;Credit Card Level&lt;/em&gt; = &lt;em&gt;profile object&lt;/em&gt;. Maybe I am not reading the context correctly?&lt;/p&gt;

&lt;p&gt;Here is a link to an image with the complete classification: &lt;a rel=&quot;nofollow&quot; href=&quot;https://drive.google.com/file/d/1nG4aX4Ty_NoHxm04AQo1Ow61m3MZ3pXm/view?usp=sharing&quot;&gt;https://drive.google.com/file/d/1nG4aX4Ty_NoHxm04AQo1Ow61m3MZ3pXm/view?usp=sharing&lt;/a&gt;&lt;/p&gt;</description>
<category>General</category>
<guid isPermaLink="true">https://ask.ghassem.com/1003/classification-of-data-object-might-be-incorrect</guid>
<pubDate>Mon, 25 Oct 2021 15:26:46 +0000</pubDate>
</item>
<item>
<title>How many samples do we need to test image segmentation using synthetic data?</title>
<link>https://ask.ghassem.com/993/many-samples-need-test-image-segmentation-using-synthetic</link>
<description>Hello,&lt;br /&gt;
&lt;br /&gt;
I trained a CNN on synthetic data to perform a segmentation task on human faces. At test time, to evaluate the network&amp;#039;s predictions, I used 200 examples from the database to compute precision and recall.&lt;br /&gt;
&lt;br /&gt;
Is this number sufficient, given that I control the data generator myself and build the database by randomly drawing elements from centered Gaussian distributions?&lt;br /&gt;
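My back-of-envelope way of framing it: the half-width of a normal-approximation 95% interval around a measured precision at n = 200 (the 0.9 is a hypothetical value, not my actual result):

```python
import math

n = 200
p = 0.9  # hypothetical measured precision
# 95% normal-approximation half-width: 1.96 * sqrt(p(1-p)/n)
half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # roughly 0.042 here
```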
&lt;br /&gt;
&lt;br /&gt;
Thank you,</description>
<category>Deep Learning</category>
<guid isPermaLink="true">https://ask.ghassem.com/993/many-samples-need-test-image-segmentation-using-synthetic</guid>
<pubDate>Mon, 21 Jun 2021 12:26:32 +0000</pubDate>
</item>
<item>
<title>How best to ensure data quality?</title>
<link>https://ask.ghassem.com/990/how-best-to-ensure-data-quality</link>
<description></description>
<category>Data Science</category>
<guid isPermaLink="true">https://ask.ghassem.com/990/how-best-to-ensure-data-quality</guid>
<pubDate>Tue, 08 Jun 2021 22:02:23 +0000</pubDate>
</item>
<item>
<title>How to calculate average with deviating sensors?</title>
<link>https://ask.ghassem.com/983/how-to-calculate-average-with-deviating-sensors</link>
<description>In case of 3 sensors reporting loads of values individually.. one sensor might be off. The average of the 2 trustworthy sensors is to be reported.. the third in need for recalibration is to be neglected. I&amp;#039;m in need of an (excel) formula looking at three columns which row-by-row detects a significant deviation compared to the others and calculate the average of the most trustworthy.&lt;br /&gt;
Example:&lt;br /&gt;
48.1 ; 45.2 ; 45.4 =&amp;gt; 45.3, as sensor 1 is way off&lt;br /&gt;
36.0 ; 37.0 ; 45.0 =&amp;gt; 36.5, as sensor 3 is way off&lt;br /&gt;
36.0 ; 36.5 ; 37.0 =&amp;gt; 36.5, as the deviation is too small to be considered an anomaly, so all values are valid for the average.&lt;br /&gt;
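The row-wise logic I have in mind, sketched in Python first (I would port it to an Excel formula afterwards; the threshold of 2.0 is my guess, not something I have validated):

```python
def robust_avg(readings, tol=2.0):
    # For each sensor, measure how far it sits from the mean of the
    # other two; drop the single worst sensor only if its deviation
    # exceeds the tolerance, otherwise average all three.
    devs = []
    for i, r in enumerate(readings):
        others = [x for j, x in enumerate(readings) if j != i]
        devs.append(abs(r - sum(others) / len(others)))
    worst = max(range(len(readings)), key=lambda i: devs[i])
    if devs[worst] > tol:
        kept = [x for j, x in enumerate(readings) if j != worst]
    else:
        kept = list(readings)
    return sum(kept) / len(kept)
```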
&lt;br /&gt;
Working over long periods of time, the readings might be trustworthy for a few weeks but defective from some moment X up until now, so simply ruling out one sensor permanently is not really an option either. What is the best way forward?&lt;br /&gt;
Any help is highly appreciated.</description>
<category>Data Science</category>
<guid isPermaLink="true">https://ask.ghassem.com/983/how-to-calculate-average-with-deviating-sensors</guid>
<pubDate>Tue, 04 May 2021 14:39:14 +0000</pubDate>
</item>
<item>
<title>Do I need to save the standardization transformation?</title>
<link>https://ask.ghassem.com/970/do-i-need-to-save-the-standardization-transformation</link>
<description>I standardized my data when I created my model. Do I need to save the standardization transformation in order to predict on new data with my model?</description>
<category>Machine Learning</category>
<guid isPermaLink="true">https://ask.ghassem.com/970/do-i-need-to-save-the-standardization-transformation</guid>
<pubDate>Tue, 15 Dec 2020 13:06:48 +0000</pubDate>
</item>
<item>
<title>Is it possible to derive a new 95% CI from two separate 95% CIs?</title>
<link>https://ask.ghassem.com/961/is-it-possible-to-derive-a-new-95-ci-from-two-separate-95-cis</link>
<description></description>
<category>Statistics</category>
<guid isPermaLink="true">https://ask.ghassem.com/961/is-it-possible-to-derive-a-new-95-ci-from-two-separate-95-cis</guid>
<pubDate>Mon, 23 Nov 2020 14:45:19 +0000</pubDate>
</item>
<item>
<title>How to predict from unseen data?</title>
<link>https://ask.ghassem.com/954/how-to-predict-from-unseen-data</link>
<description>&lt;p&gt;Hi. I have a question about model-based predictions when the feature data is only available after the fact. Let me give an example: I try to predict the result (HOME, AWAY, or DRAW) of a match based on data like number of shots, ball possession, number of fouls, etc.&lt;/p&gt;

&lt;table border=&quot;1&quot; cellpadding=&quot;1&quot; style=&quot;width:500px&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th scope=&quot;col&quot;&gt;TARGET&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;TEAM 1&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;TEAM 2&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;possession&lt;/p&gt;

&lt;p&gt;team 1&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;possession&lt;/p&gt;

&lt;p&gt;team 2&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;shots&lt;/p&gt;

&lt;p&gt;team 1&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;shots&lt;/p&gt;

&lt;p&gt;team 2&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;fouls&lt;/p&gt;

&lt;p&gt;team 1&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;fouls&lt;/p&gt;

&lt;p&gt;team 2&lt;/p&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HOME&lt;/td&gt;
&lt;td&gt;Arsenal&lt;/td&gt;
&lt;td&gt;Chelsea&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Let&#039;s say I have already trained the model and I want to see whether I can predict an upcoming match. However, this match is still a few days away and I want the model&#039;s output today. I understand that if the match had already taken place and I had the data, I could run it through the model and get a result. The goal is for the model to predict what will happen before the match.&lt;/p&gt;
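Concretely, the aggregation idea I have in mind looks something like this (toy numbers; the column names are my invention):

```python
import pandas as pd

# Toy match history for one team
matches = pd.DataFrame({
    "team": ["Arsenal", "Arsenal", "Arsenal", "Arsenal"],
    "possession": [60, 55, 48, 62],
})

# Rolling mean over the previous 3 matches, shifted by one so each
# row only uses information available before that match kicked off.
matches["avg_possession_pre"] = (
    matches.groupby("team")["possession"]
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
```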

&lt;p&gt;Is it possible at all? What are my options? Should I select only pre-match variables, for example recent form, the match referee, etc., or should I aggregate the in-match variables and include average possession, average shots, and average number of fouls from recent matches?&lt;/p&gt;</description>
<category>Machine Learning</category>
<guid isPermaLink="true">https://ask.ghassem.com/954/how-to-predict-from-unseen-data</guid>
<pubDate>Tue, 17 Nov 2020 16:18:28 +0000</pubDate>
</item>
<item>
<title>What are the most common data types in data science?</title>
<link>https://ask.ghassem.com/834/what-are-the-most-common-data-types-in-data-science</link>
<description>What are the main data types?</description>
<category>General</category>
<guid isPermaLink="true">https://ask.ghassem.com/834/what-are-the-most-common-data-types-in-data-science</guid>
<pubDate>Wed, 19 Feb 2020 17:28:54 +0000</pubDate>
</item>
</channel>
</rss>