<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Ask Ghassem - Recent questions tagged data</title>
<link>https://ask.ghassem.com/tag/data</link>
<description>Powered by Question2Answer</description>
<item>
<title>How do I know which encoder to use to convert from categorical variables to numerical?</title>
<link>https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</link>
<description>Say I have a column with categorical data, like different levels of temperature: &amp;#039;Lukewarm&amp;#039;, &amp;#039;Hot&amp;#039;, &amp;#039;Scalding&amp;#039;, &amp;#039;Cold&amp;#039;, &amp;#039;Frostbite&amp;#039;, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I know that we can use pd.get_dummies to convert the column to numerical data within the dataframe, but I also know there are other &amp;#039;converters&amp;#039; (not sure if that&amp;#039;s the correct terminology), e.g. OneHotEncoder from scikit-learn, which I could put in a Pipeline and feed my dataframe through so that the categorical data gets encoded to numerical as well.&lt;br /&gt;
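To show what I mean, here is a toy comparison of the two routes (the data is made up by me):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up example column
df = pd.DataFrame({"temp": ["Lukewarm", "Hot", "Cold"]})

# pandas route: one indicator column per category
dummies = pd.get_dummies(df["temp"])

# scikit-learn route: an encoder object that remembers the categories
# it saw at fit time, so it can be reused inside a Pipeline
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df[["temp"]])  # sparse matrix, 3 rows x 3 categories
```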
&lt;br /&gt;
&lt;br /&gt;
How do I know which one to use? Does it matter? If it does, when does it matter most (i.e. for what types of problems? When there are many categorical variables, or few?) If anyone can give me pointers on this, I&amp;#039;d greatly appreciate it.</description>
<category>Exploratory Data Analysis</category>
<guid isPermaLink="true">https://ask.ghassem.com/1006/know-which-encoder-convert-categorical-variables-numerical</guid>
<pubDate>Mon, 29 Nov 2021 04:09:06 +0000</pubDate>
</item>
<item>
<title>ValueError: Length mismatch: Expected axis has 60 elements, new values have 2935849 elements</title>
<link>https://ask.ghassem.com/1005/valueerror-length-mismatch-expected-elements-2935849-elements</link>
<description>&lt;p&gt;I&#039;m creating a new data frame with the most-used items grouped together, but I get the following error when grouping by ID and items: ValueError: Length mismatch: Expected axis has 60 elements, new values have 2935849 elements.&lt;/p&gt;

&lt;pre class=&quot;prettyprint lang-python&quot; data-pbcklang=&quot;python&quot; data-pbcktabsize=&quot;4&quot;&gt;
df = sales_df[sales_df[&#039;shop_id&#039;].duplicated(keep=False)]
df[&#039;Grouped&#039;] = sales_df.groupby(&#039;shop_id&#039;)[&#039;item_name&#039;].transform(lambda x: &#039;,&#039;.join(x))
df2 = df[[&#039;shop_id&#039;, &#039;Grouped&#039;]].drop_duplicates()&lt;/pre&gt;
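A self-contained toy version of what I am attempting (note: here I group the filtered df rather than sales_df, and take an explicit .copy() to avoid the chained-assignment warning; I am not sure whether that relates to the error):

```python
import pandas as pd

# Toy stand-in for my real sales data
sales_df = pd.DataFrame({
    "shop_id": [1, 1, 2],
    "item_name": ["apple", "banana", "cherry"],
})

# Keep only shops that appear more than once, on an explicit copy
df = sales_df[sales_df["shop_id"].duplicated(keep=False)].copy()
df["Grouped"] = df.groupby("shop_id")["item_name"].transform(lambda x: ",".join(x))
df2 = df[["shop_id", "Grouped"]].drop_duplicates()
```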

&lt;p&gt;In the code above, I&#039;m building a data frame restricted to duplicated shop IDs and then grouping the shop items. My objective is to group together the items that share the same shop ID.&lt;/p&gt;</description>
<category>Exploratory Data Analysis</category>
<guid isPermaLink="true">https://ask.ghassem.com/1005/valueerror-length-mismatch-expected-elements-2935849-elements</guid>
<pubDate>Fri, 26 Nov 2021 06:09:16 +0000</pubDate>
</item>
<item>
<title>Classification of data object might be incorrect</title>
<link>https://ask.ghassem.com/1003/classification-of-data-object-might-be-incorrect</link>
<description>&lt;p&gt;I am learning a new Salesforce product (Evergage) for the company I work for. In the product&#039;s documentation they list a set of data objects as an example, and it appears to me that the classification might be incorrect. Their system distinguishes between &#039;catalog objects&#039; and &#039;profile objects&#039;, and the example they give is a banking institution. They classified &lt;em&gt;Customer Credit Card&lt;/em&gt; as a &lt;em&gt;profile object&lt;/em&gt; and &lt;em&gt;Credit Card Level&lt;/em&gt; as a &lt;em&gt;catalog object&lt;/em&gt;. It seems to me that it should be the other way around, i.e. &lt;em&gt;Customer Credit Card&lt;/em&gt; = &lt;em&gt;catalog object&lt;/em&gt; and &lt;em&gt;Credit Card Level&lt;/em&gt; = &lt;em&gt;profile object&lt;/em&gt;. Maybe I am not reading the context correctly?&lt;/p&gt;

&lt;p&gt;Here is a link to an image with the complete classification: &lt;a rel=&quot;nofollow&quot; href=&quot;https://drive.google.com/file/d/1nG4aX4Ty_NoHxm04AQo1Ow61m3MZ3pXm/view?usp=sharing&quot;&gt;https://drive.google.com/file/d/1nG4aX4Ty_NoHxm04AQo1Ow61m3MZ3pXm/view?usp=sharing&lt;/a&gt;&lt;/p&gt;</description>
<category>General</category>
<guid isPermaLink="true">https://ask.ghassem.com/1003/classification-of-data-object-might-be-incorrect</guid>
<pubDate>Mon, 25 Oct 2021 15:26:46 +0000</pubDate>
</item>
<item>
<title>How many samples do we need to test image segmentation using synthetic data?</title>
<link>https://ask.ghassem.com/993/many-samples-need-test-image-segmentation-using-synthetic</link>
<description>Hello,&lt;br /&gt;
&lt;br /&gt;
I trained a CNN on synthetic data to perform a segmentation task on human faces. At test time, to evaluate the network&amp;#039;s predictions, I used 200 examples from the database to compute precision and recall.&lt;br /&gt;
&lt;br /&gt;
Is this number sufficient, given that I control the data generator myself and build the database by randomly drawing elements from centered Gaussian distributions?&lt;br /&gt;
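My back-of-envelope way of framing it: the half-width of a normal-approximation 95% interval around a measured precision at n = 200 (the 0.9 is a hypothetical value, not my actual result):

```python
import math

n = 200
p = 0.9  # hypothetical measured precision
# 95% normal-approximation half-width: 1.96 * sqrt(p(1-p)/n)
half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # roughly 0.042 here
```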
&lt;br /&gt;
&lt;br /&gt;
Thank you,</description>
<category>Deep Learning</category>
<guid isPermaLink="true">https://ask.ghassem.com/993/many-samples-need-test-image-segmentation-using-synthetic</guid>
<pubDate>Mon, 21 Jun 2021 12:26:32 +0000</pubDate>
</item>
<item>
<title>How best to ensure data quality?</title>
<link>https://ask.ghassem.com/990/how-best-to-ensure-data-quality</link>
<description></description>
<category>Data Science</category>
<guid isPermaLink="true">https://ask.ghassem.com/990/how-best-to-ensure-data-quality</guid>
<pubDate>Tue, 08 Jun 2021 22:02:23 +0000</pubDate>
</item>
<item>
<title>How to calculate average with deviating sensors?</title>
<link>https://ask.ghassem.com/983/how-to-calculate-average-with-deviating-sensors</link>
<description>In case of 3 sensors reporting loads of values individually.. one sensor might be off. The average of the 2 trustworthy sensors is to be reported.. the third in need for recalibration is to be neglected. I&amp;#039;m in need of an (excel) formula looking at three columns which row-by-row detects a significant deviation compared to the others and calculate the average of the most trustworthy.&lt;br /&gt;
Example:&lt;br /&gt;
48.1 ; 45.2 ; 45.4 =&amp;gt; 45.3, as sensor 1 is way off&lt;br /&gt;
36.0 ; 37.0 ; 45.0 =&amp;gt; 36.5, as sensor 3 is way off&lt;br /&gt;
36.0 ; 36.5 ; 37.0 =&amp;gt; 36.5, as the deviation is too small to be considered an anomaly, so all values are valid for the average.&lt;br /&gt;
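The row-wise logic I have in mind, sketched in Python first (I would port it to an Excel formula afterwards; the threshold of 2.0 is my guess, not something I have validated):

```python
def robust_avg(readings, tol=2.0):
    # For each sensor, measure how far it sits from the mean of the
    # other two; drop the single worst sensor only if its deviation
    # exceeds the tolerance, otherwise average all three.
    devs = []
    for i, r in enumerate(readings):
        others = [x for j, x in enumerate(readings) if j != i]
        devs.append(abs(r - sum(others) / len(others)))
    worst = max(range(len(readings)), key=lambda i: devs[i])
    if devs[worst] > tol:
        kept = [x for j, x in enumerate(readings) if j != worst]
    else:
        kept = list(readings)
    return sum(kept) / len(kept)
```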
&lt;br /&gt;
Working over long periods of time, the readings might be trustworthy for a few weeks but defective from some moment X up until now, so simply ruling out one sensor permanently is not really an option either. What is the best way forward?&lt;br /&gt;
Any help is highly appreciated.</description>
<category>Data Science</category>
<guid isPermaLink="true">https://ask.ghassem.com/983/how-to-calculate-average-with-deviating-sensors</guid>
<pubDate>Tue, 04 May 2021 14:39:14 +0000</pubDate>
</item>
<item>
<title>Do I need to save the standardization transformation?</title>
<link>https://ask.ghassem.com/970/do-i-need-to-save-the-standardization-transformation</link>
<description>I standardized my data when I created my model. Do I need to save the standardization transformation in order to predict on new data with my model?</description>
<category>Machine Learning</category>
<guid isPermaLink="true">https://ask.ghassem.com/970/do-i-need-to-save-the-standardization-transformation</guid>
<pubDate>Tue, 15 Dec 2020 13:06:48 +0000</pubDate>
</item>
<item>
<title>Is it possible to derive a new 95% CI from two separate 95% CIs?</title>
<link>https://ask.ghassem.com/961/is-it-possible-to-derive-a-new-95-ci-from-two-separate-95-cis</link>
<description></description>
<category>Statistics</category>
<guid isPermaLink="true">https://ask.ghassem.com/961/is-it-possible-to-derive-a-new-95-ci-from-two-separate-95-cis</guid>
<pubDate>Mon, 23 Nov 2020 14:45:19 +0000</pubDate>
</item>
<item>
<title>How to predict from unseen data?</title>
<link>https://ask.ghassem.com/954/how-to-predict-from-unseen-data</link>
<description>&lt;p&gt;Hi. I have a question about model-based predictions when the feature data is only available after the fact. Let me give an example: I try to predict the result (HOME, AWAY, or DRAW) of a match based on data like number of shots, ball possession, number of fouls, etc.&lt;/p&gt;

&lt;table border=&quot;1&quot; cellpadding=&quot;1&quot; style=&quot;width:500px&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th scope=&quot;col&quot;&gt;TARGET&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;TEAM 1&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;TEAM 2&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;possession&lt;/p&gt;

&lt;p&gt;team 1&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;possession&lt;/p&gt;

&lt;p&gt;team 2&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;shots&lt;/p&gt;

&lt;p&gt;team 1&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;shots&lt;/p&gt;

&lt;p&gt;team 2&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;fouls&lt;/p&gt;

&lt;p&gt;team 1&lt;/p&gt;
&lt;/th&gt;
&lt;th scope=&quot;col&quot;&gt;
&lt;p&gt;fouls&lt;/p&gt;

&lt;p&gt;team 2&lt;/p&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HOME&lt;/td&gt;
&lt;td&gt;Arsenal&lt;/td&gt;
&lt;td&gt;Chelsea&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Let&#039;s say I have already trained the model and I want to see whether I can predict an upcoming match. However, this match is still a few days away and I want the model&#039;s output today. I understand that if the match had already taken place and I had the data, I could run it through the model and get a result. The goal is for the model to predict what will happen before the match.&lt;/p&gt;
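Concretely, the aggregation idea I have in mind looks something like this (toy numbers; the column names are my invention):

```python
import pandas as pd

# Toy match history for one team
matches = pd.DataFrame({
    "team": ["Arsenal", "Arsenal", "Arsenal", "Arsenal"],
    "possession": [60, 55, 48, 62],
})

# Rolling mean over the previous 3 matches, shifted by one so each
# row only uses information available before that match kicked off.
matches["avg_possession_pre"] = (
    matches.groupby("team")["possession"]
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
```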

&lt;p&gt;Is it possible at all? What are my options? Should I select only pre-match variables, for example recent form, the match referee, etc., or should I aggregate the in-match variables and include average possession, average shots, and average number of fouls from recent matches?&lt;/p&gt;</description>
<category>Machine Learning</category>
<guid isPermaLink="true">https://ask.ghassem.com/954/how-to-predict-from-unseen-data</guid>
<pubDate>Tue, 17 Nov 2020 16:18:28 +0000</pubDate>
</item>
<item>
<title>What are the most common data types in data science?</title>
<link>https://ask.ghassem.com/834/what-are-the-most-common-data-types-in-data-science</link>
<description>What are the main data types?</description>
<category>General</category>
<guid isPermaLink="true">https://ask.ghassem.com/834/what-are-the-most-common-data-types-in-data-science</guid>
<pubDate>Wed, 19 Feb 2020 17:28:54 +0000</pubDate>
</item>
</channel>
</rss>