Question about a particular paper on categorical data

In summary, the paper discusses four heuristics (d_m, f_m, n_x, f_x) for mapping categorical data to numerical values. The original poster can reproduce the paper's values for d_m and f_m but not those for n_x and f_x, and a respondent cannot reproduce them either. The respondent notes that formula (3.3) for n_x is inconsistent with the paper's own description of the statistic, and that the notation in formula (3.4) for f_x is unclear. The authors have been contacted, but the person responsible for the calculations could not be reached.
  • #1
ZPlayer
I am not sure this is the right forum for this -- I have a question about a particular paper:

http://www-users.cs.umn.edu/~sboriah/PDFs/ChandolaCBK2009.pdf

The authors describe four heuristics that can be derived from categorical data in order to map it to numerical values. These heuristics are d_m, f_m, n_x, and f_x. They also provide two example instances, y and z, along with the values of these quantities computed with respect to the dataset in Table 3. I can match their values exactly for d_m and f_m, but I cannot reproduce n_x and f_x.

Could someone read this paper and try to derive these values? As I read it, their equation (3.3) is a sum of the reciprocals of the arities over the set A_x (i.e. the set of mismatching attributes), but I can't reproduce -5.45 and -7.90.

Please note I already contacted the authors -- one responded that Dr. Boriah is the person responsible for these calculations but he is apparently not reachable.
 
  • #2
ZPlayer said:
I am not sure this is the right forum for this -- I have a question about a particular paper:

http://www-users.cs.umn.edu/~sboriah/PDFs/ChandolaCBK2009.pdf

What that paper proposes to do is very interesting, but will understanding it be worth dealing with its problems?

ZPlayer said:
I cannot reproduce n_x and f_x.

I can't either.

It's interesting that ##z = (a_3,b_2,c_{10},d_5)## has an attribute value ##a_3## that does not occur in the "reference" data set. I wonder if that example is supposed to emphasize that you can compute the statistics when such a situation comes up.

The formula (3.3), ##n_x = -\sum_{i \in A_x} \frac{1}{n_i}##, is not consistent with the passage in the article that says:
"The statistic ##n_x## is a function of the arity of the mismatching attributes between an instance and a reference data set. In particular, the value of the statistic is higher when the mismatching attributes have lower arity, i.e. they take fewer values."

In the formula, a lower arity ##n_i## gives a larger ##\frac{1}{n_i}## and therefore a "more negative" contribution, so the statistic would be lower instead of higher.
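
A quick numeric check of that sign argument, as a minimal sketch. The arities below are made up for illustration and are not taken from the paper's Table 3, and `n_x` is a hypothetical helper that simply evaluates formula (3.3) as printed:

```python
from fractions import Fraction

def n_x(arities):
    """Formula (3.3) as printed: minus the sum of 1/n_i over the mismatching attributes.

    `arities` holds the (assumed) arity n_i of each attribute in A_x.
    """
    return -sum(Fraction(1, n) for n in arities)

# Hypothetical arities, not the paper's data.
low_arity = n_x([2, 3])     # mismatching attributes that take few values
high_arity = n_x([10, 20])  # mismatching attributes that take many values

print(float(low_arity), float(high_arity))
print(low_arity < high_arity)  # lower arity gives the lower (more negative) value
```

Each term ##\frac{1}{n_i}## is at most 1 in magnitude, so formula (3.3) as printed can never be more negative than minus the number of mismatching attributes. If either reported value is meant to be ##n_x## for a four-attribute instance such as ##z##, that alone would rule out -5.45 and -7.90 under this reading.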

The notation in formula (3.4), ##f_x = -\sum_{i \in A_x} \left( \frac{1}{z_i} + \frac{1}{y_i} \right)##, doesn't make sense to me because ##z_i## and ##y_i## are values of categories (like "smooth" and "urban"), not numbers.
 

Related to Question about a particular paper on categorical data

1. What is categorical data?

Categorical data is a type of data that represents characteristics or qualities that can be divided into categories. Examples of categorical data include gender, race, hair color, and education level.

2. What are the different types of categorical data?

There are three main types of categorical data: nominal, ordinal, and binary. Nominal data represents categories with no inherent order, while ordinal data has a natural ordering to its categories. Binary data only has two categories, such as yes or no, or true or false.

3. How is categorical data analyzed?

Categorical data can be analyzed using various statistical methods, such as chi-square tests, contingency tables, and logistic regression. These methods help to determine if there is a significant relationship between the categories in the data.

4. What are the advantages of using categorical data?

Categorical data is useful for organizing and summarizing large amounts of data into meaningful categories. It also allows for easy comparison and visualization of data, making it easier to identify patterns and trends.

5. Can categorical data be converted into numerical data?

Yes, categorical data can be converted into numerical data using methods such as dummy coding, where each category becomes its own 0/1 indicator variable, or by creating new numerical variables derived from the categories. However, it is important to consider the context and purpose of the data before converting it, as the choice of encoding may affect the analysis and interpretation of the data.
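
As a minimal sketch of dummy (one-hot) coding in plain Python: the helper name `dummy_code` and the sample values are hypothetical, chosen only to illustrate the idea, and this is not the mapping proposed in the paper discussed above.

```python
def dummy_code(values):
    """Return the sorted category list and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    # Each row has a 1 in the column of the value's category and 0 elsewhere.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical sample data.
categories, rows = dummy_code(["urban", "rural", "urban", "suburban"])
print(categories)
for row in rows:
    print(row)
```

Each category gets its own column, so no artificial ordering is imposed on nominal values.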
