I'm writing a decision tree algorithm in Python for a class. It handles both continuous and categorical values, and I'm having problems updating the dataset after choosing the best attribute.
I wrote two functions, delete_rows and delete_attribute, which on each iteration remove some of the examples and one column, respectively.
(The algorithm is based on the pseudo-code in the Russell-Norvig textbook, which should be the ID3 version.)
new_examples_list = delete_rows(examples_list, best_attr, v)
new_examples_list = delete_attribute(new_examples_list, best_attr)
I don't know numpy really well but after searching online I wrote it like this:
import numpy
def delete_attribute(examples_list, attribute):
    # Remove the column at index `attribute` (axis=1 selects columns)
    examples_list = numpy.delete(examples_list, attribute, axis=1)
    return examples_list
The problem is that when I call it, all the data in examples_list (the matrix holding the whole dataset) is converted to strings, even for the attributes that were originally floats. Since I have to use different functions for categorical and numeric values, and I check the type with isinstance, this causes problems in the following steps of the tree.
Can I solve this just by adjusting the delete_attribute function, or is it probably a bigger problem?
I hope I explained myself clearly. I'm still new to Python and this is my first time asking a question.
EDIT: I've added an example:
Say that my original data looks like this (read from a CSV file):
titles = ['A', 'B', 'C', 'D', 'Goal']
data = [[20, 15, 21, 17, 'No'],
        [40, 16, 33, 8, 'Yes'],
        [44, 40, 38, 18, 'No'],
        [18, 16, 21, 2, 'Yes'],
        [7, 12, 8, 40, 'Yes']]
The algorithm finds A to be the best attribute, with a split threshold of 19. Say that we want to see the subset of the data where A > 19.
The delete_rows function simply keeps the examples that fit this criterion, and I get:
data = [[20.0, 15.0, 21.0, 17.0, 'No'],
        [40.0, 16.0, 33.0, 8.0, 'Yes'],
        [44.0, 40.0, 38.0, 18.0, 'No']]
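For reference, here is a minimal sketch of the kind of row filter described above. The real delete_rows is not shown in this question, so the name keep_rows_above and its parameters are hypothetical. It works on plain Python lists, so the values stay as they were read (ints here, rather than the floats in the output above).

```python
def keep_rows_above(examples, attr_index, threshold):
    """Hypothetical stand-in for delete_rows: keep rows whose value in
    column `attr_index` is strictly greater than `threshold`."""
    return [row for row in examples if row[attr_index] > threshold]


data = [[20, 15, 21, 17, 'No'],
        [40, 16, 33, 8, 'Yes'],
        [44, 40, 38, 18, 'No'],
        [18, 16, 21, 2, 'Yes'],
        [7, 12, 8, 40, 'Yes']]

# Column 0 is attribute A; 19 is the threshold from the example.
subset = keep_rows_above(data, 0, 19)
for row in subset:
    print(row)
```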
When I try to use delete_attribute as shown before to delete the column of A I get this:
data = [['15.0' '21.0' '17.0' 'No']
['16.0' '33.0' '8.0' 'Yes']
['40.0' '38.0' '18.0' 'No']]
I assume that since the original data has both numerical and string values, it converts everything to strings? I'd like to keep only the last column of the result as strings. Thank you.
In this example all the attributes are numerical, but of course I also have to handle other datasets with mixed values.
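The assumption above can be checked with a small reproduction. This is only a sketch, assuming the rows reach NumPy as a list of lists (numpy.delete converts list input to an array first). The names as_strings and as_objects are made up for illustration. A regular NumPy array has a single dtype, so mixing floats and strings yields a string dtype. Passing dtype=object is one possible way to let each cell keep its Python type, at the cost of NumPy's fast numeric operations.

```python
import numpy as np

rows = [
    [20.0, 15.0, 21.0, 17.0, 'No'],
    [40.0, 16.0, 33.0, 8.0, 'Yes'],
    [44.0, 40.0, 38.0, 18.0, 'No'],
]

# Default construction: NumPy picks one common dtype for the whole array,
# which for mixed floats and strings is a Unicode string type.
as_strings = np.array(rows)

# Explicit object dtype: every cell stays the Python object it was.
as_objects = np.array(rows, dtype=object)


def delete_attribute(examples_list, attribute):
    """Return a copy of examples_list without the column at index `attribute`."""
    return np.delete(examples_list, attribute, axis=1)


trimmed_strings = delete_attribute(as_strings, 0)
trimmed_objects = delete_attribute(as_objects, 0)

print(as_strings.dtype.kind)                     # 'U' means Unicode string
print(isinstance(trimmed_strings[0, 0], str))    # floats became strings
print(isinstance(trimmed_objects[0, 0], float))  # floats preserved
```

With the object array, an isinstance check on each cell can still tell numeric attributes from categorical ones after the column is removed.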
question from:
https://stackoverflow.com/questions/66050617/numpy-delete-is-converting-float-values-into-string