BK Tree implementation

Tags: python, tree, python-3.x, cython

Score: 4

Problem


I'm working on a system that can search for similar images. This involves being able to search by edit-distance, and as such I've implemented a special data-structure called a BK tree. Some notes here.

The core idea here is to be able to search for items by edit distance, in this case Hamming distance between discrete items. My hashes for this system are stored as 64 bit integers, but fundamentally they can be considered bit-fields, and are treated as such.
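For concreteness, the distance in question is just the popcount of the XOR of two hashes. A rough pure-Python sketch (toy values, not my actual image hashes):

# Rough pure-Python illustration of the Hamming distance between two
# 64-bit hashes (toy values; the real hashes come from a perceptual-hash step).
def hamming64(a, b):
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

h1 = 0b1011001011110000
h2 = 0b1011001111110100
print(hamming64(h1, h2))  # -> 2: the two hashes differ in two bit positions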

In any event, I have some test code that seems to work fairly well, and I'd greatly appreciate some input.

from libc.stdint cimport uint64_t

# Compute number of bits that are not common between `a` and `b`.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):

    cdef uint64_t x
    cdef int tot

    tot = 0

    x = (a ^ b)
    while x > 0:
        tot += x & 1
        x >>= 1
    return tot

cdef class BkHammingNode(object):

    cdef uint64_t nodeHash
    cdef set nodeData
    cdef dict children

    def __init__(self, nodeHash, nodeData):
        self.nodeData = set((nodeData, ))
        self.children = {}
        self.nodeHash = nodeHash

    # Insert phash `nodeHash` into tree, with the associated data `nodeData`
    cpdef insert(self, uint64_t nodeHash, nodeData):

        # If the current node has the same hash as the data we're inserting,
        # add the data to the current node's data set
        if nodeHash == self.nodeHash:
            self.nodeData.add(nodeData)
            return

        # otherwise, calculate the edit distance between the new phash and the current node's hash,
        # and either recursively insert the data, or create a new child node for the phash
        distance = hamming(self.nodeHash, nodeHash)
        if not distance in self.children:
            self.children[distance] = BkHammingNode(nodeHash, nodeData)
        else:
            self.children[distance].insert(nodeHash, nodeData)

    # Remove node with hash `nodeHash` and accompanying data `nodeData` from the tree.
    # Returns list of children that must be re-inserted (or False if no children need to be updated),
    # number of nodes deleted, and number of nodes that were moved as a 3-tuple.
    cpdef remove(self, uint64_t nodeHash, nodeData):
        cdef uint64_t deleted = 0
        cdef uint64_t moved = 0

        # If the node we're on matches the hash we want to delete exactly:
        if nodeHash == self.nodeHash:

            # Remove the node data associated with the hash we want to remove
            self.nodeData.remove(nodeData)

            # If we've emptied out the node of data, return all our children so the parent can
            # graft the children into the tree in the appropriate place
            if not self.nodeData:
                # 1 deleted node, 0 moved nodes, return all children for reinsertion by parent
                # Parent will pop this node, and reinsert all its children where appropriate
                return list(self), 1, 0

            # node has data remaining, do not do any rebuilding
            return False, 1, 0

        selfDist = hamming(self.nodeHash, nodeHash)

        # Removing is basically searching with a distance of zero, and
        # then doing operations on the search result.
        # As such, scan children where the edit distance between `self.nodeHash` and the target `nodeHash` == 0
        # Rebuild children where needed
        if selfDist in self.children:
            moveChildren, childDeleted, childMoved = self.children[selfDist].remove(nodeHash, nodeData)
            deleted += childDeleted
            moved += childMoved

            # If the child returns children, it means the child no longer contains any unique data, so it
            # needs to be deleted. As such, pop it from the tree, and re-insert all its children as
            # direct descendants of the current node
            if moveChildren:
                self.children.pop(selfDist)
                for childHash, childData in moveChildren:
                    self.insert(childHash, childData)
                    moved += 1

        return False, deleted, moved

    # Get all child-nodes within an edit distance of `distance` from `baseHash`.
    # Returns a set containing the data of each matching node, and an integer representing
    # the number of nodes that were touched in the scan.
    # Return value is a 2-tuple
    cpdef getWithinDistance(self, uint64_t baseHash, int distance):
        cdef uint64_t selfDist

        selfDist = hamming(self.nodeHash, baseHash)

        ret = set()

        if selfDist <= distance:
            ret |= set(self.nodeData)

        touched = 1

        for key in self.children.keys():
            if key <= selfDist+distance and key >= selfDist-distance:
                new, tmpTouch = self.children[key].getWithinDistance(baseHash, distance)
                touched += tmpTouch
                ret |= new

        return ret, touched

    def __iter__(self):
        for child in self.children.values():
            for item in child:
                yield item
        for item in self.nodeData:
            yield (self.nodeHash, item)

class BkHammingTree(object):
    root = None

    def __init__(self):
        self.nodes = 0

    def insert(self, nodeHash, nodeData):
        if not self.root:
            self.root = BkHammingNode(nodeHash, nodeData)
        else:
            self.root.insert(nodeHash, nodeData)

        self.nodes += 1

    def remove(self, nodeHash, nodeData):
        if not self.root:
            raise ValueError("No tree built to remove from!")

        rootless, deleted, moved = self.root.remove(nodeHash, nodeData)

        # If the node we're deleting is the root node, we need to handle it properly:
        # if it is, overwrite the root node with one of the values returned, and then
        # rebuild the entire tree by reinserting all the nodes
        if rootless:
            print("Tree root deleted! Rebuilding...")
            rootHash, rootData = rootless.pop()
            self.root = BkHammingNode(rootHash, rootData)
            for childHash, childData in rootless:
                self.root.insert(childHash, childData)

        self.nodes -= deleted

        return deleted, moved

    def getWithinDistance(self, baseHash, distance):
        if not self.root:
            return set()

        ret, touched = self.root.getWithinDistance(baseHash, distance)
        print("Touched %s tree nodes, or %1.3f%%" % (touched, touched/self.nodes * 100))
        print("Discovered %s match(es)" % len(ret))
        return ret

    def __iter__(self):
        for value in self.root:
            yield value
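For context, here's roughly how I exercise it (hypothetical hashes and filenames; assumes the .pyx above is compiled, e.g. via pyximport, and the classes are importable):

# Hypothetical smoke test -- the hash values are made up; the real ones come
# from the perceptual-hashing step elsewhere in the system.
tree = BkHammingTree()
tree.insert(0b1011001011110000, "image_a.jpg")
tree.insert(0b0100110100001111, "image_c.jpg")         # 16 bits away from image_a

# Look for anything within 2 bits of a new image's hash (1 bit from image_a):
print(tree.getWithinDistance(0b1011001111110000, 2))   # {'image_a.jpg'}, plus the stats printout

# Removing the root exercises the rebuild path and prints a notice:
tree.remove(0b1011001011110000, "image_a.jpg")
print(tree.getWithinDistance(0b1011001111110000, 2))   # set()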

This is written in cythonized Python, for performance reasons (it was slow in pure Python). Right now, it's pretty speedy (a 4.1M-item tree builds in about 20 seconds), but I have a sneaking suspicion I'm missing something, particularly as I still don't deeply understand metric spaces.

I think the tree is implemented correctly, but it wouldn't surprise me if I'm missing something obvious.

           
       
       

Answers

Score: 2


I haven't really checked for correctness, but here are some general style points. I haven't taken the time to actually understand the algorithm here; these are surface-level opinions. If you want a deeper analysis of the code's correctness, I suggest you provide a testing harness (just something to pyximport and run the code), so I can get a look at how it's used.

I don't have substantial criticism of the code; it's mostly really neat and readable.

First: docstrings. Instead of:

# Compute number of bits that are not common between `a` and `b`.
# return value is a plain integer
cdef uint64_t hamming(uint64_t a, uint64_t b):

write

cdef uint64_t hamming(uint64_t a, uint64_t b):
    """
    Compute number of bits that are not common between `a` and `b`.
    return value is a plain integer
    """

If you're actually targeting 3.4 only, don't write (object) in the inheritance list. If you're potentially wanting 2.x compatibility as well, though, it's good to keep it.
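That is, for a 3.x-only build the declaration can simply be:

cdef class BkHammingNode: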

You have

    cdef set nodeData
    cdef dict children

It's worth noting that these aren't much faster than bog-standard untyped attributes, but as long as you're aware of that, they're a good choice of datatype. This is actually why I'd expect PyPy to give better speed advantages here.

Instead of self.nodeData = set((nodeData, )), just write self.nodeData = {nodeData}.

In remove you have cdef uint64_t deleted = 0, but only add to it once. Giving this a type is pointless, especially because your return casts it to a Python type. A similar but lesser concern exists for moved.
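In other words (a sketch of what I mean, untested):

        # plain untyped locals are enough; the values end up in a Python tuple anyway
        deleted = 0
        moved = 0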

Instead of self.children.pop(selfDist) do del self.children[selfDist]. pop is meant for if you use the return value.

You write:

        ret = set()

        if selfDist <= distance:
            ret |= set(self.nodeData)

It seems to me that

        ret = set()

        if selfDist <= distance:
            ret = set(self.nodeData)

would be better. If not, try ret.update(self.nodeData).

You don't need to call .keys() here:

        for key in self.children.keys(): 

This:

            if key <= selfDist+distance and key >= selfDist-distance: 

should just be

            if selfDist-distance <= key <= selfDist+distance: 

or even

            if abs(key - selfDist) <= distance: 

I don't know how this would work with selfDist being a uint64_t and key being a Python int, but if you just remove the cdef uint64_t it'll likely work.
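I haven't compiled this, but the risk I'm gesturing at is unsigned wrap-around: if selfDist stays a uint64_t, then selfDist - distance wraps to a huge value whenever distance > selfDist, and the lower-bound check silently prunes children it shouldn't. A pure-Python illustration with toy numbers:

# Toy illustration of what C uint64_t arithmetic does to `selfDist - distance`
selfDist, distance, key = 1, 2, 3

wrapped = (selfDist - distance) % 2**64     # 18446744073709551615, not -1
print(key >= wrapped)                       # False -> this child gets skipped
print(abs(key - selfDist) <= distance)      # True with ordinary Python ints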

You should probably avoid class variables like root = None and just add them to __init__. Class variables are basically global variables, and although having an immutable global as fallback works, it goes counter to how I'd expect the language to be used.
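Concretely, something like this (sketch):

class BkHammingTree(object):
    def __init__(self):
        self.root = None    # instance attribute rather than a class variable
        self.nodes = 0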

 
 
         
         

Related questions

5   Finding cycles in a graph that pass through a vertex at most k times
1   Cythonic Laplacian solving on a 3D regular grid
4   Speeding up a Cython program
4   Calculating the distance squared between all vertex pairs of a number of 2D polygons
2   Fastest, most efficient random bit indexing in Python
3   Efficiently writing string comparison functions in Python
6   Fastest possible Cython for Black-Scholes algorithm
16  Using lots of regex substitutions to tokenize text
3   Where is the bottleneck in my Cython code?
6   Optimizing a raycasting/voxelspace algorithm in Pygame



