Can you deploy a Scrapy spider on Heroku? -- python, flask, heroku

The problem


I created a Flask app that has a Scrapy crawler embedded in it. It is currently deployed to Heroku on "hobby" dynos. The code works locally without a problem; on Heroku, however, the logs show that the spider has started, but then nothing happens.

Logs:

2021-02-16T23:28:05.389393+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.extensions.telnet] INFO: Telnet Password: be5b2621088af046
2021-02-16T23:28:05.423417+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.middleware] INFO: Enabled extensions:
2021-02-16T23:28:05.423419+00:00 app[web.1]: ['scrapy.extensions.corestats.CoreStats',
2021-02-16T23:28:05.423419+00:00 app[web.1]:  'scrapy.extensions.telnet.TelnetConsole',
2021-02-16T23:28:05.423420+00:00 app[web.1]:  'scrapy.extensions.memusage.MemoryUsage',
2021-02-16T23:28:05.423420+00:00 app[web.1]:  'scrapy.extensions.logstats.LogStats']
2021-02-16T23:28:05.485882+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
2021-02-16T23:28:05.485884+00:00 app[web.1]: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
2021-02-16T23:28:05.485884+00:00 app[web.1]:  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
2021-02-16T23:28:05.485885+00:00 app[web.1]:  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
2021-02-16T23:28:05.485885+00:00 app[web.1]:  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
2021-02-16T23:28:05.485886+00:00 app[web.1]:  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
2021-02-16T23:28:05.485886+00:00 app[web.1]:  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
2021-02-16T23:28:05.485886+00:00 app[web.1]:  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
2021-02-16T23:28:05.485887+00:00 app[web.1]:  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
2021-02-16T23:28:05.485887+00:00 app[web.1]:  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
2021-02-16T23:28:05.485888+00:00 app[web.1]:  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
2021-02-16T23:28:05.485888+00:00 app[web.1]:  'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-16T23:28:05.490533+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.middleware] INFO: Enabled spider middlewares:
2021-02-16T23:28:05.490534+00:00 app[web.1]: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
2021-02-16T23:28:05.490534+00:00 app[web.1]:  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
2021-02-16T23:28:05.490535+00:00 app[web.1]:  'scrapy.spidermiddlewares.referer.RefererMiddleware',
2021-02-16T23:28:05.490535+00:00 app[web.1]:  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
2021-02-16T23:28:05.490536+00:00 app[web.1]:  'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-16T23:28:05.492134+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.middleware] INFO: Enabled item pipelines:
2021-02-16T23:28:05.492135+00:00 app[web.1]: []
2021-02-16T23:28:05.492285+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.core.engine] INFO: Spider opened
2021-02-16T23:28:05.494705+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-16T23:28:05.496552+00:00 app[web.1]: 2021-02-16 23:28:05 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6025

My Code:

from flask import Flask, render_template, request, redirect, url_for, make_response, session
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from urllib.parse import urlparse
from uuid import uuid4
import urllib3, requests, urllib.request, urllib.parse, sys

app = Flask(__name__)
app.secret_key = uuid4().__str__()
executor = Executor(app)

http = urllib3.PoolManager()
runner = CrawlerRunner()
id = uuid4().__str__()
has_done = False

# Module-level state shared between the spider and the Flask views
list = set([])
list_validate = set([])
list_final = set([])


@app.route('/', methods=["POST", "GET"])
def index():

    if request.method == "POST":
        url_input = request.form["usr_input"]

        if url_input == '':
            return render_template('index.html', error_display="block", input=url_input)

        elif url_input != '':

            # Normalize the submitted URL to the form https://example.com/
            if 'https://' in url_input and url_input[-1] == '/':
                url = str(url_input)
            elif 'https://' in url_input and url_input[-1] != '/':
                url = str(url_input) + '/'
            elif 'https://' not in url_input and url_input[-1] != '/':
                url = 'https://' + str(url_input) + '/'
            elif 'https://' not in url_input and url_input[-1] == '/':
                url = 'https://' + str(url_input)

            try:
                response = requests.get(url)
                error = http.request("GET", url)
                if error.status == 200:
                    parse = urlparse(url).netloc.split('.')
                    base_url = parse[-2] + '.' + parse[-1]
                    start_url = [str(url)]
                    allowed_url = [str(base_url)]

                    list.clear()
                    list_validate.clear()
                    list_final.clear()
                    session["url_input"] = url_input

                    # Spider is defined inside the view so it can pick up start_url/allowed_url
                    class MyCrawler(CrawlSpider):

                        name = "Spider"
                        start_urls = start_url
                        allowed_domains = allowed_url
                        rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                        def parse_links(self, response):
                            base_url = url
                            href = response.xpath('//a/@href').getall()
                            list.add(urllib.parse.quote(response.url, safe=':/'))
                            for link in href:
                                if base_url not in link:
                                    list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                            for link in list:
                                if base_url in link:
                                    list_validate.add(link)

                    def start():
                        # Run the Twisted reactor in a background task
                        reactor.run(0)

                    def start_crawl():
                        d = runner.crawl(MyCrawler)

                        def validate_links(d):
                            # After the crawl, keep only links that return 200 with a useful content type
                            for link in list_validate:
                                error = http.request("GET", link)

                                if error.status == 200:
                                    r = urllib.request.urlopen(link)
                                    url_type = r.headers.get_content_type()
                                    if url_type == "text/html" or url_type == "application/pdf" or url_type == "application/xml":
                                        list_final.add(link)

                            if url in list_final:
                                list_final.remove(url)
                            links_count = str(len(list_final) + 1)

                            # Write the collected links to templates/file.xml
                            original_stdout = sys.stdout
                            with open('templates/file.xml', 'w') as f:
                                sys.stdout = f
                                print(f'<!--Total Links: {links_count}-->')
                                for link in list_final:
                                    print(link)
                                sys.stdout = original_stdout
                                f.close()

                            print('Finished')
                            global has_done
                            has_done = True

                        d.addCallback(validate_links)

                    executor.submit_stored(id, start_crawl)
                    executor.submit(start)
                    return redirect(url_for('crawling', id=id))

                elif error.status != 200:
                    return render_template('index.html', error_display="block", input=url_input)

            except requests.ConnectionError as exception:
                return render_template('index.html', error_display="block", input=url_input)

    else:
        return render_template('index.html', error_display="none")


@app.route('/crawling-<string:id>a')
def crawling(id):
    url_input = session["url_input"]
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True, website=url_input)
    else:
        executor.futures.pop(id)
        unique_id = uuid4().__str__()
        return redirect(url_for('validate', id=unique_id))


@app.route('/crawling-<string:id>b')
def validate(id):
    x = 5
    while x == 5:
        if has_done == False:
            url_input = session["url_input"]
            return render_template('start-crawl.html', refresh=True, website=url_input)
        elif has_done:
            return redirect(url_for('crawled', id=id))


@app.route('/crawled-<string:id>')
def crawled(id):
    global has_done
    has_done = False
    url_input = session["url_input"]
    total_links = str(len(list_final) + 1)
    return render_template('finish-crawl.html', number_links=total_links, website=url_input, file_id=id)


@app.route('/file-<string:file_id>')
def sitemap(file_id):
    template = render_template('file.xml')
    response = make_response(template)
    response.headers['Content-Type'] = 'application/xml'
    return response


if __name__ == '__main__':
    app.run(debug=True, threaded=True)
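
For reference, the pattern I keep seeing suggested for driving a CrawlerRunner from inside a Flask app is to let the crochet package manage the Twisted reactor instead of calling reactor.run() from a worker thread. I am not using crochet here, so the snippet below is only a rough sketch of that idea (the function name schedule_crawl and the timeout are my own placeholders):

# Sketch only -- assumes `pip install crochet`; not part of the app above.
from crochet import setup, run_in_reactor
from scrapy.crawler import CrawlerRunner

setup()  # starts the Twisted reactor in a background thread, once, at import time

runner = CrawlerRunner()

@run_in_reactor
def schedule_crawl(spider_cls, **kwargs):
    # runner.crawl() returns a Deferred; crochet wraps it in an EventualResult
    return runner.crawl(spider_cls, **kwargs)

# In a Flask view, something like:
#   result = schedule_crawl(MyCrawler)
#   result.wait(timeout=600)  # or poll for completion from another route instead of blocking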

I know my code isn't the best way of performing the task, but it works. My question is: how can I run Scrapy on Heroku? I have a requirements.txt that contains all the necessary packages. I'm not sure, but I think I need some kind of Heroku buildpack. I saw a tutorial on how to deploy Selenium to Heroku; in that video, the person installed certain buildpacks and configured variables to get Selenium working. I think there might be buildpacks required for Scrapy too, but I don't know what they are. Thanks in advance, everyone.

Tutorial for deploying with Selenium: https://www.youtube.com/watch?v=Ven-pqwk3ec.
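
For what it's worth, my current understanding is that a plain Flask app on Heroku only needs the default Python buildpack (detected from requirements.txt) plus a Procfile that starts a WSGI server, and that Scrapy installs from pip like any other dependency, but I have not been able to confirm whether Scrapy needs anything beyond that. A rough sketch of the deployment files I have in mind (assuming gunicorn as the server and that the code above is saved as app.py; both are my assumptions, and the package list is unpinned just for illustration):

Procfile (single line):

web: gunicorn app:app

requirements.txt:

Flask
Flask-Executor
Scrapy
gunicorn
requests
urllib3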

        
   
   

List of answers

