Scraping DeviantArt

Playing with Python

DeviantArt is the biggest social network for artists. There I follow thousands of very talented artists. The only thing missing is an easy way to download multiple artworks at once, so I wrote a small Python script to fix that.

Dependencies

First of all, let's import all dependencies. selenium loads all the dynamic elements generated with JavaScript on the page, and BeautifulSoup parses the HTML. To speed up the process, we'll use Thread and Queue to download the images in parallel with requests.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import WebDriverException
from bs4 import BeautifulSoup
from queue import Queue
from threading import Thread, Lock
import collections
import datetime
import time
import os
import pathlib
import requests
import subprocess

Next, let's initialize our threads and variables.

images  = []
img_num = 0
workers = 20
threads = []
tasks   = Queue()
lock    = Lock()

And print some useful information at the beginning.

def welcome_message():
    now = datetime.datetime.now()
    today = now.strftime("%A • %B %e • %H:%M • %Y")
    print('\n  DeviantArt Scraper')
    print('\n  DATE:  ' + today)

Get the Selenium Driver

Our driver will use headless Chromium. If you want to see what Selenium does, comment out options.add_argument('--headless').

def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=options)
    return driver
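
If you want a quick sanity check that the headless driver works before running the full scraper, you can try something like this (a hypothetical snippet, assuming Chromium and chromedriver are installed):

# Hypothetical smoke test for get_driver()
d = get_driver()
d.get('https://www.deviantart.com')
print(d.title)  # prints the page title without opening a browser window
d.close()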

Get the artist's username.

def get_username(d):
    global username
    html = d.page_source
    soup = BeautifulSoup(html, 'html.parser')
    username = soup.find(class_='gruserbadge').find('a').get_text()

All artwork a user has uploaded can be found on their gallery page, which contains thumbnails for each image and is located at https://<username>.deviantart.com/gallery/. Following a thumbnail's link takes you to the designated page for that image at https://<username>.deviantart.com/art/<image-id>. From there you can download the image in its original resolution. Our goal is to get the image in its highest resolution.

To achieve this we need to scroll the gallery page down until it reaches the bottom, collecting the thumbnail links along the way. Before putting the links in our queue we'll also discard all duplicates. We'll use heywyd as our test artist.

def get_thumb_links(q):
    d = get_driver()
    d.get('https://heywyd.deviantart.com/gallery/')
    unique_img = scroll_page_down(d)
    time.sleep(0.5)
    for img in unique_img:
        q.put(img)
    global expected_img_num
    expected_img_num = str(len(unique_img))
    get_username(d)
    print('  Unique images found = ' + expected_img_num)
    print('  Artist = ' + username + "\n")
    time.sleep(0.5)
    d.close()

Scroll the page down and collect the links.

def scroll_page_down(d):
    SCROLL_PAUSE_TIME = 1.5
    # Get the initial scroll height
    last_height = d.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom
        d.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load
        time.sleep(SCROLL_PAUSE_TIME)
        # Calculate the new scroll height and compare it with the last one
        new_height = d.execute_script("return document.body.scrollHeight")
        # Collect the thumbnail image links
        im = d.find_element_by_class_name('folderview-art')
        links = im.find_elements_by_class_name('torpedo-thumb-link')
        for link in links:
            images.append(link.get_attribute('href'))
        # Remove duplicates
        unique_img = list(set(images))
        time.sleep(0.5)
        # Break when the end is reached
        if new_height == last_height:
            break
        last_height = new_height
    return unique_img

Get the Full Resolution Images

Now that we have the links to the designated page for each image, we can find the one with the highest resolution and download it with requests. There are two cases:

  • There is a green download button on the page, under class="dev-page-download". Following it opens a new page containing the original image.
  • There is no such button. In that case the link to the original image can be found on the same page, under class="dev-content-full".

The original images are found under https://orig00.deviantart.net/.

def get_full_image(l):
    s = requests.Session()
    h = {'User-Agent': 'Firefox'}
    soup = BeautifulSoup(s.get(l, headers=h).text, 'html.parser')
    title = ''
    link = ''
    try:
        # Case 1: the page has a download button
        link = soup.find('a', class_='dev-page-download')['href']
    except TypeError:
        try:
            # Case 2: grab the full image embedded in the page
            link = soup.find('img', class_='dev-content-full')['src']
            title = soup.find('a',
                              class_='title').text.replace(' ', '_').lower()
        except TypeError:
            # The page is age restricted; retry once if the first attempt fails
            try:
                link = age_restricted(l)
            except (WebDriverException, AttributeError):
                link = age_restricted(l)
    req = s.get(link, headers=h)
    time.sleep(0.1)
    download_now(req, title)
    # Return the URL and title as a named tuple
    url = req.url
    ITuple = collections.namedtuple('ITuple', ['u', 't'])
    it = ITuple(u=url, t=title)
    return it

Some images are age restricted and we are prompted to enter our age. To fill in and submit the form we need to use Selenium again.

def age_restricted(l):
    d = get_driver()
    d.get(l)
    time.sleep(0.8)
    # Make sure the age form has rendered
    d.find_element_by_class_name('datefields')
    d.find_elements_by_class_name('datefield')
    # Fill in a date of birth and accept the terms
    d.find_element_by_id('month').send_keys('01')
    d.find_element_by_id('day').send_keys('01')
    d.find_element_by_id('year').send_keys('1991')
    d.find_element_by_class_name('tos-label').click()
    d.find_element_by_class_name('submitbutton').click()
    time.sleep(1)
    # Follow the download button to the original image
    img_lnk = d.find_element_by_class_name('dev-page-download')
    d.get(img_lnk.get_attribute('href'))
    time.sleep(0.5)
    link = d.current_url
    d.close()
    return link

Image names end with a random sequence of seven numbers and letters. We'll format the filename to exclude that sequence and the rest of the URL, leaving only the original name. Images not obtained through the download button usually have meaningless names; for those we take the artwork's title from its designated page instead.

def name_format(url, title):
    if '/' in url:
        # Strip the path and the random suffix, keep the name and extension
        name = url.rsplit('/', 1)[1]
        p1 = name.rsplit('-', 1)[0]
        p2 = name.rsplit('-', 1)[1].split('.')[1]
        name = p1 + '.' + p2
    if title != '':
        name = title + '.png'
    return name
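
For example (the URLs below are made up, purely to illustrate the formatting):

# Hypothetical URLs, for illustration only
print(name_format('https://orig00.deviantart.net/0000/f/2018/my_artwork-d1abc23.png', ''))
# my_artwork.png
print(name_format('https://orig00.deviantart.net/0000/f/2018/untitled-d9xyz87.jpg', 'blue_sky'))
# blue_sky.png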

Finally, we can download the images. They will be saved in a new folder with the name <username>.deviantart.com.

def download_now(req, title):
    url = req.url
    name = name_format(url, title)
    folder = '{}.deviantart.com'.format(username)
    pathlib.Path(folder).mkdir(parents=True, exist_ok=True)
    with open(os.path.join(folder, name), 'wb') as file:
        file.write(req.content)

Saving the original full-resolution image URLs to a file may be useful if we ever want to redownload the images. They are saved under <username>-gallery.txt.

def save_img(url):
    try:
        with open('{}-gallery.txt'.format(username), 'a+') as file:
            file.write(url + '\n')
    except OSError:
        print('A write error occurred.')

To wrap everything up we need to create a worker thread. Each thread is responsible for downloading one image at a time. How many images we download simultaneously depends on how many threads we decide to use - in our case 20.

def worker_thread(q, lock):
    while True:
        link = q.get()
        if link is None:
            # Sentinel value, stop the worker
            break
        p = get_full_image(link)
        url = p.u
        title = p.t
        name = name_format(url, title)
        with lock:
            global img_num
            img_num += 1
            save_img(url)
            print('Image ' + str(img_num) + ' - ' + name)
        q.task_done()

Run the Script

All that's left is to define the main function. It will display the welcome message and start the 20 worker threads. After the downloads are completed it will report the time elapsed, the number of images downloaded, and the space they occupy.

def main():
    welcome_message()  # Display the welcome message
    start = time.time()
    get_thumb_links(tasks)  # Fill the queue

    # Start the threads
    for i in range(workers):
        t = Thread(target=worker_thread, args=(tasks, lock))
        t.start()
        threads.append(t)

    # When done, close the worker threads
    tasks.join()
    for _ in range(workers):
        tasks.put(None)
    for t in threads:
        t.join()

    # Print stats
    folder_size = subprocess.check_output(['du', '-shx',
             '{}.deviantart.com/'.format(username)]).split()[0].decode('utf-8')
    print('\n  Total Images: ' + str(img_num) + ' (' + str(folder_size) + ')')
    print('  Expected: ' + expected_img_num)
    end = time.time()
    print('  Elapsed Time: {:.4f}\n'.format(end - start))
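
Note that du is a Unix tool, so the stats step won't work on Windows. A portable alternative (my own sketch, not part of the original script) computes the folder size in pure Python:

# Portable alternative to `du -shx` (a sketch, not in the original script)
def folder_size_bytes(path):
    # Walk the tree and sum the size of every file
    total = 0
    for root, dirs, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total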

Run main.

if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        print()

View the whole script here. You can also find it on GitHub.

Disclaimer: All art you download using this script belongs to its rightful owner. Please support the artists by purchasing their art.


Thank you for reading my article.