Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
720 views
in Technique[技术] by (71.8m points)

python - Unable to communicate with API

I'm trying to scrape the following website

I noticed there's an XHR POST request done to the following EndPoint during navigating between pages as you can see within the following print screen:

enter image description here

The thing which am stuck in within the POST request, i noticed there's a dynamic value came after GBK- but i don't understand from where it's generated or how to obtain it.

if you just keep navigate between pages, you will notice the value keep changes.

An update according to the below answer by Life is complex:

this is how to send a POST request to the API:

import requests

# we need the value!
url = "http://app1.nmpa.gov.cn/data_nmpa/face3/search.jsp?6SQk6G2z=GBK-Value"

# here you can add headers as you need!
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "cache-control": "no-cache",
    "Content-Type": "application/x-www-form-urlencoded",
    "Pragma": "no-cache"
}

data = {
    "tableId": "27",
    "State": [
        "1",
        "1",
        "1",
        "1",
        "1",
        "1",
        "1"
    ],
    "bcId": "152904442584853439006654836900",
    "tableName": "TABLE27",
    "viewtitleName": "COLUMN200",
    "viewsubTitleName": "COLUMN192,COLUMN199",
    "curstart": "2",  # here is the Page number
    "tableView": "进口医疗器械产品(注冿"
}

r = requests.post(url, headers=headers, data=data)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

CONCLUSION - 07-25-2021


After looking at this problem in more detail, I believe that it is NOT technically possible to use Python Requests to scrape the website and table in your question.

Which means that your question cannot be solved in the manner that you would prefer.


Why?

The website employs anti-scraping mechanisms.

The GBK values are only one part of these mechanisms. The table that you're trying to scrape has 1504 pages. A new unique GBK value is created each time you navigate from page 1 to the next page and so forth. Thus there are 1503 unique GBK value.

The site also uses a unique session management cookie for each page.

# page 1 cookie
JSESSIONID=0AC56294FE6857A236F0E68A9106E1AE.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPEbke3aQqqqm_6QLwIaUKu0tMygss.En464jhvNz1mMzbOatzmLLtv9x_xiCP6JaO_JzcbvHqtsnQYydBa6B_YjSg6sFm7cVBBOhB35_.TZuwDsbOnDinJkNwMs3AaMPtM83dP9YnogFKHpNJo5.RHMTKT6_XNPr0mxebR6stRrQ7LFfACcWqHHhbc.j6gZfZzxsgwnPE3RGP6aT9nYuMJbvK2EGrdAv0O12G03KTk_BMk.xLeEwrQq5VjyH1tB7t4wQ.jQ1geshvbDPCs8_VHCkd2.6uIag5Md.lngzeDshhSjMrmBjyy0HTqAXQ3; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583

# page 2 cookie
JSESSIONID=651AD12FD349FFB1842E08CA578EA37D.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPeKke3e7qqqm_6Q_YEqK9dBPNnJQF00YvHDMLHlJeb.4rrpTsgfwZxU0S5OXIAB2aduoOTmj7RuKIL.LUXRaRqfh5ZByuTFX3LxK1Ia3sr3V45c.PPx6Eas5EF5EkQztquzrX78QIbjrJUcQoKoOKcqgX5UuRIN0gCyGDyI6FFj.JbPhwYf65Hcx9BzDQnrlGAPHM3WGvmKf7OJnLY1SGIuxtdyVUE359Ll2lr0QJxUq1Dacqz_WsFa_ZantBbP7MklHX6J21wmDnyo6s4xCeeTYwsGq.kGUbE74Dx.QjQBCM_SiLKccTog8_EdBDg; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583

# page 3 cookie
JSESSIONID=2121D74E0EFCEC3BE104DAA2791481B6.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPeKke3e7qqqm_6wBfEGBZsTF9_uGtgepzPXNOzFh0RNtGcE1Cf4hEQNppVywcI5mk3SlLkzvNll6ovr4XmfL2Ujy3AFZR5leVY2H3_584At3GmIwmnsEjOx5v5e_lMon3AbX9t2W8UiLoK.9SBX0vgNRfkqdpyPjWKk3Zs8gQG0k3_6UwxGTvEwWkaWL8vquJgCGlvLEFTjNvd07eHiR482UfpLPFP6yAkx8Wi9pM79cL.26KE3U2L79hgBKLHyOdNyj3VKOkDsaXefNdPXd.YqT4kevShGxzMM2PuzqnuuQnW.GQ5mr9Rx8VxUjEa; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583

So you would need to acquire both the unique GBK value and the unique session management cookie for each page between 2 and 1504.

I also noted that site employs some type of latency. The first page can take some time to fully load. If you attempt to navigate to another page before this page load is complete you will get this message "请勿频繁操作!"

It's worth nothing that some pages took up to 2 minutes to load. When they didn't load the message above was displayed.

Like I previously mentioned you should consider scraping this site with selenium, which might work to bypass the anti-scraping mechanisms.

UPDATE POST (SELENIUM)- 07-27-2021


I have attempted to scrape your target website with selenium. The chromedriver continually fails to connect to the site. Even if these switches are used:

chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

I also tried using the undetected-chromedriver, but that also failed to property connect.

Even when I set a high timeout period with either driver.set_page_load_timeout() or driver.implicitly_wait() the session still fails with chromedriver.

I was able to access the website when I used selenium with the geckodriver. There are still time out issues, but adding a WebDriverWait couple with an expected_conditions seems to overcome some of the timeout issues.

Scraping this website will be a long and arduous process, because of the anti-scraping mechanisms being used.

ORIGINAL POST - 07-24-2021


First let me say, that your question doesn't have enough details to make a solid recommendation for solving your issue.

I looked into your problem. I found that your target website uses some Javascript to dynamically create the GBK value that is used each post request.

6SQk6G2z:GBK-5lkb7acLMDDxywZsCHoJagJlT50f1gw4.jaVgaBpBcGZDs1T_pcR_OPFgvOm_6oM8PfyL3L6xDPxFqgIqgwbVAEw8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ

enter image description here

This GBK value is used when navigating between the pages 1 through 1504. I noted that the value changes for each page.

enter image description here

import difflib

# page 2
a = "6SQk6G2z:GBK-5lkb7acLMDDxywZsCHoJagJlT50f1gw4.jaVgaBpBcGZDs1T_pcR_OPFgvOm_6oM8PfyL3L6xDPxFqgIqgwbVAEw8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ"

# page 1504
b = "6SQk6G2z:GBK-59tY9cXfYPiYfpgB1rj16jFZNwQuke.NUV5ZljqD6daOH4pxgaFcRE7bERjrvfoY4OTl5PAWUo70VNRIqnYOi_TQCSWzvrcCgfTtEFl_ZdMHRVLhosJLSFwHiPdVn4cXZ7VnF5xahstqJHD6EBfd71iZT8HQBmx1dssd7RWA2Gdv8lGhJbS0ZeaxIVkfK5qaO.lxHVvG_9cq4weBdHeUQlGlIWhxKFYePkTr9Jp0eN2yDTZljeX0XWWOxIjEkdj89FOqaNDB2slUE.54oC96baGe7lttoz_2AoTbjHSTjfDh.eSyT6vA6.5dP5X.4XsFVYSnYKIznIdkjTURmm3kjvGM_iQoYT3V5gAKs1c6r6cE"


s = difflib.SequenceMatcher(None, a, b, autojunk=False)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    if tag != 'equal':
        print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
            tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))

        # output 
        insert    a[14:14] --> b[14:49]       '' --> '9tY9cXfYPiYfpgB1rj16jFZNwQuke.NUV5Z'
        replace   a[15:18] --> b[50:55]    'kb7' --> 'jqD6d'
        replace   a[19:24] --> b[56:60]  'cLMDD' --> 'OH4p'
        delete    a[25:49] --> b[61:61] 'ywZsCHoJagJlT50f1gw4.jaV' --> ''
        insert    a[51:51] --> b[63:158]       '' --> 'FcRE7bERjrvfoY4OTl5PAWUo70VNRIqnYOi_TQCSWzvrcCgfTtEFl_ZdMHRVLhosJLSFwHiPdVn4cXZ7VnF5xahstqJHD6E'
        insert    a[52:52] --> b[159:243]       '' --> 'fd71iZT8HQBmx1dssd7RWA2Gdv8lGhJbS0ZeaxIVkfK5qaO.lxHVvG_9cq4weBdHeUQlGlIWhxKFYePkTr9J'
        insert    a[53:53] --> b[244:276]       '' --> '0eN2yDTZljeX0XWWOxIjEkdj89FOqaND'
        replace   a[54:55] --> b[277:291]      'c' --> '2slUE.54oC96ba'
        replace   a[56:57] --> b[292:311]      'Z' --> 'e7lttoz_2AoTbjHSTjf'
        insert    a[58:58] --> b[312:369]       '' --> 'h.eSyT6vA6.5dP5X.4XsFVYSnYKIznIdkjTURmm3kjvGM_iQoYT3V5gAK'
        delete    a[60:63] --> b[371:371]    'T_p' --> ''
        delete    a[64:74] --> b[372:372] 'R_OPFgvOm_' --> ''
        replace   a[75:84] --> b[373:374] 'oM8PfyL3L' --> 'r'
        replace   a[85:99] --> b[375:376] 'xDPxFqgIqgwbVA' --> 'c'
        delete    a[100:377] --> b[377:377] 'w8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ' --> ''

The GBK value is created with this call in the HTML of the page.

javascript:commitForECMA(callbackC,"content.jsp?tableId=27&tableName=TABLE27&tableView=杩涘彛鍖荤枟鍣ㄦ浜у搧锛堟敞鍐?&Id=60456",null)

This is the Javascript that is called.

function commitForECMA($_17, $_12, $_19) {
    request = createXMLHttp();
    request.onreadystatechange = $_17;
    if ($_19 == null) {
        _$b6(request, _$JI('ZM6r2MG'), _$JI("Op0YV"), $_12);
        request.setRequestHeader(_$JI("RACeXwDYXwcTV8Ur2"), _$JI("9wDYgwceLwDT7iCYX3Ce9FKyvHKwPFa"));
    } else {
        var $_16 = "";
        var $_11 = $_19.elements;
        var $_14 = $_11.length;
        for (var $_4 = 0; $_4 < $_14; $_4++) {
            var $_6 = _$kH($_11, $_4);
            if ($_6.type != _$JI("aQ6YPMK20") && _$kH($_6, _$JI('Cwbm7wKV')) != "") {
                if ($_16.length > 0) {
                    $_16 += "&" + $_6.name + "=" + _$kH($_6, _$JI('swbm7wKV'));
                } else {
                    $_16 += $_6.name + "=" + _$kH($_6, _$JI('8wbm7wKV'));
                }
                $_16 += _$JI("xx2J03Up2Hsl");
            }
        }
        _$b6(request, _$JI('iM6r2MG'), _$JI("IVlesYq"), $_12);
        $_16 = encodeURI($_16);
        $_16 = encodeURI($_16);
        request.setRequestHeader(_$JI("53CmOFDVz3CeXwoxBMq"), _$JI("wMbZz3CmOFDV"));
        request.setRequestHeader(_$JI("ZACeXwDYXwcTV8Ur2"), _$JI("F3UraMD2O3UpNMCgB8cT6w6QzRbenM1TTQbS2MbJBRDY9"));
    }
    request.send($_16);
    if ($_19 != null) {
        $_19.reset();
    }
}


truncated....


function createXMLHttp() {
    if (window.XMLHttpRequest) {
        return new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        var $_17 = [_$JI("5sYJ3sVanh2fJslf0woqXJ1ga"), _$JI("osYJ3sVanh2fJslf0woqXJcga"), _$JI("ZsYJ3sVanh2fJslf0woqXWnga"), _$JI("fsYJ3sVanh2fJslf0woq"), _$JI("3sK2OQbeuMCR0h2fJslf0woq")];
        for (var $_16 = 0; $_16 < $_17.length; $_16++) {
            try {
                return new ActiveXObject(_$kH($_17, $_16));
            } catch ($_19) {}
        }
        throw new Error("您的浏览器不支持访问此网页");
    }
}


truncated....


function callback() {
    if (request.readyState == 1) {
        _$_J(document.getElementById(_$JI("x3CeXwDYXwq")), '=', _$JI('3FKyXRUxEYlTW'), _$JI("EHDxnHOaB3vE5HDxnHOSNMKQGQ6xOHK2z3Kw2Qne7MCm9FKyvhbwNROg"));
    }
    if (request.readyState == 4) {
        if (request.status == 200) {
            oldContent.length = 0;
            oldContent[0] = request.responseText;
            _$_J(document.getElementById(_$JI("H3CeXwDYXwq")), '=', _$JI('OFKyXRUxEYlTW'), request.responseText);
            request = null;
        } else {
            _$_J(document.getElementById(_$JI("w3CeXwDYXwq")), '=', _$JI('eFKyXRUxEYlTW'), "<br><br><br><span style=font-size:x-large;color:#215add>请勿频繁操作!</span>");
      

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...