Compress data by 84% and index over the compressed data: a new approach, the PiXiu method

A few months ago I came across the question "How would you rate the domestic high-performance storage engine TerarkDB?", whose author said it can both compress and index. I pride myself on being pretty handy with B-Trees and trees of all kinds, but this idea really was novel to me! So I brainstormed a budget TerarkDB of my own, mixing three algorithms: suffix tree, crit-bit tree, and scapegoat tree. @雷鵬, what do you think? This is also my first C++ project, and it is genuinely much harder than working in a garbage-collected language. I had originally only planned to learn C, and started the project in C, so by the time I switched to C++ I wasn't taking advantage of RAII and the like.
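
(If the crit-bit tree is new to you: each internal node records only the position of the first bit at which its two subtrees diverge, and the full keys live only in the leaves, so a lookup tests one bit per node on the way down. Below is a minimal, textbook-style sketch of that idea in C++; it is an illustration of the general technique, not PiXiu's actual data layout.)

#include <cstdint>

// Illustrative crit-bit internal node; the real PiXiu structures differ.
struct CritBitNode {
    CritBitNode *child[2];  // child[0] taken when the critical bit is 0
    uint32_t byte_pos;      // index of the first byte where the subtrees differ
    uint8_t bit_mask;       // single-bit mask marking the differing bit in that byte
};

// Which subtree a query key belongs to: exactly one bit test per internal node.
inline int direction(const CritBitNode *node, const uint8_t *key, uint32_t key_len) {
    // Keys shorter than byte_pos behave as if zero-padded.
    uint8_t byte = node->byte_pos < key_len ? key[node->byte_pos] : 0;
    return (byte & node->bit_mask) ? 1 : 0;
}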

For the technical description and the implementation, jump straight to JimChengLin/PiXiu. Here I will mostly post some picture material and explain how I measured the explosive 84% compression rate.

  • gcov coverage: 100%

(v1 is the coverage of a referenced library.)

  • Randomized testing shows no memory leaks or out-of-bounds accesses

Next, let me explain how I measured the 84% compression rate.

Using the Python script below, I crawled 10,000 pages, starting from QQ's homepage as the root:

import re
from asyncio import get_event_loop, ensure_future, sleep
from collections import deque
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

HEADERS = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/55.0.2883.95 Safari/537.36'}
ascii_re = re.compile('[\u0000-\u007F]+')   # keep only the ASCII portion of a page
space_re = re.compile(r'\s+')               # collapse runs of whitespace

counter = 0
done = set()
q = deque(maxlen=10000)
q.append('http://www.qq.com/')
done.add('http://www.qq.com/')

loop = get_event_loop()
thread_pool = ThreadPoolExecutor(30)


async def get_page(url):
    def req():
        try:
            return requests.get(url, headers=HEADERS, timeout=5).text
        except requests.RequestException:
            return ''
        except Exception as e:
            print(e)
            return ''

    return await loop.run_in_executor(thread_pool, req)


async def job():
    global counter
    try:
        url = q.popleft()
        text = await get_page(url)
        if text == '':
            return

        # Save "url::ascii_body" (truncated to 60000 bytes), one file per page.
        res = ''.join(ascii_re.findall(text))
        res = space_re.sub(' ', res)
        with open(f'dat/{counter}.txt', 'w', encoding='ascii') as f:
            f.write(url)
            f.write('::')
            f.write(res[:60000])
        counter += 1

        # Collect up to 40 outgoing links per page for further crawling.
        soup = BeautifulSoup(text, 'html.parser')
        i = 0
        for link in soup.find_all('a'):
            if i >= 40:
                break
            href = link.get('href')
            if href is not None \
                    and (href.startswith('http://') or href.startswith('https://')) \
                    and href[:30] not in done:
                i += 1
                done.add(href[:30])
                q.append(href)
        print(f'OK {counter}th')
    except Exception as e:
        print(e)


async def main():
    jobs = []
    await job()
    while q:
        if counter >= 200_0000:
            break
        if len(jobs) < 30:
            jobs.append(ensure_future(job()))
        else:
            jobs[:] = [i for i in jobs if not i.done()]
            if len(jobs) < 30:
                continue
            else:
                await sleep(1)


loop.run_until_complete(main())

That produced the data set, one text file per page.

Then I wrote it into PiXiu with this C++ code:

PiXiuCtrl ctrl;
ctrl.init_prop();
long total_size = 0;
long total_diff = 0;
for (int i = 0; i < 10000; ++i) {
    std::ifstream ifs(std::to_string(i) + ".txt");
    std::string cmd((std::istreambuf_iterator<char>(ifs)),
                    (std::istreambuf_iterator<char>()));
    auto pos = cmd.find("::");
    auto k = cmd.substr(0, pos);
    auto v = cmd.substr(pos);
    ctrl.setitem((uint8_t *) k.c_str(), (int) k.size(),
                 (uint8_t *) v.c_str(), (int) v.size());

    int diff = (int) cmd.size()
               - ctrl.st.cbt_chunk->getitem(ctrl.st.local_chunk.used_num - 1)->len;
    std::cout << "Curr Saving is " << diff << std::endl;
    total_diff += diff;
    std::cout << "Total Saving is " << total_diff << std::endl;
    total_size += cmd.size();
    std::cout << "Total Size is " << total_size << std::endl << std::endl;
}
ctrl.free_prop();

and out came the 84% compression rate!
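
(For reference, the 84% figure is presumably the accumulated saving divided by the raw input size, i.e. total_diff / total_size from the loop above. A minimal sketch of that arithmetic, using made-up example numbers:)

#include <iostream>

// Hypothetical helper, not part of PiXiu: express the accumulated saving as a
// percentage of the raw bytes fed to setitem().
static double saving_percent(long total_diff, long total_size) {
    return 100.0 * static_cast<double>(total_diff) / static_cast<double>(total_size);
}

int main() {
    // Example figures only: 84 MB saved out of 100 MB of raw crawled text.
    std::cout << saving_percent(84'000'000, 100'000'000) << "%" << std::endl;  // prints 84
    return 0;
}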

This demonstrates two things:

1. Suffix trees really are powerful.

2. HTML is wildly redundant! CDNs ought to buy my code (it is released under the BSD license, but donations are welcome~) to cut their memory usage. (A quick way to see this redundancy for yourself is sketched right after this list.)
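
To get a rough feel for point 2, here is a quick, completely unofficial check: index every 64-byte window of one crawled page, then count how many non-overlapping chunks of another page already occur in it. The file names dat/0.txt and dat/1.txt are simply whatever the crawler above produced, and the chunk size is an arbitrary choice.

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

// Read an entire file into a string.
static std::string read_file(const std::string &path) {
    std::ifstream ifs(path);
    return {std::istreambuf_iterator<char>(ifs), std::istreambuf_iterator<char>()};
}

int main() {
    const size_t chunk = 64;                 // arbitrary window size
    std::string a = read_file("dat/0.txt");  // any two files produced by the crawler
    std::string b = read_file("dat/1.txt");

    // Index every 64-byte window of page A.
    std::unordered_set<std::string> windows;
    for (size_t i = 0; i + chunk <= a.size(); ++i)
        windows.insert(a.substr(i, chunk));

    // Count the non-overlapping chunks of page B that already appear in A.
    size_t hits = 0, total = 0;
    for (size_t i = 0; i + chunk <= b.size(); i += chunk, ++total)
        if (windows.count(b.substr(i, chunk)))
            ++hits;

    std::cout << "chunks of B also present in A: " << hits << " / " << total << std::endl;
    return 0;
}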

I have also put together a simple little demo program for everyone to play with.
